Enabling Content Dissemination Using Efficient and Scalable Multicast (Extended Version)

Tae Won Cho1, Michael Rabinovich2, K.K. Ramakrishnan3, Divesh Srivastava3, Yin Zhang1
1 University of Texas at Austin, {khatz, yzhang}@cs.utexas.edu
2 Case Western Reserve University, [email protected]
3 AT&T Labs–Research, {kkrama, divesh}@research.att.com

ABSTRACT

Multicast is an approach that uses network and server resources efficiently to distribute information to groups. As networks evolve to become information-centric, users will increasingly demand publish-subscribe based access to fine-grained information, and multicast will need to evolve to (i) manage an increasing number of groups, with a distinct group for each piece of distributable content; (ii) support persistent group membership, as group activity can vary over time, with intense activity at some times, and infrequent (but still critical) activity at others. These requirements raise scalability challenges that are not met by today’s multicast techniques. In this paper, we propose the MAD (Multicast with Adaptive Dual-state) architecture to provide efficient multicast service at massive scale. MAD can scalably support a vast number of multicast groups, with varying activity over time, based on two key novel ideas: (i) decouple group membership from forwarding information, and (ii) apply an adaptive dual-state approach to optimize for the different objectives of active and inactive groups. We focus on the scalability characteristics of MAD and demonstrate through analysis, simulation and implementation that the architecture achieves high performance and efficiency.

1. INTRODUCTION

Large-scale information dissemination has increasingly become a dominant role of the Internet. Diverse applications like financial markets, access to multimedia content and large-scale software updates have hastened the migration from the point-to-point communication of information between a single producer and consumer to multicast-based methods, which use network and server resources efficiently to disseminate information from producers to groups of interested consumers. As the needs of these applications continue to grow, and newer information-intensive applications like online multiplayer games and cooperative health-care become pervasive, users will increasingly demand continuous access to information from a multitude of distributed information producers. Multicast, whether it is IP multicast or overlay multicast, is a key enabler for large-scale information dissemination.

Multicast as a technology has been around for a long time. Some of the past barriers to widespread use have been the lack of support by carriers and (possibly as a consequence) a lack of application demand for multicast. However, multicast is seeing a resurgence, especially as applications sending large amounts of data (e.g., video) have been growing. Efficiency in transporting the content has become a greater concern, and multicast clearly offers significant advantages to the content provider as well as the service provider. Scalability has become the dominant concern with multicast. This comes in two dimensions.

• The first is the state that has to be maintained in routers (which may have limited table space). Since group membership may be long-lived, and the criticality of the information disseminated may be independent of the level of group activity, the membership of the multicast group needs to be maintained persistently to support timely information dissemination.

• The second is the number of groups that the underlying architecture can support. Given the increasing amount of electronic content and the need to ensure that only relevant information is disseminated, multicast will need to manage an ever increasing number (e.g., billions or even hundreds of billions) of fine-granularity groups, with a distinct group for each piece of distributable content.

With the increased use of multicast, it is desirable that we have a way of overcoming the scalability challenges for multicast. In this paper, we propose the MAD (Multicast with Adaptive Dual-State) architecture to provide multicast capability at massive scale, while also being efficient. We focus on the scalability characteristics of MAD and demonstrate through analysis, simulation and implementation that MAD can support hundreds of billions of groups with hardware that is representative of today’s commercial platforms. At the same time, MAD achieves high performance and efficiency, while exploiting the wide range of activity likely to be seen across multicast groups.

Large-scale dissemination of information is of course enabled by information aggregators. However, there are a variety of drawbacks from depending purely on information aggregators, mainly in terms of timeliness and coverage. Timeliness depends on how often the information from the producers gets updated at the aggregator; given the dynamic nature of information producers, it is almost impossible for all relevant information to be available at the aggregator in a timely manner. Coverage depends on the set of information producers chosen by the aggregator; given the vast number of information producers, it is infeasible for aggregators to provide complete coverage. These drawbacks of information aggregators make it desirable to enable the information producers themselves to make their information available over the information-centric network to any and all consumers, on an as-needed basis. Information-centric networking is thus the key next step for a communication network to become a large-scale utility for information access and dissemination. Efficiency and scalability of multicast is a key enabler of such a networking paradigm.

1.1 Related Work and Limitations

Supporting fine-granularity information-centric multicast raises many challenges that are not met by today’s IP and overlay multicast technologies. IP multicast has focused on efficient forwarding of information to a large active group of recipients, with the goal of efficient lookup for forwarding. IP multicast-style approaches (at the network layer [2] or at the application layer with “overlay multicast” [4, 5]) try to keep a relatively small amount of state (a limited number of groups and the associated interfaces downstream with recipients for the group). However, this state is maintained at every node in the multicast tree of the group for efficient forwarding. Thus, maintaining state is expensive, and these existing models for multicast use considerable control overhead (periodic refresh and pruning) to try to minimize the amount of state retained. IP multicast-style approaches are inappropriate for our problem, for several reasons.

• First, IP multicast-style approaches are appropriate for a relatively small number of groups, but are not feasible for the massive scale we consider here (billions of groups) with reasonable amounts of memory at individual network nodes.

• Second, when groups are long-lived, but have little or no activity over long periods of time, maintaining the membership state in IP multicast-style approaches requires a lot of control overhead (relative to the activity) to keep it from being aged out.

Recently, Ratnasamy et al. [11] proposed Free Riding Multicast (FRM), which speeds up forwarding and reduces routing state by caching all source-based tree edges of some groups at the sources and including a Bloom filter encoding all the edges in the message itself. This allows other nodes to forward the message without storing any routing state, using only the information in the message header itself, and to quickly determine the next hops just by checking which of their interfaces are included in the Bloom filter. However, FRM does not effectively meet the needs of groups with large membership sets because this can result in an excessively large packet header to encode the identity of all the members that are recipients of the message (even with the help of the Bloom filter). Moreover, FRM can also impact forwarding efficiency negatively.

There have been several efforts to reduce multicast forwarding state in routers. REUNITE [15] reduces forwarding state at the data plane by keeping forwarding entries only at branching routers, while non-branching routers use unicast routing to forward traffic and keep the multicast control table in the control plane. Hop-by-hop Multicast [6] starts with ideas similar to those of [15], but considers asymmetric unicast routing and adopts source-specific channel abstractions. However, the number of branching routers can grow as the group size increases in these schemes, increasing the multicast state. Also, multicast forwarding tables are maintained as soft state, which entails high control overhead for groups with little data activity. Explicit multicast (Xcast) [3] focuses on a large number of small multicast groups. Xcast encodes the list of destination nodes into every packet, so it does not require any address allocation or forwarding state at routers. Instead, multicast routers have to parse the protocol header of every packet at every hop, which limits support for large groups.

At the other extreme, to maintain state on a more permanent basis (and address the second limitation, above, of IP multicast-style approaches), “hard state” protocols such as ATM have been designed. State is retained unless explicitly torn down by the end-points. Control overhead is minimal, but the nodes in the network have to potentially keep a lot of state around. These approaches also face severe limitations for scaling in the number of groups and the state associated with each group; hence they are inappropriate for our problem.

We seek improvements over these traditional approaches. In contrast to IP multicast-style approaches, we wish to minimize the amount of control overhead associated with keeping state up over a long time, especially when groups are inactive. However, for active groups, we wish to take advantage of the structures that IP multicast designs have adopted. Thus, we seek the best of both worlds: forwarding efficiently (a la IP multicast) when information is frequently generated, while also enabling the membership of a group to scale to large numbers where the group membership may be long-lived. To this end, in this paper, we describe Multicast with Adaptive Dual-state (MAD), our novel architecture that can scalably support a vast number of multicast groups with varying activity over time, on today’s commercial hardware in an efficient and transparent manner.

1.2 Contributions and Outline

This paper makes the following key contributions.

• First, we propose the MAD architecture that separates group membership state from the information forwarding state. We show that group membership state can be maintained scalably using a distributed, hierarchical membership tree (MT).

• Second, MAD treats active and inactive groups differently, based on a recognition that a predominant number of MAD groups are expected to be inactive at any given time. Active groups use IP multicast-style dissemination trees (DT) for efficient data forwarding while inactive groups use MTs for this purpose, without affecting overall forwarding efficiency.

• Third, the MAD protocol ensures a seamless transition between the use of the DT and the MT for information forwarding, with no end-system (application or user) participation in the determination of when a group transitions between being active and inactive.

• Finally, we conducted a detailed analytical and experimental evaluation of the MAD architecture.

  ◦ Our analysis demonstrates the scalability of MAD in maintaining group membership state.

  ◦ Our large-scale simulations show that MAD achieves both forwarding efficiency and scalability in terms of group membership state maintenance.

  ◦ We built a prototype implementation of MAD using FreePastry [13]. We evaluated this prototype implementation in the Emulab testbed [17] to demonstrate the feasibility of scaling the MAD architecture.

The rest of this paper is organized as follows. Section 2 gives an overview of the MAD architecture, motivated by two key design tradeoffs. The details of the MAD protocols are provided in Section 3. An analysis of the scalability of the MAD architecture is presented in Section 4. Experimental results based on our prototype system are presented in Section 5. We conclude in Section 6.

2. MAD ARCHITECTURE & TRADEOFFS

MAD seeks to provide efficient multicast service to a billion users and hundreds of billions of multicast groups (with potentially long-lived membership information) on today’s commercial hardware platforms. In this section, we provide an overview of our architecture to realize this ambitious goal, along with the core tradeoffs that we exploit in the design of MAD.

While we envisage MAD to begin as an overlay multicast service, we believe that the scalability and efficiency it offers for large-scale information dissemination will eventually result in the architecture migrating to the underlay. The primary barrier to such an evolution (e.g., to IP multicast) is the current address space constraints in network layer protocols such as IPv4. While we believe that these issues will be overcome in the future, it is beyond the scope of this paper to address them. However, we believe such a constraint should not preclude the creation of a more scalable architecture that is applicable for the future Internet.

Publish-subscribe multicast services that MAD is targeting must support user profile management, authentication and fine-grained access control, among other functions. In this paper, to focus on the routing aspect of the problem (which is challenging in itself), we explicitly decouple user management from the multicast service. Specifically, we assume the environment illustrated in Figure 1. The MAD multicast service overlay consists of logical overlay routers, which reside on (or are owned by a provider of) the physical overlay routers. When a physical overlay router fails, all the logical overlay routers that it owns can be taken over by other physical overlay routers (see Section 3.5).

User management functionality is concentrated in the external subscription manager, which maintains subscriptions for all users connecting at a given MAD site. The subscription manager partitions all its subscriptions among the logical overlay routers owned by the site and initiates or cancels group subscriptions with corresponding logical overlay routers on behalf of the end users. Thus, every logical overlay router serves a single aggregated local subscriber representing all users assigned to it by the subscription manager. From the perspective of a MAD overlay router, no knowledge of end users is required and the only entities an overlay router communicates with are other MAD overlay routers and its single aggregated local subscriber.

Figure 1: MAD environment. (Diagram labels: MAD Site, MAD Communication Module, Logical MAD Overlay Routers, Physical MAD Overlay Router, Subscription Manager, Subscription DB, End Users.)

Figure 2: Publishing characteristics of RSS feeds. (a) Publishing rate (number of updates per month) against percent of publishers. (b) CDF of feed updates: percent of updates against percent of publishers.

2.1 MAD Architecture Overview

Twenty years of networking research has produced efficient mechanisms for multicast communication. In order to deliver data to a multicast group, two pieces of information are required: (i) group membership, i.e., which end hosts are members of the group, and (ii) forwarding information, i.e., how to reach these group members. Most existing multicast solutions combine group membership maintenance and forwarding route discovery into a single protocol. Unfortunately, these have well-known scalability problems with regard to the number of groups and group members.

To achieve both scalability and efficiency, we observe that the amount of data traffic generated by different multicast groups often exhibits a skewed distribution in practice. Specifically, despite the large total number of groups, we expect most (e.g., 90%) groups to generate relatively infrequent and/or small amounts of data traffic. Meanwhile, we expect the small fraction (e.g., 10%) of active groups to account for the vast majority of data traffic. Using the measurement data from [9], we extract the publishing behavior of 429 RSS feeds. Figure 2(a) illustrates that only 5% of the groups publish more than 100 updates/month, whereas the median update rate is below 10 updates/month. Meanwhile, 75% of all feed updates come from only 10% of the groups, as shown in Figure 2(b). Consequently, the overall data forwarding efficiency is likely to be dominated by the forwarding efficiency of active groups, whereas the overall state requirement and control overhead are largely determined by the state requirement and control overhead of the inactive groups.

MAD is based on a recognition that there is a continuum in terms of activity for groups. Hence, the MAD architecture treats active groups and inactive groups differently, achieving both efficiency and scalability. We leverage two key ideas: (i) decoupling group membership and forwarding information, and (ii) applying a novel adaptive dual-state approach to optimize for different objectives for active and inactive groups.

• MAD uses a hierarchical, distributed state management approach for efficiently storing group membership state at very large scale, resulting in one to two orders of magnitude state reduction compared to CBT. A small number of logical overlay routers are organized in a membership tree rooted at a core router for each group, and maintain state for that group. Except for the core, a node at each level in the membership tree (i.e., a root of a subtree) is selected so that it contains at least a minimum number of subscribers for that group in the subtree. Messages to an inactive group are forwarded along the membership tree.

• MAD can use any existing multicast protocol for efficient forwarding; it just needs a small hook to interface with such a protocol (see Section 3.4.3). Our specific instantiation of MAD uses the Core Based Tree (CBT) [2] protocol for efficient forwarding for active groups. We refer to the CBT of an active group as the dissemination tree, in contrast to the membership tree that MAD maintains for every group.

The MAD membership tree is created on a per-group basis (as opposed to applying a centralized scheme or having a single shared tree across all the groups). The root of each tree is selected by computing a hash function of the group ID, thus avoiding the need for a separate resolution process to map a group to its associated root. The group’s root then becomes responsible for monitoring the group activity and changing its state between active and inactive.

To minimize control overhead, the membership tree is constructed on top of a static, virtual hierarchy (i.e., the base tree; see Section 3.2). As a result, the membership tree requires no modification when there are changes in network performance or overlay unicast routes, as long as there is no change in the group membership or tree node failure. Infrequent updates may be performed to provide reasonable forwarding efficiency while delivering data to inactive groups. In addition, to minimize the state requirement, the membership tree is designed to be highly structured so that only a bounded small number of overlay nodes maintain membership information for each group. In contrast, a traditional multicast dissemination tree can potentially involve a large fraction of nodes, which would then have to maintain state.

[7] proposes the use of aggregation of multiple base multicast groups into a single tree to achieve scaling. In the case of MAD, we see the use of such aggregation as complementary to our approach of separating the multicast state between forwarding and membership states. The membership state for inactive groups may be maintained in secondary storage, while the forwarding state (for the dissemination tree) uses valuable primary storage (main memory). The idea of aggregation may be used for the active groups utilizing the DT, to reduce the amount of forwarding state even further. This would be done particularly during the transition of a group from inactive (MT) to active (DT) state. A transitioning group may be placed in an existing DT group if they are “related” in some way. The tradeoff of course is that there is the potential of the DT having additional branches because of the groups being combined, and of sending irrelevant content to some subset of the members of an active group.

2.2 Example

To illustrate the state-maintenance efficiency of the membership tree sub-protocol, we consider an example illustrated in Figure 3(a), which shows a set of routers that are part of a traditional IP multicast tree. This example is for a single multicast group, with a set of 5 users subscribed to the group. With IP multicast-style approaches (e.g., PIM-SM [8], CBT [2]), we observe that every intermediate router in the path, from the root (A) to the first-hop routers that users are connected to, has to maintain the state for this group. In this example, there are 11 routers that have to maintain state.

MAD’s membership tree (MT) protocol significantly reduces the state requirement by limiting the number of distinct routers that have to keep multicast group state. As shown in Figure 3(b), the membership tree consists of only four (shaded) nodes: A, B, C and E, which is significantly fewer than the 11 routers that have to maintain state for the CBT shown in Figure 3(a).

The membership tree itself is illustrated in Figure 3(c). In this example, the core (i.e., root) of the membership tree is selected to be the overlay router A, based on the hash of the group ID. Once the core is determined, the membership tree is constructed on top of a base tree rooted at the core. A base tree is a balanced k-ary tree that comprises all logical overlay routers. For each node, MAD defines a single base tree with this node as its root. All groups that take this node as the core then use this base tree as the basis to construct their membership trees. Our construction of the base tree ensures that the entire tree can be computed on the fly as a function of the core ID without requiring any node to maintain any state for the base tree (see Section 3.2).

The membership tree construction process is as follows. When a subscriber decides to join a group, it issues a join message, which is forwarded towards the core along the base tree rooted at the core until it reaches the first node already on the membership tree. An en route logical overlay router becomes part of the membership tree whenever the subtree rooted at this node in the base tree contains at least a minimum number (2 in this example) of first-hop routers with attached subscribers. Each node on the membership tree keeps the following state: (i) mtChild, the children of this node in the base tree that belong to the MT, and (ii) mtFHs, a list of first-hop (FH) routers with attached subscribers that are downstream of this node in the base tree. Since the base tree can be computed on the fly as a function of the core ID, mtChild can be efficiently encoded as a bit-vector, with one bit for each child. The bit is set to 1 when that child is also on the membership tree. Our protocol further ensures that each FH router appears in the mtFHs field of only one node, which is the first node that belongs to the membership tree and lies on the leaf-to-root path from the FH router to the core in the base tree.

In our example, when the first subscriber S1 (attached to FH router M) joins, its join message is propagated to the core A and the core adds M to its list of FH routers with attached subscribers (i.e., mtFHs). Subsequently, when another subscriber S2 (attached to FH router O) joins, the core notices that the subtree rooted at node C in the base tree now has at least 2 FH routers with subscribers (i.e., M and O). Therefore, the core informs router C to create membership state. Node C then updates its list of FH routers to include M and O. Meanwhile, the core sets a bit indicating that there are subscribers downstream of child C. After subscribers S3 (attached to FH router J) and S4 (attached to FH router K) join, routers B and E create membership state, respectively. Additional subscribers joining a group at the same FH router are handled by the local subscription manager maintaining the subscription state. No additional state has to be maintained at the upstream routers for such subscribers. Eventually, after subscriber S5 (attached to FH router I) joins, only 4 routers (A, B, C, and E) have to maintain membership state. The final state at all the routers that have to maintain state in the membership tree is summarized in Figure 3(d).

Thus, even this limited topology demonstrates that we have fewer (only 4) routers maintaining membership state in contrast to the multicast-style approaches that have more (11) routers in the topology maintaining state. We will compare the state efficiency of different approaches in more detail through both large-scale simulations and mathematical analysis in Section 4.

Figure 3: Examples of MAD trees. (a) IP multicast tree. (b) Membership tree nodes. (c) Membership tree and the corresponding base tree (routers A through O; subscribers S1 to S5 attached to first-hop routers M, O, J, K, and I, respectively). (d) Final membership tree state:

  Node on MT   mtChild (children in MT)   mtFHs (FHs with subscribers)
  A            {B, C} (encoded as 11)     {}
  B            {E} (encoded as 01)        {I}
  C            {} (encoded as 00)         {M, O}
  E            {} (encoded as 00)         {J, K}

2.3 Design Tradeoffs

MAD exposes and exploits the following core tradeoffs:

Control efficiency vs. forwarding efficiency: A key design decision in MAD is to achieve a tradeoff by moderately reducing data forwarding efficiency for inactive groups while obtaining a significant reduction in the overall state requirement and control overhead. We believe such a tradeoff is essential in order to provide multicast service to hundreds of billions of groups (with potentially long-lived membership information) on today’s commercial hardware platforms.

CPU processing vs. state consumption: To minimize state consumption and control overhead, MAD combines the membership tree with unicast routes to forward data to inactive groups. As a result, it requires more CPU processing than traditional multicast for forwarding data to inactive groups. However, we transition to the more efficient dissemination tree for forwarding data once an inactive group becomes more active. The additional overhead related to the membership tree seems acceptable given the importance of state reduction in order to support a vast number of groups.

3. MAD PROTOCOL DESIGN

The MAD protocol consists of five major components: (i) the membership tree protocol, which specifies how to create, maintain, and send messages over the membership tree, (ii) the dissemination tree protocol, which specifies how to create, maintain, destroy, and send messages over the dissemination tree, (iii) state transition mechanisms for switching between the membership tree and the dissemination tree, (iv) failure recovery, and (v) mechanisms for operating across domain boundaries. Below, we first introduce notation and then provide details of each component.

  Symbol    Meaning
  L         number of logical overlay routers, with IDs ∈ [0, L) (we assume that L = 2^b is a power of two)
  FH(s)     first-hop router for a subscriber s
  gid(g)    128-bit group ID for group g
  MT(g)     membership tree for group g
  DT(g)     dissemination tree for group g (when g is active)
  core(g)   common core (i.e., root) of MT(g) and DT(g)
  BT(ℓ)     base tree rooted at logical overlay router ℓ

Table 1: Summary of key notations.

3.1 Preliminaries

We first introduce some notation (summarized in Table 1). The MAD multicast service overlay (illustrated in Figure 1) consists of L logical overlay routers, where L = 2^b is a power of two. Each logical overlay router is uniquely identified by a b-bit router ID (ranging from 0 to L − 1). The L logical overlay routers reside on a set of physical overlay routers. Each logical overlay router ℓ ∈ [0, L) is owned by a single physical overlay router, whereas each physical overlay router may own one or more logical overlay routers. When a physical overlay router fails, all the logical overlay routers that it owns can be taken over by other physical overlay routers (see Section 3.5). We use FH(s) to denote the first-hop router that a subscriber s is connected to.

Each multicast group g is identified by a unique 128-bit group ID gid(g). As stated above, each group g maintains a membership tree MT(g) to record its set of members. If g is active, it also maintains a separate dissemination tree DT(g). MT(g) and DT(g) share a common core (i.e., root) logical overlay router core(g). To balance the load and reduce traffic concentration, we apply a hash function H(·) to map the 128-bit group ID to a random core ID, i.e., core(g) = H(gid(g)). A benefit of this hash-based scheme is that it obviates the need for a separate resolution procedure for mapping group IDs to core IDs, which can become expensive with hundreds of billions of groups.

Finally, for each logical overlay router ℓ, MAD defines a virtual base tree BT(ℓ) rooted at ℓ that includes all L logical overlay routers. BT(ℓ) is the basis for constructing MT(g) for every group g with core(g) = ℓ (see Section 3.2).

3.2 Membership Tree Sub-protocol

Given a multicast group g, MAD’s membership tree construction protocol is designed to ensure that (i) the membership tree MT(g) only comprises a small number of on-tree nodes that maintain group state (so that the total state requirement is reduced), and (ii) MT(g) stays largely unmodified when there are changes in network performance or overlay unicast routes (so that the control overhead is minimized). To achieve these two design goals, we first statically construct a base tree BT(ℓ) for every logical overlay router ℓ, which is a balanced k-ary tree (with each internal node having k children) that is rooted at ℓ and includes all L logical overlay routers. We then construct MT(g) from BT(core(g)) by including only those logical overlay routers that have a sufficient number (Smin) of downstream first-hop routers with attached subscribers to group g. Abstractly, this is much like the formation of a static structure in overlay peer-to-peer networks such as Chord [16]. The primary difference here is the use of a static k-ary tree that enables rapid computation of a node’s parent and children and avoids maintaining any state for the base tree.

Figure 3(c) shows an example of a membership tree and the corresponding base tree. In this example, the core is selected to be logical overlay router A, based on a hash of gid(g). The base tree BT(A) is a balanced k-ary tree (with k = 2) that includes all the logical overlay routers A, B, ..., O. There are 5 subscribers to the group: S1, ..., S5 (whose first-hop routers are M, O, J, K, and I, respectively). MT(g) is constructed from BT(A) by including only those shaded nodes (A, B, C, and E) such that the subtrees of BT(A) rooted at these nodes contain at least Smin = 2 first-hop routers with attached subscribers to group g.

Figure 4: Base tree BT(0). BT(ℓ) is constructed from BT(0) by XORing ℓ with each logical overlay router ID in BT(0).

Base tree construction: At each logical overlay router ℓ, we statically construct a balanced k-ary base tree BT(ℓ) as follows.

• We first construct BT(0) by sequentially placing logical overlay routers 0, ..., L − 1 onto a regular (i.e., constant-fanout) k-ary tree (as shown in Figure 4). Specifically, we place 1 logical overlay router at depth 0 (i.e., the root), k logical overlay routers at depth 1, k^2 logical overlay routers at depth 2, and so on, until all L logical overlay routers have been placed. For d ≥ 1, the logical overlay routers placed at depth d have IDs ranging from $K_d + 1$ to $K_d + k^d$, where $K_d = \sum_{i=1}^{d-1} k^i$.

• We then construct BT(ℓ) from BT(0) by substituting each logical overlay router r in BT(0) with logical overlay router r′ = ℓ ⊕ r, where ⊕ denotes bitwise exclusive or (XOR). For example, the root of BT(ℓ) is ℓ ⊕ 0 = ℓ, and the depth-1 nodes in BT(ℓ) are ℓ ⊕ 1, ℓ ⊕ 2, ..., ℓ ⊕ k.

With our construction of BT(ℓ), for any given logical overlay router r, we can compute its parent and children in BT(ℓ) as a function of ℓ without requiring any node to maintain any state for BT(ℓ). Specifically, it is easy to verify that (i) the parent of r in BT(0) is ⌈r/k⌉ − 1, and (ii) the children of r in BT(0) are rk + 1, rk + 2, ..., rk + k. To obtain r’s parent and children in BT(ℓ), we first compute the parent and children of r′ = ℓ ⊕ r in BT(0) and then XOR ℓ with the resulting router IDs. Meanwhile, note that BT(ℓ) is statically defined, independent of network characteristics such as network performance and overlay unicast routes. Therefore, it helps MT(g) stay largely unmodified when these network characteristics change, thus significantly reducing the control overhead.
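The stateless parent and children computation, together with the hash-based core selection of Section 3.1, can be sketched in a few lines of Python. This is a minimal illustration rather than the MAD implementation: the function names are hypothetical, and SHA-256 reduced modulo L is only an assumed stand-in for the unspecified hash function H(·).

```python
import hashlib

def core_of(gid: bytes, L: int) -> int:
    """Map a group ID to a core router ID in [0, L).
    SHA-256 is an assumed stand-in for the paper's hash function H."""
    return int.from_bytes(hashlib.sha256(gid).digest(), "big") % L

def parent(r: int, core: int, k: int):
    """Parent of router r in BT(core), or None if r is the root."""
    r0 = r ^ core                    # position of r in BT(0)
    if r0 == 0:
        return None
    p0 = (r0 + k - 1) // k - 1       # ceil(r0 / k) - 1, in integer arithmetic
    return p0 ^ core                 # map back into BT(core)

def children(r: int, core: int, k: int, L: int):
    """Children of router r in BT(core); positions at or beyond L do not exist."""
    r0 = r ^ core
    return [c0 ^ core for c0 in range(r0 * k + 1, r0 * k + k + 1) if c0 < L]
```

Because L is a power of two, XOR with the core ID is a bijection on [0, L), so every router appears exactly once in each base tree and no per-tree state is needed.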

// node p receives MtJoinMsg(FH(s), g) from child n in BT(core(g))
if (p ≠ core(g) and p ∉ MT(g))
    p forwards the MtJoinMsg towards core(g) along BT(core(g))
else
    p.g.mtFHs = p.g.mtFHs ∪ {FH(s)}
    Sn(g) = the set of first-hop routers in p.g.mtFHs that are downstream of n in BT(core(g)) (including n itself)
    if (|Sn(g)| ≥ Smin)
        p informs n to create state for g and first-hop routers in Sn(g) by sending a MtNodeCreateMsg to n
        // n may in turn inform its child in BT(core(g)) to create state
        p.g.mtChild[n] = 1
        p.g.mtFHs = p.g.mtFHs − Sn(g)
    end
end

Figure 5: Handling the MtJoinMsg

Worst-case bound on the size of MT(g): By using a balanced k-ary base tree BT(core(g)) as the basis for constructing MT(g), we can bound the maximum depth of MT(g) to $D_{max} = \min\{d : \sum_{i=1}^{d} k^i \ge L\} = \lceil \log_k (L - L/k + 1) \rceil$. For example, with L = 2^16 and k = 16, the depth of MT(g) is bounded by $D_{max} = 4$. Meanwhile, by requiring each node on MT(g) to have at least Smin downstream first-hop routers with attached subscribers, the number of on-tree nodes at the same depth is bounded by $N_F(g)/S_{min}$, where $N_F(g)$ is the total number of first-hop routers with attached subscribers to g. Therefore, the number of nodes in MT(g) is bounded by $D_{max} \times N_F(g)/S_{min}$ even in the worst case. For example, with Smin = 16, k = 16 and L = 2^16, MT(g) never has more than $4 \times N_F(g)/16 = N_F(g)/4$ nodes. In contrast, an IP multicast style dissemination tree could involve all L logical overlay routers in the worst case, and all these routers have to maintain forwarding state for the group. Later in Section 4, we use large-scale simulation and formal analysis to further demonstrate that MT(g) achieves significant state reduction in the average case.

Membership tree state: As described in Section 2.2, each logical overlay router r on MT(g) maintains the following membership tree state for group g. (i) r.g.mtChild: a k-bit bit-vector, with one bit for each child of r in BT(core(g)). The bit is set to 1 when that child is also on MT(g). (ii) r.g.mtFHs: a list of first-hop routers (with attached subscribers) that are downstream of r in the base tree BT(core(g)). Our membership tree construction (detailed in Section 3.2.1) ensures that any first-hop router FH(s) (with an attached subscriber s) appears in exactly one node on MT(g), which is the first node that belongs to MT(g) and lies on the leaf-to-root path from FH(s) to core(g) in BT(core(g)). Also note that when multiple subscribers to group g are connected to the same first-hop router, they are under the management of a common subscription manager and appear as a single aggregated subscriber to the first-hop router. Figure 3(d) summarizes the state maintained at each node on the membership tree shown in Figure 3(c).

Joining a membership tree: Messages of type MtJoinMsg are sent by subscribers to join a membership tree. Such messages are forwarded towards core(g) along the base tree BT(core(g)) in a bottom-up fashion. Meanwhile, the decision on node creation is made in a top-down fashion. Specifically, when a subscriber s wants to join group g, it first sends a MtJoinMsg to its first-hop logical overlay router FH(s). FH(s) determines the core for the group as a hash of the group ID gid(g), and then forwards the MtJoinMsg towards core(g) along the base tree BT(core(g)) (with FH(s) included in the message header). Unlike protocols such as CBT or PIM-SM, when a logical overlay router r not yet on MT(g) receives a MtJoinMsg for group g, r does not instantiate any state right away. Instead, r simply forwards the MtJoinMsg towards core(g) along BT(core(g)). Eventually, this message will reach a node that is already on MT(g), denoted by p. Note that it is possible to have core(g) = p. Suppose the MtJoinMsg received by p comes from p’s child n in BT(core(g)). Upon receiving this MtJoinMsg, p first adds FH(s) to its first-hop router list p.g.mtFHs. p then checks whether it has accumulated Smin first-hop routers (with attached subscribers) from child n. If so, p sends a MtNodeCreateMsg to inform n to create membership tree state for group g. The MtNodeCreateMsg also includes the list of Smin first-hop routers that are downstream of n in BT(core(g)). After n creates the membership tree state, p removes these Smin first-hop routers from p.g.mtFHs and sets the bit corresponding to n in the bit-vector p.g.mtChild to 1, indicating that n is now on MT(g). Note that n may find that all these Smin first-hop routers are downstream of one of its k children in BT(core(g)). In this case, n would further inform this child to create state for g and those first-hop routers by sending a MtNodeCreateMsg.
Figure 5 shows the detailed behavior when a node p receives a MtJoinMsg from child n to subscribe to group g on behalf of first-hop router F H(s). The protocol ensures reliable delivery and processing of the MtJoinMsg with a suitable acknowledgment and retransmission mechanism (not shown here).
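To make the join procedure of Figure 5 concrete, the following Python sketch simulates it in memory. It is a simplified illustration under several assumptions that are not part of the protocol description: the core is router 0 (so the base tree is BT(0)), Python sets stand in for the mtChild bit-vector and the mtFHs list, messages are replaced by direct function calls, and acknowledgments and retransmissions are omitted. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GroupState:
    mt_child: set = field(default_factory=set)  # children that are on MT(g)
    mt_fhs: set = field(default_factory=set)    # first-hop routers recorded here

def parent0(r: int, k: int) -> int:
    return (r + k - 1) // k - 1                 # parent in BT(0)

def is_downstream(fh: int, n: int, k: int) -> bool:
    """True if fh lies in the subtree of BT(0) rooted at n (n included)."""
    while fh > n:                               # parent IDs are always smaller
        fh = parent0(fh, k)
    return fh == n

def create_node(tree, n, fhs, k, s_min):
    """Effect of MtNodeCreateMsg: n creates state and may push it further down."""
    tree[n] = GroupState(mt_fhs=set(fhs))
    for c in range(n * k + 1, n * k + k + 1):
        s_c = {f for f in tree[n].mt_fhs if is_downstream(f, c, k)}
        if len(s_c) >= s_min:
            tree[n].mt_child.add(c)
            tree[n].mt_fhs -= s_c
            create_node(tree, c, s_c, k, s_min)

def handle_join(tree, fh, k, s_min):
    """Forward a join from first-hop router fh up BT(0) and apply Figure 5 at p."""
    n, p = None, fh
    while p not in tree:                        # off-tree routers only forward
        n, p = p, parent0(p, k)
    tree[p].mt_fhs.add(fh)
    if n is None:                               # fh itself is already on MT(g)
        return
    s_n = {f for f in tree[p].mt_fhs if is_downstream(f, n, k)}
    if len(s_n) >= s_min:
        tree[p].mt_child.add(n)
        tree[p].mt_fhs -= s_n
        create_node(tree, n, s_n, k, s_min)
```

Starting from `tree = {0: GroupState()}` with k = 2 and Smin = 2, and numbering the routers of Figure 3(c) as A = 0 through O = 14 (an assumed numbering), joins from M, O, J, K and I leave exactly A, B, C and E on the membership tree, and each first-hop router is recorded at exactly one node.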

3.2.1 Membership Tree Maintenance

Leaving a membership tree: Messages of type MtLeaveMsg are used to leave a membership tree. Specifically, the first-hop router FH(s) for a subscriber s sends a MtLeaveMsg towards core(g) along the base tree BT(core(g)) until it reaches the first node n that is on MT(g). Upon receiving the MtLeaveMsg, n first removes FH(s) from n.g.mtFHs. n then checks whether n.g.mtChild is empty and n.g.mtFHs contains fewer than Smin first-hop routers. If so, n determines that it should no longer stay on MT(g). n then sends a MtNodeDeleteMsg to its parent in MT(g) with the list of first-hop routers in n.g.mtFHs (which will be maintained by this parent henceforth) before n purges the state associated with group g. Note that the parent of n may in turn decide to delete itself from MT(g) and send a MtNodeDeleteMsg to its own parent (i.e., the grandparent of n in MT(g)).

MAD uses a separate dissemination tree for each active group to achieve better forwarding efficiency. The dissemination tree can be maintained using any existing (or future) multicast protocol. Our current instantiation of MAD uses the standard Core Based Tree (CBT) [2] protocol for constructing and maintaining the dissemination tree. For convenience, the dissemination tree and the membership tree share a common core.

Dissemination tree state: We assume that the number of overlay unicast neighbors for each overlay router is upper bounded by MaxNodeDegree. Since the overlay topology is under our control, it is possible to enforce such a constraint. We currently set MaxNodeDegree = 31 in MAD, which means each overlay router can have no more than 31 neighbors. Under this assumption, each node n in a dissemination tree simply maintains a bit-vector n.g.dtChild (with MaxNodeDegree + 1 bits) to indicate whether there are any subscribers downstream of each neighboring overlay router. The extra bit in n.g.dtChild indicates whether n has any attached subscriber(s) to this group. Finally, a hash table indexed by the 128-bit group ID is used to store all the dissemination tree state for different groups.

Dissemination tree construction: The current group status (i.e., whether it is active or inactive) is decided by the group’s core in the membership tree. Once core(g) determines that a group g has become sufficiently active and that a dissemination tree needs to be created for g, it multicasts a DtBuildMsg along the membership tree MT(g) in a top-down fashion to all first-hop overlay routers of MT(g). Every first-hop router that receives this message joins DT(g) by sending a DtJoinMsg towards core(g). The DtJoinMsg is handled according to the standard CBT protocol.

Dissemination tree based forwarding: Similar to membership tree based forwarding, in order to send a message M to a group g over the dissemination tree DT(g), M is first sent to core(g) through overlay unicast. From there, data is forwarded down DT(g) according to the CBT protocol. In addition, when the data reaches a node n whose local subscriber bit in n.g.dtChild is set to 1, it is forwarded to the subscription manager at n through underlay unicast. The subscription manager in turn forwards M to the appropriate subscribers under its management.
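The per-group dissemination tree state fits in a single 32-bit word. The sketch below illustrates one possible encoding; the bit layout (neighbor i in bit i, local subscribers in the top bit) and the helper names are assumptions made for illustration, since the text fixes only the total width of MaxNodeDegree + 1 bits.

```python
MAX_NODE_DEGREE = 31
LOCAL_BIT = MAX_NODE_DEGREE          # assumed position of the local-subscriber bit

def set_neighbor(dt_child: int, i: int) -> int:
    """Mark neighbor i as having subscribers downstream."""
    return dt_child | (1 << i)

def clear_neighbor(dt_child: int, i: int) -> int:
    return dt_child & ~(1 << i)

def set_local(dt_child: int) -> int:
    """Mark that this node has attached subscribers of its own."""
    return dt_child | (1 << LOCAL_BIT)

def downstream_neighbors(dt_child: int):
    return [i for i in range(MAX_NODE_DEGREE) if (dt_child >> i) & 1]

def has_local(dt_child: int) -> bool:
    return bool((dt_child >> LOCAL_BIT) & 1)

# Hash table from 128-bit group ID to the node's dtChild word for that group.
dt_state: dict[bytes, int] = {}
```

A node would look up `dt_state[gid]`, forward to every neighbor returned by `downstream_neighbors`, and hand the message to its subscription manager when `has_local` is true.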

3.2.2 Membership Tree Based Forwarding

The membership tree can be combined with overlay unicast to deliver both control and data messages to the group as follows.

• When a node wishes to multicast a message M to group g, it first forwards M to core(g) through overlay unicast. This allows us to easily implement critical functionality such as per-group access control, authentication, authorization, accounting and traffic monitoring at the core. Existing multicast protocols such as Scribe [5] take a similar approach.

• After receiving M, core(g) first forwards M to any child in BT(core(g)) that belongs to MT(g) (according to bit-vector core(g).g.mtChild) through overlay unicast. In addition, core(g) forwards M to all first-hop routers in core(g).g.mtFHs. Note that core(g) itself may appear in core(g).g.mtFHs when it has attached subscribers. In this case, core(g) also forwards M to its subscription manager through underlay unicast, which then forwards M to the appropriate subscribers that it manages.

• After a child n receives M, it repeats the same procedure above to forward M to first-hop routers listed in n.g.mtFHs and its own children in MT(g) (as specified in n.g.mtChild). Eventually, when a first-hop router FH(s) receives M, it sends M to its subscription manager through underlay unicast.

Avoiding redundant overlay unicast messages: In our protocol for multicast forwarding, overlay unicast is used by each node n to forward messages to its children in MT(g) as well as the list of first-hop routers n maintains. A naive implementation of such multicast message forwarding could lead to a situation where the same message is forwarded many times to the same next-hop overlay router if this next hop is shared among several overlay unicast routes. MAD supports the option to trade processing overhead for better bandwidth efficiency by packing all these “redundant” messages into a single message and defining a variable-length header to specify all the recipients (as in Xcast [3]).
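The forwarding steps and the recipient-packing option can be sketched as follows. This is an illustrative model only: per-node state is held in plain dictionaries, `next_hop` is a hypothetical map from a destination router to the next overlay hop on its unicast route, and actual message transmission is replaced by collecting the set of first-hop routers reached.

```python
from collections import defaultdict

def mt_deliver(tree, node, reached=None):
    """Return the first-hop routers reached when a message is forwarded down MT(g)."""
    if reached is None:
        reached = set()
    state = tree[node]
    reached |= state["mtFHs"]                # overlay unicast to recorded first-hop routers
    for child in state["mtChild"]:           # then recurse into children that are on MT(g)
        mt_deliver(tree, child, reached)
    return reached

def pack_by_next_hop(recipients, next_hop):
    """Group recipients that share a next hop so that one packed copy is sent per hop."""
    packed = defaultdict(list)
    for r in sorted(recipients):
        packed[next_hop[r]].append(r)
    return dict(packed)
```

With packing, a node sends one message per distinct next hop, carrying the recipient list in the variable-length header, instead of one message per recipient.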

3.3 Dissemination Tree Sub-protocol

3.4 Mode Transition

A major challenge in the design of MAD is how to ensure a smooth transition between membership tree based forwarding (i.e., the inactive mode) and dissemination tree based forwarding (i.e., the active mode). In particular, it is essential to avoid disruption of the multicast data delivery service during mode transition.

Our basic strategy for achieving smooth mode transition is to require every group in the system to always maintain the membership tree, even when the group is considered active and the dissemination tree has been constructed. Having an always up-to-date membership tree ensures that during the transition period we can use membership tree based forwarding to deliver messages reliably to all the group members, with very little additional control overhead. When a group transitions from being active to inactive, the transition is achieved simply by not forwarding on the dissemination tree and “tearing it down”. However, the transition from inactive to active state for a group has to be handled more carefully to ensure no data is lost. We continue to forward along branches of the static membership tree until we receive a definitive notification at the root of a subtree indicating that all the nodes below have made the transition to the active dissemination tree.

To facilitate smooth mode transition, each node n in a membership tree maintains the following additional information for a given group g: (i) n.g.mode, the current mode of the group, which can have value inactive, active, or transient (i.e., in transition from inactive to active mode); and (ii) n.g.trFHs and n.g.trChild, the transient mode state that records the list of first-hop routers and children in MT(g) that have not yet successfully joined the dissemination tree (i.e., cannot receive messages from the core in DT(g)) in the transient mode.

We also piggyback the following control information in every data message sent by the core: a tree bit and a flush bit. The tree bit specifies which tree is used to deliver the message (1 for MT, 0 for DT). When the flush bit is set, a node purges the DT state for the destination group of the message, if any. See below for details on the use of such control information.

3.4.1 Transition from Inactive to Active Mode

Every new group g initially stays in the inactive mode. As group activity becomes high enough, core(g) may decide to improve forwarding efficiency by creating a separate dissemination tree DT(g). A key challenge here is how to determine whether the construction of DT(g) has completed, because prematurely sending a message over an incomplete DT(g) can easily cause some group members to miss that message, which may be unacceptable. Our solution is to deliver data messages over both MT(g) and DT(g) during the transition (i.e., in transient mode) and stop delivering data to a subtree of MT(g) only after the root of the subtree is certain that all existing group members residing in the subtree are able to receive data from the dissemination tree. Specifically, once a first-hop router FH(s) successfully receives its first DT-delivered message (with tree bit set to 0), it sends a DtJoinCompleteMsg towards core(g) along the base tree until it reaches the first node on MT(g) (denoted by n). n removes FH(s) from n.g.trFHs and stops forwarding to FH(s) any future MT-delivered messages (with tree bit set to 1). When both n.g.trFHs and n.g.trChild become empty, n suppresses any future MT-delivered messages from its parent in MT(g) by sending a DtJoinCompleteMsg.

3.4.2 Transition from Active to Inactive Mode

The transition from the active mode to the inactive mode is much simpler and can be performed almost instantaneously. Specifically, once core(g) decides that the dissemination tree DT(g) should be destroyed, it immediately changes the current group state from active to inactive, and all subsequent messages sent to the group g are delivered over the membership tree MT(g). The tree bit of the message header is set to 1 to indicate that MT(g) is used for message delivery. Meanwhile, the core sets the flush bit to 1 in all messages sent to the group in the inactive mode, informing all the first-hop routers to leave the dissemination tree for the group. Every first-hop router receiving such a message sends a DtLeaveMsg along the dissemination tree towards core(g) to leave DT(g), which is processed in the same way as in CBT [2].

3.4.3 Multicast Forwarding in Different Modes

As stated above, a message sent to group g is first sent to core(g) through unicast (unless the sender is the core itself). The core then multicasts the message down either MT(g) or DT(g) or both trees based on the current mode of the group. An on-tree node forwards a message in the following way:

• Inactive mode: Forward the data message to all children in MT(g) (as specified by the bit-vector mtChild) and all first-hop routers listed in mtFHs. If the node itself appears in mtFHs, deliver the message to the subscription manager.

• Active mode: Forward the data message to all children in DT(g) (specified by dtChild). Deliver the data message to the subscription manager when the local subscriber bit in dtChild is 1.

• Transient mode: Forward the data message over both MT(g) and DT(g). When forwarding messages along MT(g), use trChild and trFHs instead of mtChild and mtFHs to avoid forwarding messages to those users that can already successfully receive data from DT(g).

As mentioned, MAD can utilize any existing multicast mechanism for content dissemination to active groups. The only requirement is that the mechanism supports the following hook: upon receiving the first message over the dissemination tree (i.e., using the existing multicast mechanism) after joining the group, the router must send a DtJoinCompleteMsg as described in Section 3.4.1.
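The per-mode forwarding rules and the handling of DtJoinCompleteMsg can be summarized in a short sketch. It is illustrative only: Python sets replace the bit-vectors, the names are hypothetical, and it assumes (as our reading of Section 3.4.1) that a DtJoinCompleteMsg arriving from a child on MT(g) clears that child's entry in trChild.

```python
from enum import Enum

class Mode(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"
    TRANSIENT = "transient"

def forwarding_targets(s):
    """Return (targets reached via MT, targets reached via DT) for one data message."""
    if s["mode"] is Mode.INACTIVE:
        return s["mtChild"] | s["mtFHs"], set()
    if s["mode"] is Mode.ACTIVE:
        return set(), set(s["dtChild"])
    # Transient: both trees, but only MT targets that have not yet joined DT(g).
    return s["trChild"] | s["trFHs"], set(s["dtChild"])

def on_dt_join_complete(s, sender):
    """Record that sender now receives data over DT(g).
    Returns True when this node should send its own DtJoinCompleteMsg upstream."""
    s["trFHs"].discard(sender)
    s["trChild"].discard(sender)
    return not s["trFHs"] and not s["trChild"]
```

In transient mode the set of MT targets shrinks as completions arrive, so duplicate delivery over both trees is limited to the members that have not yet confirmed reception over DT(g).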

3.4.4 Handling Joins and Leaves in Different Modes

When a new subscriber s wishes to join a group g, s first contacts its first-hop router f = FH(s). If f is already on MT(g), it simply adds f to f.g.mtFHs (if it has not done so before). Otherwise, f sends a MtJoinMsg towards the core along BT(core(g)). The MtJoinMsg keeps propagating until it reaches a node p already on MT(g). The MtJoinMsg is handled by p based on its current mode as follows.

• Inactive mode: In this mode, p first updates its membership tree state (i.e., p.g.mtChild and p.g.mtFHs). p then responds with a MtJoinAckMsg to indicate that the MtJoinMsg is successful. p also checks if any of its children c has at least Smin downstream members in BT(core(g)). If so, p sends a MtNodeCreateMsg to inform the child c to create membership tree state. After c creates the membership tree state, it may further inform one of its own children c′ to create membership tree state (if c′ has at least Smin downstream first-hop routers with attached subscribers).

• Active mode: In this mode, p first processes the MtJoinMsg just as in the inactive mode. p then informs f to join the dissemination tree by sending a DtBuildMsg. Upon receiving this DtBuildMsg, f sends a DtJoinMsg to join the dissemination tree.

• Transient mode: In this mode, to avoid potential race conditions, p temporarily refuses to take on a new child. Therefore, p responds with a MtJoinNackMsg to inform f that f needs to resend the MtJoinMsg after a short timeout. As we show in Section 5, the transition latency is typically very short, below 2 seconds in all our experiments. As a result, temporarily disallowing MtJoinMsg for such a short amount of time is unlikely to impact user-perceived performance.

When s decides to leave group g, it contacts its first-hop router f = FH(s). f then tries to remove s from both MT(g) and DT(g). Specifically, to remove s from MT(g), f acts as if it has received a MtLeaveMsg and handles it as described in Section 3.2. Meanwhile, f checks if it is on DT(g). If so, f follows the usual CBT mechanisms by sending a DtLeaveMsg towards the core.

3.5 Failure Recovery

MAD handles failures through replication, in a manner similar to other overlay based approaches [5, 12]. We exploit the advantage of being able to establish connectivity in the overlay dynamically, in response to a failure. The main concern we address here is the careful management of state specific to MAD. Specifically, for each physical overlay router in our system, we designate a set of physical overlay routers as its shadow routers. To minimize replicated state, we only store state related to leaf nodes of the membership tree. Once the membership tree is repaired, it can perform normal multicast data delivery. For each logical overlay router ℓ owned by a physical overlay router p, the shadow routers of p only need to replicate the user subscription state (i.e., mtFHs) for all enrolled groups. The replicated state is saved in the stable storage (e.g., hard disk drive) of all the shadow routers of p. There is a keep-alive message exchange between a physical overlay router p and its shadow routers p′ for fast failure detection and recovery. For each logical overlay router ℓ that was previously owned by p, p′ needs to recover the role of ℓ in every group that ℓ is involved in.

Leaf: ℓ may be a leaf node in the membership tree MT(g) for group g. To take over such a role, p′ sends a MtJoinMsg towards the core of group g along base tree BT(core(g)). We save bandwidth by aggregating multiple MtJoinMsg with the same core into a single message with a list of group IDs.

On-tree node: ℓ may be an internal node of either a membership tree or a dissemination tree. MAD takes advantage of on-tree nodes in both MT and DT sending heartbeat messages down the tree. Upon failure detection, the child repairs the tree by sending a join message (i.e., either MtJoinMsg or DtJoinMsg) to its parent.

Core: ℓ may be the core for a group. After a failure, p′ starts receiving join messages from children of ℓ in group g. p′ infers the mode information from the received DtJoinMsg.

It is important to note that MAD only involves system-wide keep-alive messages (between each physical overlay router and its shadow routers) as opposed to per-group keep-alive messages. Therefore, the control overhead due to keep-alive messages is independent of the number of multicast groups. In comparison, IP multicast style protocols such as CBT require per-group keep-alive messages to retain forwarding state. Our evaluation in Section 4 suggests that such per-group control overhead can become quite expensive when there are a large number (e.g., billions) of groups and a large number of them are inactive.

3.6 MAD across Administrative Domains

MAD groups a set of routers in the same region or network domain (e.g., university network, corporate network, and AS) to form a “MAD domain”. MAD domains serve two goals: (i) enable MAD to operate across multiple administrative domains, and (ii) handle heterogeneity and load imbalance by promoting autonomous decisions in local networks.

3.6.1 Leaders and Super-domains

Given a MAD domain, a subset of routers are eligible candidates to be selected as the leader for a group. For any group with members in this MAD domain, a leader router is selected from the set of candidate leaders uniformly at random (as a hash of the group ID). All communications for the group (both in and out of the domain) pass through its leader, which is similar to the role of MBR (Multicast Border Routers) in PIM-SM [8]. The set of candidate leaders is exposed outside of the MAD domain. The union of leaders in all of the MAD domains forms the super-domain. The super-domain is responsible for forwarding multicast traffic between domains. The group core is selected from participating leaders in the super-domain. To forward multicast traffic, the core first sends traffic to leaders in all participating MAD domains; the leader in each MAD domain then forwards the traffic to members in that domain.

3.6.2 Autonomy

MAD supports autonomous decisions in MAD domains. A local MAD domain can make its own decision for specific groups to communicate using either DT for efficient forwarding or a resource-efficient MT. This enables us to exploit (a) the spatial locality of group activity, and (b) the resource efficiency in local administrative domains. Specifically, a group can be in DT mode to efficiently forward frequent updates to a large number of receivers in a local domain where popular local events are being held. Conversely, to use resources efficiently, MAD domains in a resource-starved region (e.g., with low-end routers) may not be able to afford the use of the more state-intensive DT communication for all the globally popular groups that are of less interest within the region. MAD domains achieve local privacy by keeping sensitive data (e.g., number of subscribers and subscriber IP addresses) within the local network. When building MT state, leaders do not export subscriber information outside of the domain even if the current enrollment size is below the threshold.

3.6.3 Additional Details

Mode transition: Instead of having a group change modes from MT to DT across the entire network in an all-or-nothing fashion, each domain can decide to alter the forwarding mode for a group and inform the core node of that decision. Depending on the global activity and resource availability, the core can then decide to use MT or DT to reach leader nodes. Note that core-to-leader communication can use a different mode from leader-to-leaf communication even for the same group.

Leader ID and core ID: The list of leader node IDs is maintained in all the routers within the domain. We prepend the domain ID to the node ID to support multiple MAD domains. MAD routers in the super-domain have a special domain ID; the core of the group is selected from the super-domain using a hash of the group ID. The leader of the given group in each domain is selected from the leader list in a similar manner.

4. SCALING OF MAD TREES

In this section, we conduct extensive simulations on realistic network topologies to examine the state requirement of MAD trees and the tradeoff between state reduction and forwarding cost. To gain additional insights into the scaling of MAD trees, we also analytically derive the state requirement. In our model, rather than the number of subscribers, the number of distinct subscription managers that are involved in a group (essentially the number of FH routers) reflects the scaling of the system. Thus, our results examine the scaling based on the number of FH routers in the group or system.

4.1 Simulation Evaluation

Simulator: We developed a simulator (with around 6,000 lines of C/C++ code, which achieves scalability by avoiding the simulation of packet-level events) to compute the state requirement, forwarding cost, and control overhead of the dissemination tree (CBT), the membership tree, and MAD for any given group. Note that our latency results do not include any queueing delays.

Topologies: Our simulator constructs overlay network topologies as follows: (i) generate an underlay network topology that comprises 16,000 routers in all; (ii) use shortest hop-count routing to obtain the underlay distance (i.e., hop count) between each pair of routers; and (iii) construct the overlay network topology as the union of m edge-disjoint Minimum Spanning Trees (MSTs) for the logical full mesh (i.e., clique) over the 16,000 routers. The underlay topology we consider is either a power-law topology (pow-16k), or a transit-stub topology with stub (access) nodes and transit nodes (ts-16k). In addition to computing hop count between overlay routers, we also use the DS² delay space synthesizer [18] to directly synthesize a realistic underlay distance matrix (ds2-16k).

Simulation setup: In our experiments, we randomly form 100 multicast groups, each with a fixed group size. We then vary this fixed size of the multicast group with respect to the number of first-hop routers. For each group, we compute the required state, forwarding cost, and control overhead of CBT and MT as follows.

• We compute the total tree state stored by all on-tree nodes. For CBT, an on-tree node stores a 128-bit group ID and a 32-bit bit-vector dtChild (which specifies all interfaces with members downstream). For MT, an on-tree node stores a 128-bit group ID, a 16-bit bit-vector mtChild (which indicates whether each child in the base tree belongs to MT), and a list of first-hop overlay routers in the subtree rooted at this node (mtFHs).

• The forwarding cost for a message is measured by $\sum_e N(e) \times D(e)$, where e enumerates all overlay links, N(e) is the number of times the message traverses e, and D(e) is the distance of e given by either its underlay hop count (for pow-16k and ts-16k) or underlay latency (for ds2-16k).

• The control overhead is measured by the total number of keep-alive (i.e., hello) messages sent by the group in a second. We use the default keep-alive message interval of 60 seconds in CBT [1]. That is, each node in a CBT sends a keep-alive message to its parent once every 60 seconds.

Finally, to compute the state requirement, forwarding cost and control overhead of the MAD protocol, we assume that 10% of groups are active, and that they contribute 75% of the data traffic. These fractions are chosen based on the publishing behavior of RSS feeds as shown in Figure 2(b).

Simulation results: Figure 6 compares the state requirement, forwarding cost, and control overhead for ds2-16k and pow-16k, where every data point is the average over 100

[Figure 6 panels: (a) Total tree state for a group (ds2-16k); (b) Total delay (ds2-16k); (c) Control overhead (ds2-16k); (d) Total tree state for a group (pow-16k); (e) Total underlay hops (pow-16k); (f) Control overhead (pow-16k). x-axis: Number of First-Hop Routers in a Group (0 to 8000). y-axes: Total Tree State (KB), Total Delay (Seconds), Total Underlay Hops, Control Messages Per Second. Curves: MAD, MT, CBT (MAD and CBT only in the control overhead panels).]

Figure 6: Scaling of MAD trees on different overlay topologies.

[Figure 7 panels: (a) ds2-16k; (b) pow-16k; (c) ts-16k. x-axis: Number of First-Hop Routers in a Group (0 to 8000). y-axis: Number of Groups (Billions), log scale from 1E+00 to 1E+05. Curves: MAD, MT, CBT.]

Figure 7: Maximum number of groups that $2^{16}$ overlay routers (each with 3GB memory) can hold.

random groups (of the same size).¹ The results for ts-16k are quantitatively similar and are omitted in the interest of brevity. Figures 6(a) and 6(d) compare the total tree state required by CBT, MT, and MAD. By combining MT with CBT, MAD achieves nearly an order of magnitude state reduction over CBT. Figures 6(b) and 6(e) show the forwarding cost of CBT, MT, and MAD. The total forwarding cost for MAD is very close to that of CBT (both in delay and in number of hops traversed), and MAD significantly outperforms MT as the size of the group increases. Figures 6(c) and 6(f) show the control overhead of CBT and MAD (measured by the number of keep-alive messages per second from each group). MAD achieves an order of magnitude reduction in control overhead over CBT, because 90% of the groups are inactive and only have to maintain the MT, which requires no per-group keep-alive messages (see Section 3.5).

To further demonstrate the state efficiency of MAD, Figure 7 shows the maximum number of groups that an overlay with $2^{16}$ overlay routers (each with 3GB memory) can hold. MAD can easily support hundreds of billions of groups on today's commercial hardware platforms. For example, with a group size of 16, MAD supports 913 billion groups for pow-16k, 564 billion groups for ds2-16k, and 750 billion groups for ts-16k, yielding a factor of 7–9 improvement over CBT. Note that such improvement is close to optimal, because in our simulation 10% of the groups are active and maintain both MT and CBT. The state reduction is thus bounded by

$$\frac{state_{CBT}}{0.1 \times state_{CBT} + state_{MT}} \le 10.$$

Summary: Our simulation results clearly demonstrate that MAD achieves both the improved forwarding efficiency of CBT and the state reduction and low control overhead of MT.
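The factor-of-10 ceiling on the state reduction can be checked with a short calculation. The sketch below assumes that every group keeps its MT state and that only the active fraction of groups additionally keeps CBT state; the symbols $\alpha$ (active fraction) and $m$ (ratio of MT state to CBT state) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
state_{MAD} &= state_{MT} + \alpha \cdot state_{CBT}, \qquad \alpha = 0.1 \\
\frac{state_{CBT}}{state_{MAD}}
  &= \frac{state_{CBT}}{\alpha \cdot state_{CBT} + state_{MT}}
   = \frac{1}{\alpha + m}, \qquad m = \frac{state_{MT}}{state_{CBT}} \ge 0 \\
  &\le \frac{1}{\alpha} = 10
\end{align*}
```

Under this simple model, a measured improvement of 7 to 9 would correspond to $m$ between roughly 0.04 and 0.01, that is, MT state of about 1% to 4% of CBT state for the same group.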

¹ We have also computed the 95% confidence intervals. We do not present them because they are too narrow to plot in the figure.

4.2 Formal Analysis of State Requirement

Consider an overlay with $L$ logical overlay routers. Given a multicast group $g$ that has $F$ randomly selected first-hop routers to which subscribers to group $g$ are connected, below we analytically derive the expected state requirement for both $CBT(g)$ and $MT(g)$ with respect to $F$.

Membership tree state requirement: Since all the base trees constructed in Section 3.2 are isomorphic, without loss of generality we can assume that $core(g) = 0$. Therefore, $MT(g)$ is constructed based on the base tree $BT(0)$, which is illustrated in Figure 4. For any logical overlay router $i$, let $N_i$ be the number of nodes in the subtree of $BT(0)$ rooted at $i$ (including node $i$ itself), and let the random variable $X_i$ denote the number of first-hop routers with subscribers in the same subtree. Assuming that the $F$ first-hop routers with attached subscribers are selected uniformly at random from all $L$ logical overlay routers, $X_i$ has a hypergeometric distribution:

$$\Pr(X_i = n) = \frac{\binom{F}{n}\binom{L-F}{N_i-n}}{\binom{L}{N_i}}.$$

The mean $\mu_i$ and variance $\sigma_i^2$ of $X_i$ are given by

$$\mu_i = N_i \frac{F}{L}, \qquad \sigma_i^2 = N_i \cdot \frac{L-N_i}{L-1} \cdot \frac{F}{L} \cdot \frac{L-F}{L},$$

respectively. In order for logical overlay router $i$ to become a member of $MT(g)$, we need $X_i \ge S_{min}$. The probability of this event can be bounded by applying Chebyshev's Inequality:

$$\Pr(X_i \ge S_{min}) \le \left(\frac{\sigma_i}{\max(\sigma_i,\ S_{min}-\mu_i)}\right)^2 \triangleq P_i^{MT}$$

where $\mu_i$ and $\sigma_i^2$ are the mean and variance of $X_i$ (see above). The expected number of nodes on $MT(g)$ is therefore bounded by $\sum_{i=0}^{L-1} P_i^{MT}$. Each node $n$ on $MT(g)$ stores a 128-bit group ID (as the key for the forwarding table), a 16-bit bit-vector mtChild, and a list of 16-bit first-hop router IDs (mtFHs). Since each 16-bit first-hop router ID is stored only once, the total amount of state devoted to mtFHs is $16 \times F$ bits. Therefore, the expected total number of bits for the membership tree state is:

$$state_{MT} \le 16 \times F + (128 + 16) \times \sum_{i=0}^{L-1} P_i^{MT}$$

Dissemination tree state requirement: For simplicity, we only consider the state requirement of $CBT(g)$ in the special case where all the unicast routes destined to $core(g)$ together form a balanced $k$-ary tree isomorphic to $BT(0)$. This is likely to underestimate the state requirement of CBT in the more general case where the unicast routes do not have such a regular underlying structure. However, our results clearly show that even in this special case, MT and MAD achieve much better state efficiency than CBT. In the special case we consider, let $N_i$ and $X_i$ be defined as in the analysis for $MT(g)$ (see above). A logical overlay router $i$ becomes an on-tree node of $CBT(g)$ whenever $X_i > 0$. The probability of this event is given by

$$P_i^{CBT} = 1 - \Pr(X_i = 0) = 1 - \frac{\binom{F}{0}\binom{L-F}{N_i-0}}{\binom{L}{N_i}} = 1 - \frac{\binom{L-F}{N_i}}{\binom{L}{N_i}}$$

The expected number of nodes on $CBT(g)$ is $\sum_{i=0}^{L-1} P_i^{CBT}$. Each node $n$ on $CBT(g)$ stores a 128-bit group ID (as the key for the forwarding table) plus a 32-bit bit-vector n.g.dtChild. Therefore, the expected number of bits for the dissemination tree is:

$$state_{CBT} = (128 + 32) \times \sum_{i=0}^{L-1} P_i^{CBT}$$

Numerical results: We numerically compute the state requirement for MT and CBT (with $k = 16$) for different sizes of the group and the overlay network. As shown in Figure 8, MAD achieves an order of magnitude state reduction over CBT. Moreover, depending on the group size, MAD can easily support hundreds of billions of groups on today's commercial hardware. These results are consistent with our simulation results on more realistic topologies (i.e., Figure 6 and Figure 7) and clearly demonstrate MAD's ability to significantly reduce group state by decoupling membership and dissemination.

[Figure 8 plots, panels (a) Total tree state for a group, in KB, and (b) Max number of groups that $2^{16}$ overlay routers (each with 3GB memory) can hold, in billions, log scale. x-axis: number of first-hop routers in a group. Curves: MAD, MT, CBT.]

Figure 8: Analytical results on the scaling of MAD trees.

[Figure 9 image: Emulab node assignment (left) and geographical topology of sprintlink-us (right).]

Figure 9: Topology of sprintlink-us
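The analytical model of Section 4.2 can be evaluated numerically with a few lines of code. The following Python sketch is illustrative only: the function names, the toy 7-router binary tree, and the value chosen for S_min are assumptions made here, and the capacity estimate assumes that group state spreads evenly over the routers' memory. The bit widths (128, 16, 32) are the ones stated in the text.

```python
from math import comb, sqrt

GROUP_ID_BITS = 128   # forwarding-table key, from the text
MT_CHILD_BITS = 16    # mtChild bit-vector, from the text
DT_CHILD_BITS = 32    # dtChild bit-vector, from the text
FH_ID_BITS = 16       # one first-hop router ID, from the text


def p_mt(n_i, L, F, s_min):
    """Chebyshev-style upper bound on Pr(X_i >= s_min) for a subtree of size n_i."""
    mu = n_i * F / L
    var = n_i * ((L - n_i) / (L - 1)) * (F / L) * ((L - F) / L)
    sigma = sqrt(var)
    denom = max(sigma, s_min - mu)
    if denom <= 0:
        # sigma == 0 and mu >= s_min: X_i is deterministic and meets the threshold.
        return 1.0
    return (sigma / denom) ** 2


def p_cbt(n_i, L, F):
    """Exact Pr(X_i > 0) = 1 - C(L-F, n_i) / C(L, n_i)."""
    return 1.0 - comb(L - F, n_i) / comb(L, n_i)


def state_mt_bits(subtree_sizes, F, s_min):
    """Upper bound on expected MT state; subtree_sizes holds N_i for every router."""
    L = len(subtree_sizes)
    nodes = sum(p_mt(n, L, F, s_min) for n in subtree_sizes)
    return FH_ID_BITS * F + (GROUP_ID_BITS + MT_CHILD_BITS) * nodes


def state_cbt_bits(subtree_sizes, F):
    """Expected CBT state in the balanced-tree special case."""
    L = len(subtree_sizes)
    nodes = sum(p_cbt(n, L, F) for n in subtree_sizes)
    return (GROUP_ID_BITS + DT_CHILD_BITS) * nodes


def max_groups(per_group_bits, routers, mem_bytes_per_router):
    """Groups that fit if state is spread evenly over all routers (assumption)."""
    return routers * mem_bytes_per_router * 8 / per_group_bits


# Hypothetical toy overlay: complete binary tree with 7 routers (not BT(0) itself).
sizes = [7, 3, 3, 1, 1, 1, 1]
mt = state_mt_bits(sizes, F=2, s_min=2)
cbt = state_cbt_bits(sizes, F=2)
mad = mt + 0.1 * cbt   # 10% of groups are active and also keep CBT state
print(mt, cbt, mad)
```

The toy tree only exercises the formulas; reproducing Figure 8 would require the actual subtree sizes of BT(0) from Section 3.2 and the value of S_min used by the authors, neither of which is restated in this section.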

5. EVALUATION OF IMPLEMENTATION

We implemented the complete MAD protocol in Java 1.4.212. We use FreePastry 1.4.4 [13] as the base infrastructure to provide a convenient abstraction for node maintenance and unicast routing, without any tie-in to the DHT architecture. Based on our prototype implementation, we evaluate the state efficiency, forwarding efficiency, and mode transition cost in MAD.

5.1 Experimental Setup

We conducted extensive experiments on the Emulab testbed [17]. Our experiments involve 105 Emulab nodes that range from systems with an Intel Pentium III with 256MB RAM to a Xeon with 2GB RAM. We use the sprintlink-us network topology available from Rocketfuel [14], with link latencies inferred in [10] and zero link loss. To control the total number of nodes in the network, we vary the number of routers in each city (PoP). Routers in the same city are connected via a LAN with zero latency. Figure 9 illustrates the node assignment of Emulab (left) and the geographical topology of sprintlink-us (right). For each group, we select first-hop routers to join the group uniformly at random from the entire set of nodes in the experiment. Similar to our simulation, we assume that 10% of the groups are active and that they contribute 75% of the data traffic (based on the publishing characteristics of RSS feeds in Figure 2). Finally, we use the default settings of FreePastry-1.4.4.

5.2 Evaluation Results

State efficiency: We first look at the results from experiments to show the benefit of the membership tree (MT) over the dissemination tree (CBT) with respect to the total state stored at on-tree routers. Figure 10 shows the number of nodes that maintain state for a group, as a function of the number of subscribed first-hop routers. With our design of the membership tree, only a small number of nodes need to maintain state for the group, and this number grows slowly. In comparison, with CBT, the number of on-tree routers grows much more rapidly as more first-hop routers join the tree. MT thus achieves significant state reduction over CBT. The state efficiency of MAD is close to that of MT because 90% of all groups are inactive and maintain only the efficient MT.

[Figure 10 plot. x-axis: number of first-hops. y-axis: number of nodes. Curves: MAD, MT, CBT.]

Figure 10: Number of nodes keeping tree state

Forwarding efficiency: Figure 11 shows the average latency of delivering a data message from the core to all the member routers in a tree. CBT achieves much lower message delivery latency than MT. This is not surprising because MT is designed primarily for state efficiency, not for forwarding efficiency. Meanwhile, the delivery latency of MAD is very close to that of CBT, because most of the data traffic comes from active groups and is thus delivered via CBT in the MAD protocol. Recall that we assume active groups contribute 75% of all the data traffic.

[Figure 11 plot. x-axis: number of first-hops. y-axis: delivery latency (msec). Curves: MAD, MT, CBT.]

Figure 11: Average message delivery latency

Mode transition cost: We measure the cost of mode transition from MT to CBT (as an inactive group becomes active) in terms of the transition latency (i.e., the time it takes for all nodes in a group to complete mode transition) and the message duplication ratio (i.e., the ratio between the number of duplicate messages and the total number of distinct messages received during the transition period). Note that mode transition from CBT to MT is almost instantaneous and does not result in any duplicated messages (see Section 3.4).

Figure 12(a) shows the average transition latency from MT to CBT as the number of first-hop routers in the group increases. The error bars represent the minimum and maximum transition latencies observed in our experiments. It is evident that the transition latency is very low: the average transition latency is close to 1 second and the worst-case transition latency never exceeds 1.8 seconds in all our experiments. Figure 12(b) shows the average, minimum, and maximum message duplication ratio during mode transition. The message duplication ratio is less than 1 because as soon as a first-hop router receives the first message from the CBT, it informs its parent in the MT to suppress all future duplicate messages. Therefore, during the entire transition period, only those messages that are sent before the suppression occurs will be received in duplicate. Since the transition latency is quite low, the overhead due to message duplication is acceptable (especially given the increased forwarding efficiency after the mode transition completes).

[Figure 12 plots, panels (a) Transition latency, y-axis: switching time (msec), and (b) Message duplication ratio, y-axis: duplication ratio. x-axis: number of first-hops.]

Figure 12: Cost of mode transition from MT to CBT

Summary: Our experimental results clearly demonstrate that MAD achieves both the high state efficiency of MT and the high forwarding efficiency of CBT. Meanwhile, such a benefit comes at a low cost: the mode transition between MT and CBT only takes 1–2 seconds and the overhead due to message duplication during the transition is acceptable.

6. CONCLUSIONS

Information dissemination will be a key functionality of the evolving network infrastructure. Multicast, which uses network and server resources efficiently for this functionality, will be a fundamental building block for information-centric networks. Given the ever-increasing amount of information being disseminated, multicast needs to manage a vast number of groups, with a distinct group for each piece of distributable content. Since group membership may be long-lived, and the criticality of the disseminated information is typically independent of the level of group activity, multicast will need to evolve to support persistent group membership. Supporting these needs raises challenges that are not met by today's IP and overlay multicast technologies.

In this paper, we present our architecture, Multicast with Adaptive Dual-state (MAD), for providing efficient multicast service at massive scale. MAD can scalably support a vast number of multicast groups, with persistent group membership, based on two key novel ideas. First, we decouple group membership state from the state needed for efficient forwarding of information. Group membership state is maintained scalably in a distributed fashion using a hierarchical membership tree (MT). Second, we apply an adaptive dual-state approach to optimize for the different objectives of active and inactive groups. Active groups use IP multicast-style dissemination trees (DT) for efficient data forwarding while inactive groups use their MTs for this purpose, without adversely affecting the overall forwarding efficiency. We avoid traffic concentration for the DTs by having a root that is selected on a per-multicast-group basis, and share the same root for the MT as well. We described the MAD protocol in detail, with particular emphasis on the seamless transition of a group between the use of the DT and the MT for forwarding of information, and on failure recovery.

We examined the scaling characteristics of the MAD protocol through analysis, simulation, and a prototype implementation. Our analysis demonstrates the scalability of MAD in maintaining group membership state at very large scale. We also performed simulations with realistic network topologies to show that MAD achieves the "best of both worlds", obtaining the information forwarding efficiency of DT and the scalability of state maintenance with MT for supporting a vast number of groups. Finally, we developed a prototype implementation of MAD using the FreePastry infrastructure. Using Emulab, we emulated a reasonable network topology (representing the "sprintlink-us" topology from Rocketfuel), and demonstrated the feasibility of scaling the MAD architecture.

7. REFERENCES

[1] A. J. Ballardie. Core based trees (CBT version 2) multicast routing: Protocol specification. RFC 2189, 1997.
[2] A. J. Ballardie, P. Francis, and J. Crowcroft. Core based trees. In Proc. SIGCOMM, 1993.
[3] R. Boivie, N. Feldman, Y. Imai, W. Livens, D. Ooms, and O. Paridaens. Explicit multicast (Xcast) basic specification. IETF Internet draft, 2000.
[4] A. Bozdog, R. van Renesse, and D. Dumitriu. Selectcast: a scalable and self-repairing multicast overlay routing facility. In Proc. ACM workshop on survivable and self-regenerative systems, 2003.
[5] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A large-scale and decentralized application-level multicast infrastructure. IEEE J. on Selected Areas in Comm., 2002.
[6] L. HMK Costa, S. Fdida, and O. CMB Duarte. Hop-by-hop multicast routing protocol. In Proc. SIGCOMM, 2001.
[7] J. Cui, M. Gerla, K. Boussetta, M. Faloutsos, A. Fei, J. Kim, and D. Maggiorini. Aggregated multicast: A scheme to reduce multicast states. IETF Internet draft, 2002.
[8] D. Estrin, D. Farinacci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei. Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification. RFC 2362, 1998.
[9] H. Liu, V. Ramasubramanian, and E. G. Sirer. A Measurement Study of RSS, a Publish-Subscribe System for Web Micronews. In Proc. IMC, 2005.
[10] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. Inferring link weights using end-to-end measurements. In Proc. IMW, 2002.
[11] S. Ratnasamy, A. Ermolinskiy, and S. Shenker. Revisiting IP multicast. In Proc. SIGCOMM, 2006.
[12] A. Rowstron and P. Druschel. Store management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proc. SOSP, 2001.
[13] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.
[14] N. Spring, R. Mahajan, D. Wetherall, and T. Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Trans. Netw., 2004.
[15] I. Stoica, T. S. E. Ng, and H. Zhang. REUNITE: A recursive unicast approach to multicast. In Proc. INFOCOM, 2000.
[16] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications. In Proc. SIGCOMM, 2001.
[17] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In Proc. OSDI, 2002.
[18] B. Zhang, T. S. E. Ng, A. Nandi, R. Riedi, P. Druschel, and G. Wang. Measurement-based analysis, modeling, and synthesis of the Internet delay space. In Proc. IMC, 2006.

APPENDIX

A. MAD DATA STRUCTURES & MESSAGES

Table 2 summarizes the example data structures of a MAD router involved in 4 groups, and Figure 13 enumerates the message types used in the protocol.

                    MT                  DT        transition state
gid   mode       mtChild  mtFHs     dtChild   trChild  trFHs
0011  active     0101     f1, f2    1001      –        –
1100  active     0101     f3        1101      –        –
0110  inactive   0000     f2        –         –        –
0111  transient  1001     f4, f5    –         1000     f5

Table 2: Example data structure

 1  MtJoinMsg (group, FH router)
 2  MtJoinAckMsg (group, FH router)
 3  MtJoinNackMsg (group, FH router)
 4  MtLeaveMsg (group, FH router)
 5  MtNodeCreateMsg (group, FH router list)
 6  MtNodeDeleteMsg (group, FH router list)
 7  DtBuildMsg (group, FH router list)
 8  DtJoinMsg (group, FH router)
 9  DtJoinCompleteMsg (group, FH router)
10  DtLeaveMsg (group, FH router)
11  MadDataMsg (group, tree (MT/DT), flush)

Figure 13: List of MAD protocol messages
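To make the per-router state in Table 2 more concrete, the following Python sketch models one table row as a record and lists the message types of Figure 13 as an enumeration. It is a hedged illustration, not the authors' Java implementation: the class and field names are invented here, the enum values simply follow the order of Figure 13 and are not claimed to be wire codes, and reading bit-vectors from the left (leftmost bit is child 0) is an assumed convention.

```python
from dataclasses import dataclass, field
from enum import Enum, IntEnum
from typing import Dict, List, Optional


class Mode(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    TRANSIENT = "transient"


class MsgType(IntEnum):
    # Values mirror the numbering in Figure 13 (illustrative, not wire codes).
    MT_JOIN = 1
    MT_JOIN_ACK = 2
    MT_JOIN_NACK = 3
    MT_LEAVE = 4
    MT_NODE_CREATE = 5
    MT_NODE_DELETE = 6
    DT_BUILD = 7
    DT_JOIN = 8
    DT_JOIN_COMPLETE = 9
    DT_LEAVE = 10
    MAD_DATA = 11


@dataclass
class GroupState:
    """One row of Table 2; None stands for the '-' entries."""
    mode: Mode
    mt_child: int                      # bit-vector of MT children
    mt_fhs: List[str]                  # first-hop routers stored at this MT node
    dt_child: Optional[int] = None     # bit-vector of DT children (active groups)
    tr_child: Optional[int] = None     # transition-state children
    tr_fhs: List[str] = field(default_factory=list)


def set_bits(vec: int, width: int = 4) -> List[int]:
    """Positions of the 1-bits, counted from the left (assumed convention)."""
    return [i for i in range(width) if (vec >> (width - 1 - i)) & 1]


# The four example groups of Table 2, keyed by gid (4-bit values as in the table).
table: Dict[int, GroupState] = {
    0b0011: GroupState(Mode.ACTIVE, 0b0101, ["f1", "f2"], dt_child=0b1001),
    0b1100: GroupState(Mode.ACTIVE, 0b0101, ["f3"], dt_child=0b1101),
    0b0110: GroupState(Mode.INACTIVE, 0b0000, ["f2"]),
    0b0111: GroupState(Mode.TRANSIENT, 0b1001, ["f4", "f5"],
                       tr_child=0b1000, tr_fhs=["f5"]),
}

for gid, st in table.items():
    print(format(gid, "04b"), st.mode.value, set_bits(st.mt_child))
```

Table 2 uses 4-bit vectors for readability; per Section 4.2 a deployed node would use a 16-bit mtChild and a 32-bit dtChild, which the `width` parameter of the hypothetical `set_bits` helper is meant to accommodate.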
