Ajay Kumar
VMware Inc.
ajayk@vmware.com
Abstract
Traditionally, concerns about reliability, scalability, and security have resulted in poor adoption of IP multicast in the Internet. However, data center networks, with their structured topologies and tighter control, present an opportunity to address these concerns. In this paper, I present VSRT—a software-defined networking (SDN) based system that enables multicast in commodity switches used in data centers. As part of VSRT, I develop a new multicast routing algorithm called Very Secure Reduced Tree (VSRT). VSRT attempts to minimize the size of the routing tree it creates for any given multicast group. In typical data center topologies such as Tree and FatTree, VSRT reduces to an optimal routing algorithm that solves the Steiner Tree problem. VSRT leverages SDN to take advantage of the rich path diversity commonly available in data center networks and thereby achieves highly efficient bandwidth utilization. I implement VSRT as an OpenFlow controller module. My emulation of VSRT with Mininet Hi-Fi shows that it improves application data rate by up to 12% and lowers packet loss by 51%, on average, compared to IP multicast. I also build a simulator to evaluate VSRT at scale. For the PortLand FatTree topology, VSRT results in at least a 35% reduction, compared to IP multicast, in the number of links that are less than 5% utilized, when the number of multicast groups exceeds 1,000. My results confirm that VSRT builds smaller trees than traditional IP multicast routing.
1. Introduction
Group communication is extensively used in modern data centers. Some examples include Apache Hadoop [1], which uses data replication for higher availability; clustered application servers [2], which require state synchronization; and cloud environments, which require OS and application image installation on a group of virtual machines (VMs). Multicast lends itself naturally to these communication patterns. IP multicast, which has been in existence for several years, is the most common multicast implementation for traditional networks. It is prudent to carefully consider the adoption of IP multicast in data centers.
Traditional IP multicast has remained largely undeployed in the Internet owing to concerns about reliability, scalability, and security. Network protocols such as TCP, which is the de facto standard for reliable unicast, incur significant latencies when applied to multicast. Ymir et al. [5] have studied application throughput for data centers that use TCP for reliable multicast. Address aggregatability, a mechanism for reducing unicast forwarding state in switches, is not feasible with IP multicast addresses. This leads to switch state explosion as the number of multicast groups scales up. IP multicast allows any host to subscribe to a multicast group and start receiving group traffic. Security, therefore, becomes very difficult to enforce. Data center networks with their structured topologies and tighter control present an opportunity to address these concerns. As a result, there has been renewed interest in multicast with specific focus on data centers. In [6], the authors propose reliable data center multicast. They leverage the rich path diversity available in data center networks to build backup overlays. In [7], the authors use multiclass bloom filters to compress multicast forwarding state. Security, though, remains a concern. There are also some additional concerns specific to the adoption of IP multicast in data center networks. IP multicast is not designed to take advantage of path diversity, which, unlike in traditional IP networks, is an integral part of data center networks.
This is likely to result in poor bandwidth utilization. Additionally, IP multicast routing algorithms, of which Protocol Independent Multicast – Sparse Mode (PIM-SM) is the most common, are not designed to build optimal routing trees. PIM-SM builds trees rooted either at the source of the multicast group or at a predetermined rendezvous point (RP) for the group. Optimal tree building is equivalent to solving the Steiner Tree [8] problem. For arbitrary graphs, which is how traditional IP networks are typically modeled, the Steiner Tree problem is known to be NP-complete. In structured graphs like those found in data center networks, however, it is possible to build Steiner Trees in polynomial time for some topologies. Consequently, it is possible to build optimal or near-optimal routing trees.
The rapid emergence of SDN, which has strong industry backing, provides the perfect opportunity for innovating multicast in data centers to address the aforementioned concerns. The SDN architecture uses a centralized control plane that enables centralized admission control and policy enforcement, thereby alleviating security concerns. It also provides global visibility, as opposed to the localized switch-level visibility in traditional IP networks, thereby enabling greater intelligence in network algorithms. Multicast routing algorithms can thus leverage topology information to build optimal routing trees, and can leverage link utilization state to efficiently exploit the path diversity typically available in data centers. Lastly, it is important to note that not all commodity switches used in data center networks have IP multicast support. SDN can be leveraged to enable multicast in such commodity switches.
In this context, I present VSRT—an SDN-based system that enables multicast in commodity switches used in data centers. VSRT leverages the centralized visibility and control of SDN to realize secure, bandwidth-efficient multicast in switches that do not have any inbuilt IP multicast support. VSRT, like IP multicast, supports dynamic joins of multicast members. However, unlike IP multicast, it is also able to deny admission to a member based on predefined policies. As part of the VSRT system, I develop a new multicast routing algorithm, also named VSRT. VSRT attempts to minimize the size of the routing tree created for each group. Whenever a new member joins a group, VSRT attempts to attach it to the existing tree at the nearest attachment point. For typical data center topologies such as Tree and FatTree, VSRT reduces to an optimal multicast routing (Steiner Tree building) algorithm that can be executed in polynomial time. In this paper, I make the following contributions:
- Detailed design of VSRT
- Implementation of VSRT as an OpenFlow controller module
- Emulation of VSRT for application performance benchmarking
- Design and implementation of a simulator to evaluate VSRT at scale
The rest of this paper is organized as follows. Section 2 discusses the motivation behind developing SDN-based multicast for data centers. Section 3 reviews related work and background. In section 4, I present the detailed design and implementation of VSRT. Section 5 describes my experiments and presents both emulation and simulation results. I end with a conclusion in section 6. For the interested reader, an appendix provides my proof showing that the Steiner Tree can be computed in polynomial time for Tree and FatTree topologies.
2. Motivation
Multicast can greatly benefit modern data centers by saving network bandwidth and improving application throughput for group communications. Because IP multicast has been in existence for several years, it is logical to consider the adoption of IP multicast in data centers. However, as outlined in section 1, there are still unresolved concerns about security, path diversity utilization, and routing tree formation that make the adoption of IP multicast prohibitive. In this work, I identify SDN as the architecture that is capable of addressing these concerns, as detailed below.
2.1 Security
SDN uses a centralized control plane. In an SDN network, when a new member sends a request to join a multicast group, the request is forwarded to the control plane. The control plane can either admit this new member and appropriately modify forwarding rules in switches, or deny admission to the member based on predefined policies. In this manner, SDN-based multicast can enable centralized admission control and policy enforcement, thereby alleviating security concerns.
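As a sketch, such a centralized policy check can be as simple as a per-group allow-list that the controller consults before installing any forwarding rules. The group addresses, host addresses, and policy format below are hypothetical, not part of VSRT's actual interface.

```python
# Hypothetical admission-control policy: a per-group allow-list of host IPs.
# The controller consults this before modifying forwarding rules in switches.
ALLOWED = {
    "239.1.1.1": {"10.0.0.1", "10.0.0.4", "10.0.0.7"},    # tenant 1's group
    "239.1.1.2": {"10.0.0.4", "10.0.0.10", "10.0.0.14"},  # tenant 2's group
}

def admit(group: str, host: str) -> bool:
    """Return True only if `host` is permitted to join `group`."""
    return host in ALLOWED.get(group, set())
```

A join request from a host outside the allow-list is simply dropped, so the host never receives group traffic.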
2.2 Path Diversity Utilization
In data center topologies with path diversity, there are multiple, often equal-length, paths between any given hosts. Ideally, for efficient bandwidth utilization, different multicast trees should be spread out across different paths. Traditional IP networks build multicast trees based on localized switch-level views stored in the form of address-based routing table entries, as explained in section 2.3 (Routing Tree Formation). This results in many of the same links being used for different trees, while at the same time leaving many links unused. SDN, on the other hand, can leverage global visibility to take advantage of path diversity and make different multicast groups use different routing trees. This leads to a more even distribution of traffic across all links and avoids congestion or oversubscription of links.
2.3 Routing Tree Formation
PIM-SM, the most common IP multicast routing protocol, builds a multicast routing tree by choosing a node in the network as the RP for each group and connecting all group members to this RP. PIM-SM relies on IP unicast routing tables, which are based on localized switch-level views, to find a path from the member to the RP. This results in RP-rooted trees that can be nonoptimal. PIM-SM, for high data rates, provides the option for each member to directly connect to the source. In such a case, instead of RP-rooted trees, there is a combination of source-rooted and RP-rooted trees. This is still likely to be nonoptimal, because each member is routed to a specific node (RP or source) on the existing tree, as opposed to being routed to the nearest intersection on the existing tree. SDN’s global visibility, on the other hand, can be leveraged to build near-optimal routing trees. Whenever a new member joins a group, instead of finding a path from it to the source or the RP, SDN can find its nearest attachment point to the existing tree. This results in trees that use fewer hops, and in the case of topologies such as Tree and FatTree, reduces to optimal trees.
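The difference between routing each member to a fixed node (the RP) and routing it to the nearest point on the existing tree can be made concrete with a toy example. The sketch below uses a plain BFS over an illustrative six-node topology, not VSRT's level-based search; `tree` is the node set of the existing routing tree, and all names are made up.

```python
from collections import deque

def shortest_path(adj, src, targets):
    """BFS from src; return the first path reaching any node in `targets`."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return None

# Toy topology: new member M can reach the existing tree {S, A, R} at node A,
# one hop closer than the RP that PIM-SM would route it to.
adj = {
    "S": ["A"], "A": ["S", "R", "X", "RP"], "R": ["A"],
    "X": ["A", "M"], "M": ["X"], "RP": ["A"],
}
tree = {"S", "A", "R"}
```

Routing M to the tree (`shortest_path(adj, "M", tree)`) yields a shorter attachment path than routing M to the RP (`shortest_path(adj, "M", {"RP"})`), which is exactly the saving the nearest-attachment strategy targets.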

Figure 1. Motivating Example
This is explained with the help of the example in Figure 1 (a) and (b). These figures show an irregular data center topology—increasingly common in unplanned data centers—that is a combination of Tree and Jellyfish [10] topologies. They show two multicast groups. The first group comprises Tenant 1, whose VMs reside on hosts {H1, H5, H11, H15}. The second group comprises Tenant 2, whose VMs reside on hosts {H4, H10, H14}. H8 hosts a suspicious tenant that wants to hack into Tenant 2. Figure 1(a) shows the outcome of using IP multicast. For each of the two multicast groups, I assume that PIM-SM chooses the core switch C as the RP. This is reasonable because C is equidistant from all members in either group. The routing trees built for the two groups are shown by the dashed and solid lines, respectively. As can be seen, reliance on unicast routing tables to connect each node to the RP leads to the same links being used for each tree.
Note that in the above discussion, I have taken for granted that the switches in question support IP multicast. Many commodity switches used in data centers have no off-the-shelf IP multicast support. Most importantly, VSRT enables multicast in such switches in the first place.
3. Related Work and Background
Today, the majority of Internet applications rely on point-to-point transmission. Utilization of point-to-multipoint transmission has traditionally been limited to LAN applications. Over the past few years, however, the Internet has seen a rise in the number of new applications that rely on multicast transmission.
Mininet is a network emulator. It runs a collection of end-hosts, switches, routers, and links on a single Linux kernel. It uses lightweight virtualization to make a single system look like a complete network, running the same kernel, system, and user code. A Mininet host behaves just like a real machine [8]. Cisco Packet Tracer [3] is a powerful network-simulation program that enables experimentation with network behavior. Packet Tracer provides simulation, visualization, authoring, assessment, and collaboration capabilities to facilitate the teaching and learning of complex technology concepts. I used it to perform tests with routers and create networks.
3.1 Reducing Network Load
Assume that a stock-ticker application is required to transmit packets to 100 stations within an organization’s network [2]. Unicast transmission to the group of stations will require the periodic transmission of 100 packets, and many packets might be required to traverse the same link(s). Multicast transmission is the ideal solution for this type of application, because it requires only a single packet transmission by the source, which is then replicated at forks in the multicast delivery tree. Broadcast transmission is not an effective solution for this type of application, because it affects the CPU performance of every end station that sees the packet, and it wastes bandwidth [6].
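The saving in the stock-ticker example is easy to quantify. The helper below is purely illustrative: it counts only the packets the source itself must emit per update, ignoring in-network replication costs.

```python
def source_packets(receivers: int, multicast: bool = False) -> int:
    """Packets the source must emit per update: one copy per receiver with
    unicast, a single packet with multicast (replication happens at forks
    in the delivery tree, inside the network)."""
    return 1 if multicast else receivers
```

For the 100-station group above, unicast costs the source 100 transmissions per update, while multicast costs it exactly one.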
3.2 Resource Discovery
Some applications implement multicast group addresses instead of broadcasts to transmit packets to group members residing on the same network. However, there is no reason to limit the extent of a multicast transmission to a single LAN. The time-to-live (TTL) field in the IP header can be used to limit the range (or “scope”) of a multicast transmission [9].
3.3 Multicast Forwarding Algorithms
A multicast routing protocol is responsible for the construction of multicast packet-delivery trees and for performing multicast packet forwarding. The Cisco Packet Tracer tool [3] can be used for testing such algorithms; I experimented with several existing algorithms before designing VSRT. This section explores a number of different algorithms that can potentially be employed by multicast routing protocols:
- Flooding
- Spanning trees
- Reverse Path Broadcasting (RPB)
- Reverse Path Multicasting (RPM)
The algorithms above are implemented in the most prevalent multicast routing protocols in the Internet today:
- Distance Vector Multicast Routing Protocol (DVMRP)
- Multicast Open Shortest Path First (MOSPF)
- Protocol-Independent Multicast (PIM)
3.4 RPM
RPM is an enhancement to RPB and Truncated RPB. RPM creates a delivery tree that spans only:
- Subnetworks with group members
- Routers and subnetworks along the shortest path to subnetworks with group members
RPM allows the source-rooted spanning tree to be pruned so that datagrams are only forwarded along branches that lead to members of the destination group [7].
3.5 Network Simulator (Ns-2)
Ns-2 is a discrete event simulator targeted at networking research [11]. Ns-2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. Nam is a Tcl/Tk-based animation tool for viewing network simulation traces and real-world packet traces; it is mainly intended as a companion animator to the Ns simulator. REAL is a network simulator originally intended for studying the dynamic behavior of flow and congestion control schemes. Any of these tools can be used to test the VSRT algorithm.
4. Design and Implementation
VSRT is designed to achieve the following goals:
- Efficiently utilize path diversity
- Enforce admission control
- Build near-optimal multicast trees
- Enable multicast support in commodity SDN switches
- Be easily deployable
4.1 VSRT Algorithm
VSRT is a polynomial-time algorithm that builds a routing tree by attempting to attach each new group member to the existing tree at the nearest intersection. Instead of trying to find the shortest path from this member to a specific node, as PIM-SM does, VSRT tries to find the shortest path to the existing tree. In theory, this can be trivially accomplished by computing the shortest path from the new member to each node on the existing tree; however, that is computationally prohibitive. VSRT instead performs this attachment using a method that completes in polynomial time. Although in theory VSRT might not always be able to find the best attachment point for all topologies, in practice it does so with high probability for most topologies. Specifically, for Tree and FatTree topologies, it does so with probability 1. If VSRT is unable to find the optimal path, it still finds a path that is at least as short as that found by PIM-SM.
VSRT first assigns a level to all nodes in the network. This level classifies the node's distance, in number of hops, from a physical server. Thus, all physical servers are assigned level 0, all top-of-rack switches (ToRs) are assigned level 1, and so on. While creating the routing tree for a group, VSRT iterates through the group members one by one and attaches them to the tree. In this regard, the tree created is a function of the order in which members appear. Regardless of the ordering, though, the tree created is near-optimal and at least as small as that created by PIM-SM. Optionally, after the group reaches a steady state in terms of number of subscribers, a steady-state tree can be reconstructed, chosen as the smallest tree obtained from all possible orderings. In my system, I have not implemented steady-state tree reconstruction, because the trees created in the first attempt efficiently satisfy all design goals.
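The level assignment can be computed with a multi-source BFS. This is a sketch under the assumption that a node's level is simply its hop distance to the nearest physical server; the two-rack adjacency structure and node names below are illustrative.

```python
from collections import deque

def assign_levels(adj, servers):
    """Multi-source BFS: each node's level is its hop distance from the
    nearest physical server (servers level 0, ToRs level 1, and so on)."""
    level = {s: 0 for s in servers}
    q = deque(servers)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return level

# Illustrative two-rack topology: hosts under two ToRs, one aggregation switch.
adj = {
    "h1": ["tor1"], "h2": ["tor1"], "h3": ["tor2"], "h4": ["tor2"],
    "tor1": ["h1", "h2", "agg"], "tor2": ["h3", "h4", "agg"],
    "agg": ["tor1", "tor2"],
}
lvl = assign_levels(adj, ["h1", "h2", "h3", "h4"])
```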
Tree building begins when there are at least two members in the group. To connect the first two members, the algorithm chooses the shortest path between them. Subsequently, whenever a new member appears, the algorithm tries to find its nearest intersection with the existing tree. To do so, it first checks whether any of the member's adjacent nodes reside on the existing tree. Thus, when a new member, which is by definition a level 0 node, appears, all its adjacent (level 1) nodes are checked. If any of these nodes already resides on the existing tree, the new member is simply attached to the tree at this point. If none of these adjacent nodes lies on the tree, the algorithm then looks at all neighbors (level 0, level 1, and level 2) of the adjacent nodes. If any of these neighboring nodes lies on the existing tree, the algorithm attaches the new member to the tree at this point.
If neither this new member’s adjacent nodes nor their neighbors lie on the existing tree, then one of the member’s adjacent nodes at the next higher level is randomly chosen.
Note that in this case, the new member has not yet been attached to the tree, so the algorithm continues. Next, this chosen adjacent node is set as the current node. Now, its adjacent nodes (some of which would have already been examined in the previous iteration) and their neighbors are examined to see if any falls on the existing tree. If any of them does, the new member is connected to the tree at this node by tracing the path chosen from the new member onward.

Figure 2. VSRT: Routing Tree Formation
If, on the other hand, none of them lies on the existing tree, the algorithm continues by randomly choosing one of the current node's adjacent nodes at the next higher level. This chosen node is now set as the current node. In this manner, the algorithm continues either until the new member is connected to the existing tree or until the level of the current node reaches the highest level in the topology. If the algorithm has reached the highest level and has still been unable to attach the new member to the tree, it resorts to a breadth-first search (BFS) with a stop condition that terminates as soon as an attachment point to the tree is found. For typical data center topologies, which are characterized by rich path diversity and large numbers of edges at higher levels, it is unlikely that the algorithm would reach a highest-level node without attaching the new member to the tree. At every iteration in which the algorithm is unable to attach the member, it randomly moves to a higher level, thereby increasing its chances of finding an attachment point (owing to the larger number of edges at the higher level) in the next iteration. If the algorithm randomly selects a node that is headed away from the tree, in the next iteration a random selection is once again likely to head it back toward the tree. From the perspective of building routing trees for different multicast groups, this approach of randomly selecting higher-level nodes also contributes to better utilization of the available path diversity, leading to more balanced link utilizations. If the algorithm does indeed reach the highest level without converging, which as mentioned above is unlikely, the overhead incurred from this unsuccessful search is very small, because typical data center topologies are only three to four levels high. This is computationally far less expensive than using BFS for each new member.
Additionally, as demonstrated in section 5, this approach still results in smaller routing trees than PIM-SM. Specifically, for Tree and FatTree topologies, this algorithm always finds the optimal attachment point for each new member without needing to resort to BFS.
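The search procedure above can be sketched in Python. This is a simplified reading of the algorithm, not the controller implementation: the topology is a plain adjacency dict, `tree` is the node set of the existing routing tree, levels come from the hop-distance assignment described earlier, and tie-breaking among higher-level candidates is random, as described.

```python
import random
from collections import deque

def attach(adj, level, tree, member, rng=random):
    """Find a path attaching `member` to the existing `tree` (a node set)."""
    path = [member]
    current = member
    top = max(level.values())
    while True:
        # Step 1: is any node adjacent to the current node already on the tree?
        for n in adj[current]:
            if n in tree:
                return path + [n]
        # Step 2: is any neighbor of those adjacent nodes on the tree?
        for n in adj[current]:
            for m in adj[n]:
                if m in tree:
                    return path + [n, m]
        # Step 3: otherwise move up, randomly choosing an adjacent node
        # one level higher than the current node.
        ups = [n for n in adj[current] if level[n] == level[current] + 1]
        if not ups or level[current] >= top:
            break
        current = rng.choice(ups)
        path.append(current)
    # Fallback (rare in practice): BFS that stops at the first tree node.
    parent = {member: None}
    q = deque([member])
    while q:
        u = q.popleft()
        if u in tree:
            out = []
            while u is not None:
                out.append(u)
                u = parent[u]
            return out[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return None

# Toy two-rack topology (hosts level 0, ToRs level 1, one aggregation, level 2).
adj = {
    "h1": ["tor1"], "h2": ["tor1"], "h3": ["tor2"], "h4": ["tor2"],
    "tor1": ["h1", "h2", "agg"], "tor2": ["h3", "h4", "agg"],
    "agg": ["tor1", "tor2"],
}
lvl = {"h1": 0, "h2": 0, "h3": 0, "h4": 0, "tor1": 1, "tor2": 1, "agg": 2}
```

With the existing tree {h1, tor1}, member h2 attaches directly at tor1 (step 1), while h3 finds no attachment at its rack, moves up to tor2, and reaches the tree via agg (step 2).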
I explain VSRT with the help of an example in Figure 2. The example demonstrates how a tree is constructed as new members join a multicast group. Initially, there is one sender S and one receiver R1. The tree is constructed by choosing the shortest path from R1 to S. Subsequently, a receiver R2 also subscribes to the multicast group. None of this receiver's adjacent nodes is on the tree, nor are the neighbors of these adjacent nodes. It has only one adjacent node, a level 1 node, which is therefore chosen by default as the node that will lead this member to the tree. Next, setting this level 1 node as the current node, VSRT looks at all its adjacent nodes as well as at their neighbors. Again, this level 1 node has only one neighboring level 2 node, so it is chosen by default. Now this level 2 node becomes the current node, and VSRT looks at its adjacent nodes. None of its adjacent nodes is on the tree. However, at least one of the neighbors of one of these adjacent nodes is on the tree. This adjacent node is the level 2 node located horizontally to the left of the current (level 2) node. Thus, VSRT selects this adjacent node and attaches the new member to the existing tree at this adjacent node's neighbor (the level 3 node marked by a *). Finally, a last receiver, R3, arrives. In the first iteration for R3, VSRT chooses the level 1 node immediately adjacent to it because there is no other choice. In the next iteration, with this level 1 node set as the current node, VSRT first looks at its adjacent nodes. As it turns out, one of its adjacent nodes (the level 2 node marked by a *) is on the tree, so R3 attaches to the tree at this node.
4.2 VSRT System Implementation
I implement VSRT as an OpenFlow controller module, using the OpenDaylight SDN platform, as outlined in Figure 3. The VSRT module listens for subscription requests and topology changes from the network, and dynamically updates the appropriate multicast routing trees. It registers with the IListenDataPacket service of OpenDaylight to indicate that it wants to receive all data packets sent to the IDataPacketService of the controller. In my implementation, I adopt the IP multicast addressing scheme, so hosts still send subscription requests through Internet Group Management Protocol (IGMP) packets. VSRT implements IGMP snooping to learn about hosts that want to subscribe to or unsubscribe from a multicast group. On receiving an IGMP packet, VSRT finds the address of the host as well as the multicast group it wants to join or leave. Subsequently, it examines security policies to ensure that this member can be admitted and, if so, updates the multicast routing tree using VSRT. When a multicast group sender that has not subscribed to the group yet (because senders do not send IGMP packets) starts sending multicast traffic, the controller is notified. Subsequently, VSRT automatically adds the sender to the multicast tree, once again assuming policies permit this. VSRT also appropriately modifies routing trees whenever a topology change is registered from the ITopologyManager in OpenDaylight. Any time VSRT needs to update the routing tree, it effectively must add, delete, or modify routing rules in appropriate switches. This is done through the IForwardingRulesManager. The OpenDaylight controller's service abstraction layer uses a southbound plug-in to communicate with network elements. Currently, OpenDaylight has only one southbound plug-in, which supports the OpenFlow v1.0 protocol. VSRT can be completely implemented using the features provided by OpenFlow v1.0, and it can work with higher versions of OpenFlow as well.
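The IGMP snooping step can be sketched as follows. This assumes IGMPv2 messages (8 bytes: type, max response time, checksum, group address) and shows only the parsing; the OpenDaylight packet-service plumbing, which is Java-based, is omitted. The message-type values are the standard IGMPv2 codes.

```python
import struct

# Standard IGMPv2 message types (RFC 2236).
IGMP_V2_REPORT = 0x16  # membership report: host joins a group
IGMP_V2_LEAVE = 0x17   # leave-group message: host leaves a group

def parse_igmpv2(payload: bytes):
    """Return ('join' | 'leave' | None, dotted group address) for an
    IGMPv2 payload: type (1 B), max resp time (1 B), checksum (2 B),
    group address (4 B)."""
    msg_type, _max_resp, _checksum = struct.unpack("!BBH", payload[:4])
    group = ".".join(str(b) for b in payload[4:8])
    if msg_type == IGMP_V2_REPORT:
        return "join", group
    if msg_type == IGMP_V2_LEAVE:
        return "leave", group
    return None, group
```

Once the (action, group) pair is known, the controller checks admission policy and then updates the routing tree accordingly.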

Figure 3. Architecture of the VSRT OpenDaylight Module
5. Results
5.1 Emulator
To validate and evaluate my implementation of VSRT, I used Mininet Hi-Fi [11], an OpenFlow v1.0 enabled network emulation platform. I created a Mininet network topology and connected it to the OpenDaylight controller. For this emulation, I chose a FatTree topology, as shown in Figure 4. FatTree is a common data center topology that has sufficient path diversity to highlight the benefits of VSRT. My topology comprises 24 hosts, 6 ToR switches, 6 aggregation switches, and 2 core switches. The link capacity for each link in this network is set to 10Mbps. For performance benchmarking, I use Iperf [12].
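As a rough model of the emulated network, the snippet below builds a topology of the same size (24 hosts, 6 ToR, 6 aggregation, and 2 core switches) as plain adjacency lists rather than through the Mininet API. Since Figure 4 is not reproduced here, the exact wiring is an assumption: three pods of 2 ToRs and 2 aggregation switches each, with every aggregation switch uplinked to both cores.

```python
def build_topology(hosts_per_tor=4, n_pods=3):
    """Assumed wiring: each pod has 2 ToRs and 2 aggregation switches,
    fully meshed within the pod; every aggregation switch uplinks to
    both core switches. Returns an adjacency-list dict."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    cores = ["c0", "c1"]
    host = 0
    for p in range(n_pods):
        tors = [f"t{p}{i}" for i in range(2)]
        aggs = [f"a{p}{i}" for i in range(2)]
        for t in tors:
            for _ in range(hosts_per_tor):
                host += 1
                link(f"h{host}", t)      # host-to-ToR links
            for a in aggs:
                link(t, a)               # ToR-to-aggregation links
        for a in aggs:
            for c in cores:
                link(a, c)               # aggregation-to-core links
    return adj

topo = build_topology()  # 38 nodes: 24 hosts + 6 ToR + 6 agg + 2 core
```

The same counts carry over to a Mininet `Topo` subclass one-for-one, with `link` replaced by `addLink` and nodes added via `addHost`/`addSwitch`.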
Throughout this section, host hx refers to the host with IP address 10.0.0.x in Figure 4. First, I seek to validate the functionality of VSRT by ensuring that it is able to enable multicast in a Mininet network. OVS switches used by Mininet do not have any out-of-the-box multicast support. This is demonstrated in Figure 5. A multicast group is created with hosts h1, h4, and h7. An Iperf client (sender) is started on h1, while Iperf servers (receivers) are started on hosts h4 and h7. As is evident, the servers do not receive any multicast traffic. Next, I include VSRT in the OpenDaylight controller and restart the Iperf client and servers. As shown in Figure 5, multicast has now been enabled. Next, I seek to evaluate VSRT by comparing its Iperf performance with that of IP multicast. Once again, I would like to point out that I adopt the addressing scheme of IP multicast. The Iperf application was written to use the IP addressing scheme, and I merely leverage that for ease of implementation. Using this addressing scheme for VSRT has no bearing on the results. The VSRT system is completely independent of and separate from IP multicast. Because IP multicast is not supported out of the box with OVS switches, I also implement an adaptation of IP multicast for my software-defined environment. Although I still leverage the OpenDaylight controller to create IP multicast routing trees by installing appropriate forwarding rules in switches, there are two important differences in the implementation of IP multicast compared to VSRT. These ensure that my implementation of IP multicast mirrors traditional IP multicast:
First, IP multicast does not leverage the central visibility available to the OpenDaylight controller; it relies on localized switch-level views. Second, IP multicast uses PIM-SM for routing. I tried to incorporate XORP, a routing engine that implements PIM-SM, along with RouteFlow, a service that facilitates communication between the controller and the routing engine. However, the current implementation of RouteFlow is incapable of converting the multicast routing tree generated by XORP into corresponding OpenFlow rules. Hence, in my system, I implement PIM-SM myself. For Iperf performance comparison, I create six random multicast groups of sizes varying from three to six multicast members. The sender in each group uses Iperf in client mode to send multicast traffic at a rate chosen randomly from {2, 4, 6, 8} Mbps. The remaining members of the group use Iperf in UDP server mode to bind to the multicast group. The packet loss percentages associated with VSRT and IP multicast are shown in Figure 5, and data rates are shown in Figure 6. The results from my emulation show that VSRT improves throughput by up to 12% and reduces packet loss by 51% on average, compared to IP multicast.

Figure 4. Mininet Topology Used for Emulating VSRT
5.2 Simulator
To evaluate the performance of VSRT, I built a multicast simulator that comprises the following modules:
- Topology Generation
- VM Placement
- Multicast Group Generation
- VSRT
- IP Multicast

Figure 5. Average Packet Loss Percent

Figure 6. Average Transfer Rate
The simulator first creates a network topology based on user input specifying the total number of servers, the number of servers per rack, and the type of topology. The topology generation module is currently capable of generating two types of topologies: Tree and FatTree. For either type, the topology generator determines the number of switches and arranges them appropriately by creating the necessary host-switch and switch-switch edges. Alternatively, if a topology other than Tree or FatTree needs to be used, it can be supplied as a file to the simulator, bypassing the topology generation module. Next, the simulator prompts the user to specify the number of VMs running in the data center. Each VM is mapped randomly onto one of the servers. All communication is assumed to be between VMs. Next, the simulator asks the user to specify the number of multicast groups that need to be routed. The simulator also allows the user to supply the member VMs for each multicast group, along with the group's associated data rate. If member VMs are not supplied, the simulator randomly chooses between 3 and 20 VMs as the members for each group. It also randomly assigns a data rate (in hundreds of kbps) to each group's traffic. The simulator implements both VSRT and IP multicast. For VSRT, the simulator assumes an SDN environment with centralized visibility into the network and uses VSRT to build multicast trees. For IP multicast, the simulator assumes localized views derived from routing tables stored in switches and uses PIM-SM to build multicast trees. In my simulations, I create both Tree and FatTree topologies. For each topology, I specify 11,520 servers assembled into 40 servers per rack, thereby resulting in 288 racks. This distribution of servers is adopted from PortLand [13].
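The random group-generation step described above can be sketched as follows. The function name and dict layout are illustrative, not the simulator's actual interface; the member-count and rate ranges are those stated in the text.

```python
import random

def generate_groups(n_groups, vm_ids, rng):
    """Pick 3-20 member VMs per group, uniformly at random, and assign each
    group a data rate in hundreds of kbps (100 kbps up to 10 Mbps)."""
    groups = []
    for _ in range(n_groups):
        size = rng.randint(3, 20)
        members = rng.sample(vm_ids, size)
        rate_kbps = rng.randint(1, 100) * 100  # 100 kbps .. 10,000 kbps
        groups.append({"members": members, "rate_kbps": rate_kbps})
    return groups
```

Passing an explicit `random.Random(seed)` keeps simulation runs reproducible across VSRT and IP multicast comparisons.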
The simulator determines the number of switches required as 288 top-of-rack (ToR) switches, 24 aggregation switches, and 1 (for Tree) or 8 (for FatTree) core switches. I specify the number of VMs as 100,000, and the simulator randomly places each VM on the 11,520 servers. Because there is no available real data trace for data center multicast, I let the multicast groups be generated automatically by the simulator. To create these multicast groups, the simulator applies the methodology described in [17]. Additionally, the simulator assigns each multicast group a data rate chosen randomly from the range 100kbps to 10Mbps. Here, I present results from my simulation runs on the FatTree topology only. The results from the Tree topology were similar. For the first simulation run, I set the link capacity for each link to 1Gbps and create 1,000 multicast groups. Figure 5 shows the CDF of link utilizations in the network. The following observations are made from this plot:
- The percentage of unutilized links is 0% in VSRT, while in IP multicast it is 16%.
- The percentage of links that have less than 5% utilization in VSRT is 49%, while in IP multicast it is 65%.
- The maximum link utilization in VSRT is 73%, while that in IP multicast is 301%. IP multicast has 1.5% links with utilization greater than 100%.
The above observations establish that VSRT takes better advantage of the available bandwidth in the network. As the number of multicast groups increases, the inability of traditional IP multicast to utilize path diversity efficiently is magnified even further. For the remaining simulation runs, I vary the number of multicast groups from 100 to 10,000 and increase the link capacity from 1 Gbps to 10 Gbps to accommodate this larger number of multicast groups.
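The per-link utilization statistics reported above can be derived from per-link loads with a small helper. This is a sketch under my own assumptions, not the simulator's code: `utilization_stats` is a hypothetical name, and the input format (a list of per-link loads in bps plus a uniform link capacity) is assumed.

```python
def utilization_stats(link_loads_bps, capacity_bps):
    """Summarize link utilization across the network: fraction of
    unutilized links, fraction under 5% utilization, maximum
    utilization, and fraction oversubscribed (above 100%)."""
    utils = [load / capacity_bps for load in link_loads_bps]
    n = len(utils)
    return {
        "unutilized": sum(u == 0 for u in utils) / n,
        "under_5pct": sum(u < 0.05 for u in utils) / n,
        "max": max(utils),
        "over_100pct": sum(u > 1.0 for u in utils) / n,
    }
```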
6. Conclusion
Reliability, scalability, and security concerns have resulted in IP multicast's poor adoption in traditional networks. Data center networks, with their structured topologies and tighter control, present an opportunity to address these concerns. However, they also introduce new design challenges, such as path diversity utilization and optimal tree formation, that are not critical in traditional networks such as the Internet. In this paper, I presented VSRT, an SDN-based system for enabling multicast in data centers. VSRT leverages the global visibility and centralized control of SDN to create secure and bandwidth-efficient multicast. VSRT implements its own routing algorithm, which creates optimal routing trees for common data center topologies. My implementation of VSRT as an OpenFlow controller module validates its deployability, and my simulation establishes its scalability. I am currently working on incorporating reliability into VSRT and exploring the adoption of reliability protocols such as PGM [18]. I am also working on porting common group communication applications, such as OS image synchronization and Hadoop, to use VSRT.
Appendix: Steiner Tree in Polynomial Time for FatTree

Figure 7. Steiner Tree Building
Steiner Tree Problem
Given a connected undirected graph G = (V, E) and a set of vertices N, find a subtree T of G such that every vertex of N lies on T and the total length of T is as small as possible. The Steiner Tree problem for an arbitrary graph is NP-complete. In this section, I prove that the Steiner Tree problem for FatTree topologies can be solved in polynomial time. Figure 7 shows a FatTree graph with the set of vertices that need to be connected, N, indicated by the black tiles. My proof strategy consists of two steps: 1. Build a tree (in polynomial time) that connects all vertices in N. 2. Show that the tree thus constructed is the Steiner tree.

Owing to the symmetry of FatTree graphs, a given cluster of X nodes at level L connects identically to a given cluster of Y nodes at level (L+1). In other words, there is a mapping X(L) → Y(L+1). Specifically, in the graph shown in Figure 7, clusters of 10 nodes (hosts) at level 0 connect to 1 node (ToR) at level 1, clusters of 8 nodes (ToRs) at level 1 connect to clusters of 2 nodes (aggregations) at level 2, and clusters of 4 nodes (aggregations) at level 2 connect to clusters of 2 nodes (cores) at level 3. For each level L cluster X, one specific node y ∈ Y is chosen from its corresponding level (L+1) cluster Y as the designated node for all group traffic to or from X. The relative orientation of y(L+1) with respect to cluster X(L) is kept identical across every mapping X(L) → Y(L+1). This is shown in Figure 7 with the help of red dots. For every vertex in N, which is a level 0 node, the only choice is a level 1 designated node. The designated level 1 nodes for those level 0 clusters that have at least one multicast group member are shown in red. Next, for each level 1 cluster (there are 4 clusters of level 1 nodes, with 8 nodes in each), the first (left) of the two level 2 nodes is chosen as the designated node.
Finally, for each level 2 node thus chosen, the second (right) level 3 node is chosen as the designated node. The choice of designated node does not matter, as long as the relative orientation of each level (L+1) designated node with respect to its child level L cluster is the same throughout the topology. The tree created by joining all designated nodes connects all group members to one another and is therefore a multicast routing tree. This tree can be constructed in polynomial time: each new member can be connected to the existing tree in bounded time (by following designated nodes upward from the member until the tree is reached), and the number of members is itself bounded. Moreover, the tree connects any given pair of nodes in N by the shortest path between them; therefore, it is the Steiner Tree. Hence, a Steiner Tree can be created in polynomial time for FatTree topologies. Because Tree is a special case of FatTree, it follows as a corollary that Steiner Trees can also be created in polynomial time for Tree topologies.
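The designated-node construction above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `steiner_tree_fattree` is a hypothetical name, and the `designated` map (giving each node's fixed level-(L+1) designated parent, with None at the root) is assumed to have been chosen as described.

```python
def steiner_tree_fattree(members, designated):
    """Build a multicast tree in a FatTree by walking each member up
    through its chain of designated nodes until the growing tree is
    reached. `designated[node]` is the fixed designated parent of a
    node; the root's entry is None. Runs in O(|members| * height)."""
    tree_nodes, tree_edges = set(), set()
    for m in members:
        node, path = m, []
        # Follow designated nodes upward until we hit the existing tree
        # (or the root, for the first member).
        while node not in tree_nodes and designated.get(node) is not None:
            parent = designated[node]
            path.append((node, parent))
            node = parent
        tree_nodes.add(m)
        for u, v in path:
            tree_nodes.update((u, v))
            tree_edges.add((u, v))
    return tree_nodes, tree_edges
```

Because each member only walks a bounded-height chain of designated parents, the whole construction is polynomial in the number of members, matching the argument above.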
References
1. Hadoop. http://hadoop.apache.org/
2. R. Minnich and D. Farber, “Reducing Host Load, Network Load and Latency in a Distributed Shared Memory”
3. CiscoPacketTracer. https://www.netacad.com/web/about-us/cisco-packet-tracer
4. Microsoft Azure. http://www.windowsazure.com/en-us/
5. D. Basin, K. Birman, I. Keidar, and Y. Vigfusson, “Sources of instability in data center multicast,” in Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware. ACM, 2010, pp. 32–37.
6. D. Li, M. Xu, M.-c. Zhao, C. Guo, Y. Zhang, and M.-y. Wu, “RDCM: Reliable data center multicast,” in Proc. IEEE INFOCOM. IEEE, 2011, pp. 56–60.
7. D. Li, H. Cui, Y. Hu, Y. Xia, and X. Wang, “Scalable data center multicast using multi-class Bloom Filter,” in Proc. IEEE International Conference on Network Protocols (ICNP). IEEE, 2011, pp. 266–275.
8. Mininet Hi-Fi. https://github.com/mininet/Mininet.
9. ResourceDiscovery. http://www.jisc.ac.uk/whatwedo/topics/resourcediscovery.aspx
10. A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: Networking data centers randomly,” in Proc. of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI). USENIX Association, 2012, pp. 17–17.
11. The Network Simulator – ns-2 http://www.isi.edu/nsnam/ns/
12. Iperf. http://iperf.fr/
13. R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: A scalable fault-tolerant layer 2 data center network fabric,” in SIGCOMM Computer Communication Review, vol. 39, no. 4. ACM, 2009, pp. 39–50.