A Fast Topology Inference -- A building block for network-aware parallel processing

Tatsuya Shirai, Hideo Saito, Kenjiro Taura
University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
tatsuya@logos.ic.i.utokyo.ac.jp, h saito@logos.ic.i.utokyo.ac.jp, tau@logos.ic.i.utokyo.ac.jp

ABSTRACT
Adapting to the network is the key to achieving high performance for communication-intensive applications, including scientific computing, data-intensive computing, and multicast, especially in Grid environments. This paper investigates an approach of representing the network as a tree of participating hosts and switches that matches or approximates their physical topology, and describes a fast, non-intrusive, and portable algorithm for inferring such a topology. This representation and the proposed inference algorithm serve as a key to building network-aware applications in a portable manner. The algorithm is based solely on RTTs of small packets between end hosts; it does not rely on popular but not universally available protocols such as traceroute and SNMP. Another benefit is that it can handle all layers of the network uniformly, without any a priori knowledge of cluster configurations. The required number of measurements is O(N d) under certain idealizing assumptions made for the purpose of analysis, where N is the number of participating processes and d the diameter of the network, which is usually small in real networks. In our experimental environment, the inference algorithm built a topology of 64 hosts in a single cluster in 4 seconds and that of 256 hosts across 4 clusters in 15 seconds. It is able not only to identify clusters within a Grid, but also to partially identify the Layer 2 topology within a cluster. This is important for optimizing bandwidth-limited operations such as broadcast.
We built several network-aware applications upon the inference system, including efficient bandwidth measurements and long-message broadcasts. The topology is used to schedule as many measurements as possible in parallel without competing on shared links. We were able to build a bandwidth map of 256 hosts across 4 clusters in 27 seconds.

General Terms
Performance, Measurement, Algorithms

Keywords
Topology Inference, Measurements, Broadcast

1. INTRODUCTION
The performance of communication-intensive applications and of many types of collective communications depends on an effective use of the underlying network. This is crucially important in large-scale networks consisting of multiple switches and/or multiple local area networks (Grid environments). For the simplest example, it is vital for a broadcast in Grid environments to coordinate point-to-point communications so as to minimize inter-cluster traffic [14]. Applications or systems that take such factors into account are generally called locality- or network-aware. While there has been a body of literature on performance optimization of collective operations and of individual applications, the proposed techniques are often specific to a single operation, rely on detailed and manually supplied information, or are rather ad hoc. For example, many proposed Grid-enabled MPI systems [1, 11, 12, 13] assume that the system is given configuration information that groups nodes to reflect the localities (network distances) among them. If this grouping needs to be done manually, these systems can be practical only in closed and static environments where all users use a common set of resources that rarely change. For open and dynamic environments, where each user acquires resources from multiple administrative domains and resource pools change frequently, it becomes tedious to supply such information.
Also, these systems often assume there are only two levels of hierarchy; they often distinguish only inter-cluster from intra-cluster links. In reality, a similar problem arises inside a single cluster too, once it has multiple switches. Finally, these systems provide optimized communications only for built-in communication primitives. Even though the system may obtain some knowledge about the environment, the programmer cannot take advantage of it to optimize communication patterns specific to their application, simply because it is not exposed to them.

In this paper we propose a step toward addressing these problems. Specifically, we propose a framework in which the system infers the topology of the network as a tree and presents it to the application, assuming that the routing paths among participating processes in fact form a tree. The heart of our proposal is a portable and non-intrusive algorithm that quickly and automatically infers such a topology. We also demonstrate that a range of network-aware optimizations is possible just by utilizing the inferred topology. The nodes of an inferred tree are either hosts or switches: leaves are hosts and internal nodes are switches.

Categories and Subject Descriptors
C.2 [Computer-Communication Networks]: Distributed Systems

HPDC'07, June 25-29, 2007, Monterey, California, USA.
Copyright 2007 ACM 978-1-59593-673-8/07/0006 ...$5.00.
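To make this representation concrete, such a weighted tree, with hosts as leaves and switches as internal nodes, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation; all names are ours.

```python
class Node:
    """A node of the inferred topology tree: a leaf is a host, an internal node a switch."""
    def __init__(self, name, is_host):
        self.name = name
        self.is_host = is_host
        self.neighbors = {}  # neighbor Node -> edge length (modeled one-way latency)

def connect(a, b, length):
    # An undirected weighted edge of the tree.
    a.neighbors[b] = length
    b.neighbors[a] = length

def path_length(src, dst, visited=None):
    # Total edge length along the unique tree path between two nodes;
    # the predicted RTT between two hosts is twice this value.
    if src is dst:
        return 0.0
    visited = visited or {src}
    for nxt, w in src.neighbors.items():
        if nxt not in visited:
            sub = path_length(nxt, dst, visited | {nxt})
            if sub is not None:
                return w + sub
    return None  # dst not reachable (cannot happen in a connected tree)
```

With two hosts hanging off one switch at edge length 5, the path length between them is 10, so the predicted RTT is 20, mirroring the model described below.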
The inference algorithm only requires contact points (e.g., IP addresses and ports) of participating hosts; in particular, it need not be given any information about switches. An edge weight represents the latency between the two nodes it connects (i.e., a host and a switch, or two switches). We hereafter use the terms edge weight and edge length interchangeably. The assumption that the network of participating processes is a tree is certainly a limitation of our approach, but we believe it gives a good compromise between simplicity and generality. On the simplicity side, trees can be represented more compactly than general graphs, since they have a small number of edges. A tree can also be directly interpreted as giving a hierarchical grouping of nodes, which is convenient for many purposes. By limiting the inferred topologies to trees, our inference algorithm is made simple and fast. On the generality side, the physical topologies of local area networks, Ethernet in particular, are usually trees. In wide-area networks, this limitation loses accuracy when routing paths among participating processes are in fact redundant; if a computation involves only a few clusters, however, it is common that the routing paths among processes are indeed trees. The inference algorithm is based solely on RTTs of small packets between end hosts. Internal nodes of the tree are automatically discovered (or, more precisely, inferred to be present). In particular, we deliberately do not assume any administrative convention (e.g., hostnames or IP addresses) as a hint for the topology. While attractive, such hints may be obscured by virtualization layers such as VPNs, VLANs, and, more recently, virtual machines. Solely using RTTs of small packets has two primary advantages. One is that it does not rely on protocols such as traceroute and SNMP, which are not universally available in real networks for many reasons, such as security policies and legacy devices. The other is that its measurements are not intrusive to the environment.
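The measurement primitive the whole approach rests on, timing a small TCP ping-pong between two end hosts and keeping the minimum to filter out queuing noise, can be sketched as follows. This is a minimal sketch of the idea, not the paper's implementation; the echo peer, port, and parameter values are our own assumptions for illustration.

```python
import socket
import threading
import time

def echo_server(port):
    # A trivial echo peer standing in for a remote participating host.
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    while (data := conn.recv(1)):   # echo each 1-byte ping back immediately
        conn.sendall(data)
    conn.close()
    srv.close()

def measure_rtt(host, port, rounds=30):
    # Send small pings over TCP and keep the minimum round-trip time;
    # the minimum is the most robust estimate of the path latency.
    s = socket.socket()
    s.connect((host, port))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle batching
    best = float("inf")
    for _ in range(rounds):
        t0 = time.perf_counter()
        s.sendall(b"p")
        s.recv(1)
        best = min(best, time.perf_counter() - t0)
    s.close()
    return best  # seconds
```

Disabling Nagle's algorithm matters here: without `TCP_NODELAY`, small writes may be delayed and the measured gap would not reflect the network RTT.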
In our algorithm, the number of pairs measured is O(N d) under certain idealizing assumptions made for the sake of analysis. Here N is the number of participating processes and d the diameter of the network, which is usually small in real networks. In our experimental environment, the algorithm built a topology of 64 hosts in a single cluster in about 4 seconds, and of 256 hosts in 4 clusters in 15 seconds. The results indicate that it is able not only to identify clusters within a Grid, but also to partially identify the Layer 2 topology within a cluster. For example, nodes directly connected to a single switch are usually identified correctly. As such, our topology inference gives useful information both within and across LANs. Applications built upon the generated topology include: (1) construction of bandwidth maps, for which the topology is a prerequisite to safely perform as many measurements as possible in parallel, (2) optimized broadcast of small messages, and (3) optimized broadcast of large messages.

The rest of the paper is organized as follows. Section 2 compares our work with previous work. Section 3 describes the basic inference algorithm under certain idealized conditions on measurements. Section 4 concerns implementation details for achieving inference speed and accuracy in practical settings. Section 5 evaluates our inference algorithm and Section 6 the applications using the inferred result. Section 7 states conclusions.

2. RELATED WORK
Research on network topology inference differs in its main objectives, and thus in its assumptions and approaches.

Physical topology inference by management protocols. Many studies aim at discovering and inferring the large-scale topology of the Internet, often using traceroute as the basic tool [7, 9, 18]. The primary purposes are management and trouble-shooting, which are very different from the present work.
As such, they typically collect data by long-running observations to find as many (e.g., millions of) hosts and routers as possible, and try to build a (mostly) static database. In contrast, our work aims at light-weight, instantaneous inference of topologies involving (only) the hosts participating in a particular computation. Our approach also has to, and is able to, find both Layer 2 and Layer 3 topologies uniformly. There is also work on finding Layer 2 topologies using management protocols such as SNMP [3, 15]. Black et al. [2] show a different approach of inferring the exact Layer 2 topology solely from observations on end hosts, despite being specific to Ethernet. To the best of our knowledge, the primary objectives are again management and trouble-shooting. This approach, if taken for the purpose of performance optimization of parallel applications, may have an advantage in the accuracy and predictability of the inferred topology, but will suffer from portability and deployment issues.

Effective/logical topology inference by measurements. There have been approaches that infer topologies based solely on end-to-end measurements of some quantities, combined with statistical inference [4, 5, 10]. Our work falls into this category. This approach has portability and deployment advantages in that only end hosts run measurement activities, and the required information can generally be collected without administrator privileges. Another advantage is that it obtains not only the topology, but also some important quantities such as latencies and bandwidths, during the measurement. It is also potentially able to adapt to dynamic conditions. A common limitation of these approaches, including ours, is that the result is subject to measurement errors and thus generally less predictable than the approaches in the previous paragraph. Another limitation is that the inferred switches may not correspond to any physical ones.
Such a topology is sometimes called an effective or logical topology, as opposed to a physical topology. A logical topology may not match the physical topology; as a matter of fact, intermediate routers may in principle be invisible to observations by end hosts. It still conveys to the application important performance characteristics such as shared links. While this is a problem for management or trouble-shooting purposes, it is acceptable, and sometimes very appropriate (due to simplicity and portability), for performance optimization purposes. Approaches in this category differ in measured quantities and measurement methods. Duffield et al. [10] use packet loss rates to judge whether two streams share a link. With this approach, the network must be observed for a sufficiently long time, and the measurements introduce large traffic on the measured network. Coates et al. [5] proposed a less intrusive approach of using small (a few kilobytes) probing packets from a single host to many participating hosts. They reported that it took eight minutes to infer the topology involving nine hosts in the United States and two hosts in Portugal. Their basic method is to let a fixed measurement host S send three packets in a row: the first packet to a host A, the second packet to another host B, and the third packet again to A. The gap between the two (first and third) packets observed by host A reflects the bottleneck bandwidth of the links shared by the two paths (S-A and S-B). By accumulating these measurements, S can infer the topology rooted at S. This approach takes O(N^2) measurements to find a tree rooted at a single node, and thus becomes slow as the number of hosts grows. Another difference between the above works and ours is that they often assume measurements are performed only between a fixed single host and the other hosts. This makes it difficult to precisely infer the topology of hosts far from the fixed host.
In Coates et al.'s method, for example, the fixed host will observe a similar (statistically identical) bandwidth to all hosts within a single remote LAN, in the likely case where the bandwidth of the shared links is smaller than the internal bandwidth of the LAN. In this case, the measurements give no information about the topology inside the LAN. As such, this method is applicable only when hosts are spread sparsely across networks distant from each other. In contrast, our approach involves all participating hosts equally in the measurements and coordinates them so that latencies between far-away hosts are not measured repeatedly, for fast inference. Coates et al. [6] propose a framework for merging measurements from multiple sources, but coordination of these sources so as to reduce the number of measurements is beyond the scope of that paper.

Framework for network-aware Grid applications. There have been several attempts and proposals to design a basis for a common framework for network-aware parallel applications. Network Weather Service [19] is a well-known system that provides the application with the status of the environment, such as available end-to-end bandwidths. It mainly focuses on inferring the near-future time evolution of the measured quantities, and does not address the issue of topology inference. Our overall proposal is in spirit close to [17], which proposes a framework in which an effective topology is inferred solely by end-to-end measurements, for the purpose of optimizing parallel applications. Omitting details, the inference is done by first measuring the bandwidth from a single test host to each of the target hosts, and then grouping target hosts exhibiting similar bandwidths. A single group is further divided into smaller groups by performing simultaneous bandwidth measurements from two hosts in a group to the test host; if the two hosts do not observe significant interference, they are considered to be in separate clusters. As pointed out by the authors, this method shares a similar problem with those in the last paragraph: if the bandwidth from the test host to a group of nodes is less than the bandwidth inside the group, it gives no information about the topology inside the group. [16] proposes a carefully designed standard for presenting network information to applications. Its focus is on designing a standard representation, not on inference. Our work is complementary to this in that we address topology inference, whereas that work presents a more extensive framework for representing the network.

3. A TOPOLOGY INFERENCE BASED ON RTTS

3.1 Model of Network Topology
Our approach models the network as a weighted tree whose leaf nodes are participating hosts (more precisely, processes) and whose internal nodes are switches. Edge lengths reflect latencies. The algorithm infers such a tree by measuring Round Trip Times (RTTs) of small messages between some hosts. This section describes our basic inference algorithm under the ideal condition that the latencies imposed on edges are deterministic and the RTT between two end hosts is simply twice the length of the path between them. Section 4 describes important practical details, including how to select host pairs to measure, how to coordinate processes, and how to cope with measurement errors.

Our algorithm outputs an edge-weighted tree whose weights reflect the latencies of the edges. In reality, of course, the latency of a packet from one host to another is introduced by hosts (Tn), cables (Tl), and switches (Ts), as illustrated in Figure 1 (a). Precisely determining these component latencies is neither possible nor necessary. To see this, it suffices to observe that Figure 1 (a) is indistinguishable from Figure 1 (b) solely by end-to-end measurements. Thus, we model the network as in Figure 1 (b), i.e., as if latencies are imposed only on the edges of a graph, not on its nodes.

Figure 1: A model of network. (a) Real latencies imposed by hosts, cables, and switches. (b) Latencies modeled to be imposed only by cables (edges).

3.2 Algorithm
The algorithm starts by choosing two arbitrary hosts and making the trivial tree that just connects the two. We grow this tree by adding hosts one after another. Let us consider how to insert the third host into the tree (Figure 2). Since the topology among three hosts is unique, it remains to calculate the length of each edge (the distance to the branching point): x, y, and z in the figure. This can be carried out by measuring RTTs between all three pairs and solving the resulting linear system of three equations in three variables. The solution is of course

    x = (AC + AB - BC)/4,
    y = (AB + BC - AC)/4,        (1)
    z = (BC + AC - AB)/4,

where AB, BC, and CA are the RTTs observed between the respective pairs.¹

¹ Not to be confused: an RTT refers to twice the latency between the ends, hence the factor 1/4 instead of 1/2.

Figure 2: Adding the third host

Let us proceed with inserting the fourth host, say D, into the tree consisting of three hosts A, B, and C and a single switch X (Figure 3). We arbitrarily choose two hosts in the tree, say A and B, and solve equation (1) for A, B, and D. This determines the point Y where the segment AB branches towards D. There are two possible outcomes here.

1. Y does not coincide with X, in which case the position of D is immediately determined (Figure 3 (a)).

2. Y coincides with X, in which case we still have two possibilities. One is that D is adjacent (directly connected) to X, as in Figure 3 (b1). The other is that there is another switch X' between X and C, and D is adjacent to it, as in Figure 3 (b2). Stating the two cases together, we know that D is adjacent to a point between X (inclusive) and C. To determine the topology, we measure the RTT between D and C and solve equation (1) for A, C, and D (or B, C, and D). This gives us the point where the segment XC branches to D, uniquely determining the topology among the four hosts.

Figure 3: Adding the fourth host

Now suppose the general case of adding a host H to an arbitrary tree being constructed. The goal is to uniquely determine the point in the tree adjacent to H. To this end, we choose two arbitrary hosts already in the tree, say A and B0, and solve equation (1) for the three hosts A, B0, and H. This gives us the point where the segment AB0 branches towards H. There are two possible outcomes.

1. The point does not coincide with an existing branching point, in which case we create a new branching point X (representing a switch). H becomes adjacent to X and the position of H has been determined (Figure 4 (a)).

2. The point coincides with an existing branching point, say X. In this case, it follows that the path from H to A and the path from H to B0 both go through X. In other words, if we regard the tree as rooted at X, we can deduce that H is neither in the child tree of X containing A nor in the one containing B0. To illustrate, H is in the encircled regions in Figure 4 (b). We thus proceed by choosing another host, say B1, from the encircled region, and solve the same equation with A, B1, and H. This further prunes a subtree from the possible positions of the point adjacent to H. We repeat this until H's position is uniquely determined.

Figure 4: Two possible outcomes of measuring RTTs to two hosts. (a) The branching point of A, B0, and H is not an existing branch. (b) The branching point of A, B0, and H is an existing branch (X); H must be in one of the encircled subtrees.

To summarize, if the measurements of AH and BH determine that the branching point among the three hosts A, B, and H is X, we know that H is not in the child tree of X containing A (and similarly for B). This prunes that child tree from the search space for the position of H. By repeating this pruning, we eventually determine the point adjacent to H. The sub-procedure for adding a host to a tree (add), as well as the entire topology inference procedure (infer), are shown in Figure 5.

add(T, H) { /* add host H to tree T */
  M = all hosts (leaf nodes) of T;
  A = choose and remove an arbitrary host from M;
  AH = measured RTT between A and H;
  do {
    B = choose and remove an arbitrary host from M;
    BH = measured RTT between B and H;
    X = find the branching point of A, B, and H by equation (1);
L:  /* label referenced in the proof */
    if (X does not coincide with an existing node (switch)) {
      add X to T as a new node;
    }
    if (in the first iteration) {
      remove from M all hosts in the child tree of X containing A;
    }
    remove from M all hosts in the child tree of X containing B;
    A = either A or B; /* arbitrarily (see Section 4) */
  } while (M is not empty);
  connect X and H;
}

infer(H) { /* H is a list of hosts H0, H1, ... */
  T = the tree containing only H0 and H1;
  for Hi (i = 2, 3, ...) { add(T, Hi); }
  return T;
}

Figure 5: Basic topology inference algorithm

measure() { /* the sender procedure */
  min = +inf; max = -inf;
  for (i = 0; i < 3; i++) {
    best = +inf;
    for (j = 0; j < 30; j++) {
      rtt = single_round_trip();
      if (rtt < best) best = rtt;
      if (best was not updated in the last ten iterations) break;
    }
    if (best > max) max = best;
    if (best < min) min = best;
  }
  return both min and max as the result of the measurement;
}

Figure 6: Measurement between a host pair

Figure 7: Distribution of RTTs in cluster 4

3.3 An Upper Bound of the Number of Measurements
Let N be the number of hosts, d the maximum number of hops between any pair of hosts, and p the maximum number of ports of a single switch.
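Before bounding the number of measurements, the branching-point computation of equation (1) can be made concrete. The sketch below is a direct rendering under this section's exact-RTT assumption, applied as in procedure add with H in place of C; function and variable names are ours.

```python
def branch_point(rtt_ab, rtt_ah, rtt_bh):
    """Given the RTTs among hosts A, B, and H, return (x, y, z): the distances
    from A, B, and H, respectively, to the point where the path A-B branches
    toward H. The factor is 1/4 rather than 1/2 because an RTT is already
    twice the one-way path latency."""
    x = (rtt_ab + rtt_ah - rtt_bh) / 4.0
    y = (rtt_ab + rtt_bh - rtt_ah) / 4.0
    z = (rtt_ah + rtt_bh - rtt_ab) / 4.0
    return x, y, z
```

For instance, if A, B, and H sit 10, 20, and 30 latency units from a common branching point, the observed RTTs are 60, 80, and 100, and the formula recovers exactly (10, 20, 30).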
We are going to show that the loop in procedure add iterates at most p(d - 1) times (*); thus the total number of measurements that procedure infer uses is at most (p(d - 1) + 1)(N - 2) + 1. If we regard p as a constant not depending on N, this is O(N d).

To show (*), suppose we are adding host H to a tree being constructed. Let T be the tree after adding H. Let Ai, Bi, and Xi denote the values of A, B, and X, respectively, used by the i-th iteration (more precisely, at program point L of the i-th iteration). The following simple observation is the key. Let us regard T as rooted at Xi-1. The i-th iteration has two cases.

Case 1: Bi and H are in the same child tree of Xi-1, in which case the branching point of A, Bi, and H is in that child tree too, so Xi is strictly closer to H than Xi-1.

Case 2: Bi and H are in different child trees of Xi-1, in which case the branching point of A, Bi, and H is Xi-1, so Xi = Xi-1.

Now, since the i-th iteration removes from M all hosts in the child tree of Xi containing Bi, the (i + 1)-th iteration never chooses Bi+1 from the same child tree. Thus, by the above observation, at least one out of every p consecutive iterations falls into case 1. Since case 1 can happen at most d - 1 times, the total number of iterations is at most p(d - 1).

4. MEASUREMENTS AND INFERENCE IN PRACTICE
Under the idealizing assumptions made in the previous section, both the order of the Hi's in procedure infer and the choice of A and B in procedure add were arbitrary. In practice, however, unavoidable fluctuations and errors in measurements may produce an incorrect result. In particular, fluctuations make it non-trivial for procedure add to decide whether a branching point X coincides with an existing branch.
This section details important practical issues, including this one.

infer_local() { /* the local procedure for topology inference */
  wait for a tree T (= Ti-10) to arrive from Hi-10;
  A = find_close_host(T);
  wait for a tree T (= Ti-1) to arrive from Hi-1;
  add(T, Hi), choosing A as the first host;
  send T to Hi+1 and Hi+10;
}

find_close_host(T) {
  M = all hosts in T;
  while (M is not empty) {
    choose and remove an arbitrary host J from M;
    x = RTT between H and J; record it;
    for all K in M {
      y = distance between K and J on T;
      if (y > 3x or y < x/3) remove K from M;
    }
  }
  return the host that had the minimum RTT in the loop;
}

Figure 8: The node-local procedure, including a measurement strategy

Cluster  CPU                    NIC driver
1        Dual Xeon 2.4GHz       e1000
2        Pentium M 1.8GHz       sk98lin
3        Pentium M 1.8GHz       sk98lin
4        Dual Xeon 2.4/2.8GHz   bcm5700

Figure 9: Environment of experiments

Coordinating measurements and choosing host pairs to measure. Since our algorithm needs to add hosts sequentially, one after another, and adding host H requires measurements from H, the tree being constructed must be circulated among the hosts. We line the hosts up in a linear list H0, H1, .... Process Hi waits for a tree involving H0, ..., Hi-1 to arrive, adds itself, and passes the updated tree to Hi+1. We call the tree after adding Hi Ti. To minimize the effect of errors and fluctuations in measurements, it is important to perform measurements between "appropriate" host pairs. Specifically, as suggested in Figure 7, we need to be able to distinguish RTTs of 90 microseconds from those of 100 microseconds. Thus, in procedure add, it is vital to choose A and B so that they are close to H. This also has the effect of reducing the runtime critical path of the entire inference operation: adding a host to a tree finishes quickly if close hosts are selected in the first place. We implement two strategies to this end.
· At the end of an iteration in procedure add (the statement "A = either A or B"), we choose whichever was closer to H.

· When H selects a host as the target of a measurement (at the beginning of an iteration in add), it does its best to choose the host closest to H. To obtain some knowledge for doing so, concurrently with the main procedure of adding hosts to the tree one after another, another tree is circulated slightly ahead of the real one. Specifically, after the tree is built up to Ti, it is passed to Hi+1 and, at the same time, to Hi+10 as well. This way, Hi+10 obtains an approximate and likely view of the tree it is going to receive soon, and commences measurements to search for hosts close to itself. It basically measures RTTs to some randomly selected hosts in the received tree, but with the following criterion to reduce the number of measurements to far hosts: if the RTT from host H to a host J is measured and turns out to be x, host K is removed from the candidate list of measurements if the path between J and K (on tree Ti) is > 3x or < x/3. This is because the distance between H and K in the resulting tree is > 2x in the former case and > 2x/3 in the latter. Intuitively, H excludes hosts close to a far host, and hosts far from a close host, both of which are known to be far from H. This strategy is especially effective in multi-cluster environments: it limits the number of inter-cluster measurements to one per cluster pair, in the likely case where intra-cluster RTTs are much smaller than inter-cluster RTTs. Putting the above together, the pseudo code for the local procedure running on host Hi is shown in Figure 8 (infer_local).

Measurements between a single host pair. Measurements between a single pair of hosts are carried out by repeatedly exchanging ping-pong messages via TCP between the two hosts. One host sends a ping message, to which the other immediately replies. Each RTT is measured by the sender. The sender keeps track of the best record obtained so far, and quits a single round of measurements when the best record has not been updated in ten consecutive round trips, or when it has performed the fixed maximum number of exchanges (currently thirty). It repeats three such rounds and returns the minimum and the maximum. The pseudo code is shown in Figure 6. When procedure add determines the branching point adjacent to host H, it actually determines an interval of possible branching points, using both the minimum and the maximum reported by the measurement. If the interval contains an existing branch, we consider one of them to be the branching point to H.

The accuracy of our method does not critically depend on the particular set of parameters chosen above; any sufficiently accurate measurement that can distinguish RTTs of different hop counts will suffice. Figure 7 shows distributions of RTTs between pairs of hosts 2, 4, and 6 hops away within a cluster (being 2 hops away means they are directly connected to a common switch), and those between pairs of hosts in two separate clusters. The two clusters are located on different floors of the same building, with three Layer 3 routers between them. Adding two hops adds approximately 10 microseconds to end-to-end RTTs in this cluster (the switch is a DELL PowerConnect 5224). This distribution gives us confidence that our sampling method is sufficient for a host H to distinguish an RTT to one host from an RTT to another host, provided the measurement is done from an appropriate (i.e., close) host.

5. EVALUATION
Experiments were conducted on up to 256 hosts across four clusters (Figure 9). The Layer 2 physical topology inside each cluster and the Layer 3 (logical) topology conveying shared links are shown, along with typical latencies between clusters and within each cluster. Within a cluster, latencies vary depending on the number of hops between hosts.
All hosts run the Linux operating system and each host uses a single Gigabit Ethernet network interface. Clusters 3 and 4 are in the same building. Cluster 1 is approximately 25 kilometers and cluster 2 approximately 31 kilometers away from that building. Note that latencies between clusters differ significantly depending on the pair.

Figure 11: Accuracy of RTTs derived from inferred topologies

5.1 Accuracy of Inferred Topology
To give some intuition about accuracy, Figure 10 is a typical snapshot of an inferred topology. Nodes of the same color belong to a single cluster, and the result seems successful in identifying them. It also succeeded in identifying groups of nodes within a cluster connected to a single switch, and partially succeeded in inferring the differing Layer 2 topologies of the different clusters. On the bottom left is cluster 2, which has a star-like topology (a central hub with four switches connected to it). On the upper right is cluster 3, which has a linear, list-like topology.

To measure the accuracy quantitatively, we evaluated the inferred topologies using two criteria. One is how accurately the inferred tree predicts the end-to-end latencies of host pairs. The other is how accurately it gives information about bandwidth sharing, important information that is not readily available without topology information. For the former, we predict the end-to-end latency between two hosts simply by the path length between the two on the inferred tree, and divide it by the latency directly measured between each host pair. Figure 11 shows the histogram of these ratios. All graphs show clear peaks around 1.0, and they are especially sharp for clusters 2 and 3. We observed through our experiments that RTTs between hosts in these two clusters are more stable than in the other two, which is the reason for the better results in clusters 2 and 3.
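The first criterion can be made concrete with a small sketch. This is a hypothetical rendering, assuming the inferred tree is stored as parent pointers with per-edge lengths; all names are ours.

```python
def tree_rtt(parent, length, a, b):
    """Predicted RTT between hosts a and b: twice the path length on the tree.
    parent maps a node to its parent (the root is absent from the mapping);
    length maps a node to the length of the edge to its parent."""
    chain = {}                      # ancestor of a -> distance from a
    n, d = a, 0.0
    while n is not None:
        chain[n] = d
        d += length.get(n, 0.0)
        n = parent.get(n)
    n, d = b, 0.0
    while n not in chain:           # climb from b to the lowest common ancestor
        d += length[n]
        n = parent[n]
    return 2.0 * (chain[n] + d)

def accuracy_ratio(parent, length, a, b, measured_rtt):
    # The Section 5.1 criterion: predicted RTT divided by the measured RTT.
    return tree_rtt(parent, length, a, b) / measured_rtt
```

A ratio near 1.0, the peak observed in Figure 11, means the tree's path lengths reproduce the directly measured latencies well.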
For the latter, we used the generated topology to answer queries of the form: "do the traffic between A and B and that between C and D share a link?" Our system replies to these queries simply by looking at the inferred tree. We consider two traffics to share a link only when their paths share at least one edge, or equivalently, two nodes. We compared the reply derived from the inferred tree to the true answer known from the real physical topology. For each inferred tree, 100K queries were randomly generated by uniformly choosing four hosts, and false rates were recorded. Experiments were conducted on each of the four clusters as well as on all four clusters together. For each resource configuration, we performed such an experiment one hundred times. The graphs of Figure 12 plot the distributions of false rates over those one hundred experiments. The x-axis indicates false rates and the y-axis the number of times (out of one hundred) the false rate fell into a particular range. Two kinds of false rates are reported separately for each resource configuration. A false positive means the system answered that two traffics share a link even though they actually do not. This typically happens when the system wrongly splits a physically single switch into two (or more) separate switches. In this case, traffics across these switches are predicted to share links, even though they actually do not. This tends to make false positive rates relatively high even when the number of wrongly separated switches is small. For example, if a p-port switch is split into two (p/2)-port switches, and all p ports are connected to hosts, it immediately adds (p/2)^2 false positive cases (or a 0.25 false positive rate if these are all the hosts). Such wrongly separated switches account for most of the false positives. A false negative means the opposite, and typically happens when switches close to each other are wrongly unified into a single switch.
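The shared-link query itself is answered purely from the tree. A minimal sketch (tree encoding and names hypothetical) that tests whether the A-B path and the C-D path share an edge:

```python
# Hypothetical sketch: two traffics share a link iff their tree paths
# share at least one edge. `parent` maps each node to its parent.

def tree_path_edges(a, b, parent):
    """Return the set of (undirected) edges on the a-b path."""
    up_a = [a]
    n = a
    while n in parent:
        n = parent[n]
        up_a.append(n)
    seen = set(up_a)
    # Climb from b to the lowest common ancestor.
    path_b = [b]
    n = b
    while n not in seen:
        n = parent[n]
        path_b.append(n)
    lca = n
    path_a = up_a[: up_a.index(lca) + 1]
    edges = set()
    for seq in (path_a, path_b):
        edges |= {frozenset(e) for e in zip(seq, seq[1:])}
    return edges

def share_link(a, b, c, d, parent):
    return bool(tree_path_edges(a, b, parent) & tree_path_edges(c, d, parent))

# h1,h2 under switch S1; h3,h4 under switch S2; S1 and S2 under router R.
parent = {"h1": "S1", "h2": "S1", "h3": "S2", "h4": "S2", "S1": "R", "S2": "R"}
assert share_link("h1", "h3", "h2", "h4", parent) is True   # both cross S1-R, R-S2
assert share_link("h1", "h2", "h3", "h4", parent) is False  # disjoint subtrees
```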
In the experiments this was most often caused by wrongly unifying the two switches connecting the clusters on the left (clusters 1 and 2) and the two clusters on the right (clusters 3 and 4) in Figure 9. These two switches are only 5µs apart, and moreover, no hosts are sufficiently close to the left one. This makes it difficult for any host to distinguish the two switches.

Figure 12: Distributions of false rates to queries on shared links

5.2 Cost of Inference

Figure 13 shows the inference time with a varying number of clusters and hosts involved. For multi-cluster experiments, we take nodes from the clusters uniformly. In the largest configuration of 256 processes, inference took approximately 15 seconds. The results are encouraging in terms of scalability; this is clearly the effect of reducing the number of measurements and localizing as many measurements as possible. The inference time is dominated by RTT measurements between far hosts, and by the critical path that circulates the tree under construction (Section 4). For comparison, the curve labeled "all-to-all" refers to the case where measurements are done between all host pairs.

Figure 10: An inferred topology of 4 clusters

Figure 14 shows, for each pair of clusters, what fraction of host pairs was actually measured between the two. For example, the leftmost bar indicates that, of the total possible host pairs within cluster 1 (64 × 63/2 in this case), about 38% were actually measured. In contrast, it can clearly be seen that far fewer measurements were made across clusters.

Figure 13: Inference time

Figure 14: Percentages of pairs measured between and within clusters

6. USING INFERRED TOPOLOGIES

6.1 Building Bandwidth Maps

A bandwidth map is a tree whose edges are annotated with their bandwidths; it can be used to predict end-to-end bandwidths. Here we are concerned with predicting the bandwidth of a single traffic not interfered with by other simultaneous traffic. We use our topology to derive and schedule the necessary end-to-end bandwidth measurements. Since measuring bandwidths is much more intrusive to the environment, and concurrent measurements using a shared link are likely to interfere with each other, building a bandwidth map given no information about the topology would have to conservatively perform most measurements serially. We schedule the necessary measurements in stages. A single stage only contains measurements that are predicted not to conflict by the inferred tree.

We choose a switch as the root and classify nodes of the tree by their depths (the number of hops from the root, with the root considered to be at depth zero). In stage i (i = 1, 2, . . . , h, where h is the height of the tree), we try to measure the links connecting nodes at depth (h - i) to their children. To this end, for each node at depth (h - i), we elect a single host under each of its children. Say a node X at depth (h - i) has p children C1, . . . , Cp. We choose host Hi, the host under Ci having the maximum bandwidth to Ci (according to the bandwidths estimated so far). We make p/2 pairs among the Hi's and let each pair perform an end-to-end bandwidth measurement. These measurements are done in parallel. If p is odd, we pair the last host with any one of the others and measure that pair in isolation. The observed bandwidth between Hi and Hj is recorded as the bandwidth of the two links XCi and XCj.

In the last step, the bandwidth observed between Hi and Hj is actually the minimum among the bandwidth of HiCi, that of XCi, that of XCj, and that of HjCj. The above algorithm assigns it as the bandwidth of XCi and XCj. This is certainly not always correct, and is based on two assumptions we make for the efficiency of the inference: (1) the ports of a single switch are homogeneous (i.e., have equal bandwidth), and (2) XCi (and, by consequence of assumption (1), XCj too) is one of the bottleneck links between Hi and Hj. If (2) does not actually hold, the algorithm still gives a valid bandwidth for a single, non-simultaneous, end-to-end traffic, because that traffic is limited by a bottleneck below Ci or Cj anyway. Thus, assumption (2) is not significant as long as we are only concerned with estimating the bandwidth of a single traffic. Assumption (1) is usually true, with an important exception being the uplink of a switch. In this case, however, the uplink is usually more powerful than the other ports, and the algorithm simply underestimates it as if it were equal to the other ports. This is not a problem either, for our purpose of estimating the bandwidth of a single traffic. Thus, the overall method remains attractive, as it is lightweight and reasonably accurate in many practical scenarios. More intensive measurements (e.g., measuring the bandwidths of simultaneous traffic to determine the bandwidth of the uplinks) could of course remove these assumptions, but would be more intrusive and slower.

Figure 16 shows the time to build a bandwidth map. Overall accuracy is measured by queries on the bandwidth between a given pair of hosts. The system simply replies with the minimum bandwidth on the path, and we record the ratio of the reply from the system to the real answer obtained by an actual isolated measurement using iperf. We separately confirmed that this number matches our own simple transfer program built on TCP sockets. Figure 17 shows the distribution of the ratios.

6.2 Long Message Broadcast

Next, we focus on long messages, which need to be optimized for bandwidth. When the real topology is a tree, optimal bandwidth can be obtained by pipelining the message so that each link is used at most once in each direction.
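Assuming the inferred topology is a tree, such a pipeline order can be obtained by visiting the hosts in depth-first order. A minimal sketch (the `children` map and all names are hypothetical), where each host forwards the message to the next host in the returned order:

```python
# Hypothetical sketch: order hosts by a depth-first traversal of the
# inferred tree, so that the resulting forwarding chain uses each tree
# link at most once in each direction. Switch nodes appear in the tree
# but not in the final order, since only hosts forward the message.

def pipeline_order(root, children, is_host):
    order = []

    def dfs(node):
        if is_host(node):
            order.append(node)
        for c in children.get(node, []):
            dfs(c)

    dfs(root)
    return order  # host i forwards the message to host i+1

# Router R connects switches S1 and S2, each with two hosts attached.
children = {"R": ["S1", "S2"], "S1": ["h1", "h2"], "S2": ["h3", "h4"]}
hosts = {"h1", "h2", "h3", "h4"}
assert pipeline_order("R", children, hosts.__contains__) == ["h1", "h2", "h3", "h4"]
```

In this order the S1-R and R-S2 links carry the message exactly once (from h2 to h3), whereas a random visiting order could cross them many times.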
We create such a pipeline by performing a depth-first traversal of our inferred topology. Of course, the topology is not always a tree, and when it is not, the existence of redundant paths needs to be taken into consideration, as the algorithm described in [8] does. Yet our tree-based algorithm is quite useful in practice, because the topologies of local area networks, Ethernet in particular, are typically trees. Moreover, even in a wide area network, it is common for the routing paths among clusters to form a tree when only a few clusters are involved.

We measured the broadcast bandwidth across the nodes of cluster 4. Each node was connected to the network by a single Gigabit Ethernet link, and the switches were connected to each other by four Gigabit Ethernet links using link aggregation. Figure 18 shows the results for three different algorithms: the optimal pipeline (i.e., a depth-first traversal of the real topology), a pipeline based on our inferred topology, and a pipeline in which nodes were visited in a random order. The optimal pipeline achieved a bandwidth of 624 Mbps, and our pipeline achieved 88% of that. Our pipeline actually contained a few links that were used multiple times, but this did not have a large effect on the bandwidth because of link aggregation. The random pipeline, however, had so many links that were used multiple times that it achieved only 23% of the bandwidth of the optimal pipeline.

Figure 15: Schedule for bandwidth measurement

Figure 16: Time to build bandwidth map

Figure 17: Accuracy of bandwidth maps

Figure 18: Bandwidth of broadcast in cluster 4

7. CONCLUSION AND FUTURE WORK

We have shown an efficient topology inference algorithm and its applications to a variety of operations. The algorithm shows an encouraging scalability property in terms of inference time and a good locality property in terms of the number of local/remote measurements; it inferred the topology of a cluster of 64 nodes in 4 seconds and that of four clusters of 256 nodes in 15 seconds. The result was accurate; it successfully identified clusters and groups of directly connected nodes within a cluster. Our near-future plans include further acceleration of the inference, by performing topology construction of far-away groups of nodes in parallel. We will also investigate a more thorough approach to bandwidth mapping, addressing the issues pointed out in Section 6.1.

8. ACKNOWLEDGMENTS

We thank our anonymous reviewers in the HPDC committee, and especially our shepherd Thilo Kielmann, for their comments for improving our paper. This work is partially supported by "MEXT Grants-in-Aid for Scientific Research on Priority Areas" and "MEXT Grant-in-Aid for Specially Promoted Research."

9. REFERENCES

[1] O. Aumage and G. Mercier. MPICH/MADIII: a Cluster of Clusters Enabled MPI Implementation. In 3rd International Symposium on Cluster Computing and the Grid, pages 26-33, 2003.
[2] R. Black, A. Donnelly, and C. Fournet. Ethernet Topology Discovery without Network Assistance. In ICNP, pages 328-339, 2004.
[3] Y. Breitbart, M. Garofalakis, B. Jai, C. Martin, R. Rastogi, and A. Silberschatz. Topology discovery in heterogeneous IP networks: the NetInventory system. IEEE/ACM Trans. Netw., 12(3):401-414, 2004.
[4] J. W. Byers, A. Bestavros, and K. A. Harfoush. Inference and Labeling of Metric-Induced Network Topologies. IEEE Trans. Parallel Distrib. Syst., 16(11):1053-1065, 2005.
[5] M. Coates, R. Castro, R. Nowak, M. Gadhiok, R. King, and Y. Tsang. Maximum likelihood network topology identification from edge-based unicast measurements. SIGMETRICS Perform. Eval. Rev., 30(1):11-20, 2002.
[6] M. Coates, M. Rabbat, and R. Nowak. Merging logical topologies using end-to-end measurements. In IMC '03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 192-203, New York, NY, USA, 2003. ACM Press.
[7] M. den Burger, T. Kielmann, and H. E. Bal. TOPOMON: A Monitoring Tool for Grid Network Topology. In ICCS '02: Proceedings of the International Conference on Computational Science, Part II, pages 558-567, London, UK, 2002. Springer-Verlag.
[8] M. den Burger, T. Kielmann, and H. E. Bal. Balanced Multicasting: High-throughput Communication for Grid Applications. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, 2005.
[9] B. Donnet, P. Raoult, T. Friedman, and M. Crovella. Efficient algorithms for large-scale topology discovery. SIGMETRICS Perform. Eval. Rev., 33(1):327-338, 2005.
[10] N. Duffield, J. Horowitz, F. L. Presti, and D. Towsley. Multicast topology inference from measured end-to-end loss. IEEE Transactions on Information Theory, 48:26-45, 2002.
[11] E. Gabriel, M. Resch, T. Beisel, and R. Keller. Distributed Computing in a Heterogeneous Computing Environment. In Proceedings of the 5th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 180-187, London, UK, 1998. Springer-Verlag.
[12] T. Imamura, Y. Tsujita, H. Koide, and H. Takemiya. An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers. In Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 200-207, London, UK, 2000. Springer-Verlag.
[13] N. T. Karonis, B. Toonen, and I. Foster. MPICH-G2: a Grid-enabled implementation of the Message Passing Interface. J. Parallel Distrib. Comput., 63(5):551-563, 2003.
[14] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In PPoPP '99: Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 131-140, New York, NY, USA, 1999. ACM Press.
[15] B. Lowekamp, D. O'Hallaron, and T. Gross. Topology discovery for large Ethernet networks. In SIGCOMM '01: Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 237-248, New York, NY, USA, 2001. ACM Press.
[16] B. B. Lowekamp, B. Tierney, L. Cottrell, R. Hughes-Jones, T. Kielmann, and M. Swany. Enabling Network Measurement Portability Through a Hierarchy of Characteristics. In GRID '03: Proceedings of the Fourth International Workshop on Grid Computing, page 68, Washington, DC, USA, 2003. IEEE Computer Society.
[17] G. Shao, F. Berman, and R. Wolski. Using Effective Network Views to Promote Distributed Application Performance. In Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications, pages 2649-2656, 1999.
[18] Skitter. http://www.caida.org/tools/measurement/skitter.
[19] R. Wolski, N. T. Spring, and J. Hayes. The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Journal of Future Generation Computing Systems, 15(5-6):757-768, Oct. 1999.