An Efficient MPI Allgather for Grids

Rakhi Gupta
Computer Science/Information Technology Department
Jaypee Institute of Information Technology University
Noida-201307, India
rakhi.hemani@jiit.ac.in

Sathish Vadhiyar
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore-560012, India
vss@serc.iisc.ernet.in

[This work is supported by Indian Institute of Science's 10th Plan Grant SERC Part(2A) Special Grant (45/SERC).]

ABSTRACT
Allgather is an important MPI collective communication. Most of the algorithms for allgather have been designed for homogeneous and tightly coupled systems. The existing algorithms for allgather on Grid systems do not efficiently utilize the bandwidths available on the slow wide-area links of the grid. In this paper, we present an algorithm for allgather on grids that efficiently utilizes wide-area bandwidths and is also wide-area optimal. Our algorithm is also adaptive to grid load dynamics, since it considers transient network characteristics for dividing the nodes into clusters. Our experiments on a real-grid setup consisting of 3 sites show that our algorithm gives an average performance improvement of 52% over existing strategies.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Distributed programming, Parallel programming

General Terms
Algorithms, Design, Performance

Keywords
MPI Allgather, Grids

1. INTRODUCTION
Collective communication operations are widely used in MPI applications and play an important role in their performance. Hence, various projects have focused on the optimization of collective communications for various kinds of parallel computing environments, including homogeneous and heterogeneous LAN networks [3, 4, 6, 8, 22] and, most recently, Grid systems [9, 11, 12, 14, 19, 20].

Allgather is an important many-to-many collective communication operation. Allgather on N processes can be considered equivalent to N broadcasts of data, each conducted with a distinct root. An algorithm for implementing allgather needs to specify a schedule of the required data transfers. An efficient allgather schedule depends on various factors, including the message size and the network characteristics. Most of the algorithms for allgather are designed for homogeneous networks [3, 4, 6, 8, 22]. These algorithms follow uniform communication patterns between all nodes and hence cannot be used in Grid settings, where the network links are highly heterogeneous and the link characteristics change over time. Current popular techniques for allgather on grids [12, 14, 18] follow static network hierarchical schemes, where the nodes are divided into clusters on the basis of network topology and a representative/coordinator node is chosen from each cluster. The inter-cluster communications are done through these representative nodes. At any given point of time, only one data transfer takes place between two representative nodes.
These techniques are not efficient, since the clusters are separated by wide-area links that can sustain multiple simultaneous data transfers with the same end-to-end bandwidths [5]. Hence the existing strategies for grids do not exploit the total available bandwidths of the wide-area links. In this paper, we present our cluster-based and incremental greedy algorithm, called Min3-Allgather, for allgather on grids. Our algorithm divides the nodes into clusters based on transient network characteristics, namely available bandwidths, and follows a recursive approach where allgather is performed at different levels of the hierarchy. Our algorithm allows multiple simultaneous communications between 2 clusters separated by slow wide-area links and hence effectively utilizes the available bandwidths of the wide-area links. Our algorithm is also wide-area optimal [12, 14], since it ensures that a data segment is transferred only once between two clusters separated by a wide-area link. We compared the time taken by the allgather schedules determined by this algorithm with current popular implementations. We also compared our algorithm with a strategy where allgather is constructed from a set of broadcast trees. Our experiments on a real-grid setup show that the average performance improvement of our algorithm is 52% over existing strategies.

In Section 2, we present related efforts in the development of allgather algorithms. In Section 3, we describe the design principles used in the development of our algorithm. Section 4 explains the communication models used in our algorithm. Section 5 describes our algorithm, Min3-Allgather, for grids. In Section 6, we compare our algorithm with existing strategies on a real-grid setup and present results. Section 7 gives conclusions and Section 8 presents future work.

2. RELATED WORK
A number of generic and theoretically efficient allgather algorithms have been developed for homogeneous clusters [8]. The simple algorithm posts all the sends and receives and waits for their completion. In this algorithm, a process sends messages to the other processes in the order of their ranks. In order to avoid the potential node and network contention caused by the simple algorithm, the spreading simple algorithm was proposed. In this algorithm, in each iteration i, a process p sends its data to process (p+i) mod N and receives data from process (p-i+N) mod N, where N is the number of processes. The ring/bucket/circular algorithm [6] was developed for architectures where near-neighbor communications can be beneficial. In each iteration i, a process p sends the data corresponding to index (p-i+1+N) mod N to its right neighbor, process (p+1) mod N, and receives data from its left neighbor, process (p-1+N) mod N. The time taken for allgather in the simple, spreading simple and ring algorithms is (N-1)*(L + m/B), where L is the latency, m is the message size and B is the bandwidth. The recursive doubling algorithm takes less time than the previous algorithms because the number of communication steps (and hence the latency cost) is reduced. The number of iterations required when N is a power of 2 is log N. In each iteration i, processes separated by a distance of 2^(i-1) exchange the data they have gathered so far. Recursive doubling is sub-optimal when the number of processes is not a power of two, because some data transfers may be repeated.
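To make these communication patterns concrete, the following small Python sketch (ours, not from the paper) computes the per-iteration communication partners of the spreading simple, ring and recursive doubling algorithms; zero-based ranks and the helper names are assumptions made for this illustration.

```python
# Per-iteration communication partners of the homogeneous allgather
# algorithms discussed above, for process p out of N (ranks 0..N-1).
# These helpers only compute the pattern; they perform no communication.

def spreading_simple_partners(p, i, N):
    # iteration i = 1..N-1: send own data to (p+i) mod N, receive from (p-i) mod N
    return (p + i) % N, (p - i + N) % N

def ring_partners(p, N):
    # same neighbours in every iteration: send right, receive left;
    # the index of the data block being forwarded rotates each iteration
    return (p + 1) % N, (p - 1 + N) % N

def recursive_doubling_partner(p, i, N):
    # iteration i = 1..log2(N), N a power of two: exchange everything
    # gathered so far with the process at distance 2^(i-1)
    return p ^ (1 << (i - 1))

if __name__ == "__main__":
    N = 8
    print([recursive_doubling_partner(p, 2, N) for p in range(N)])
    # -> [2, 3, 0, 1, 6, 7, 4, 5]: pairs separated by distance 2
```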
The dissemination algorithm developed by Benson et al. [3] is similar to the single-port algorithm described in the work by Bruck et al. [4]. The algorithm has ⌈log N⌉ iterations. In each iteration i, process p sends data to process (p + 2^(i-1)) mod N. The amount of data sent in each iteration except the last is 2^(i-1)*m. In the last iteration, (N - 2^(i-1))*m data is sent. The work by Thakur et al. [22] gives a detailed analysis of the above algorithms for allgather. It shows that different algorithms are optimal for different message sizes. MPICH [17], the popular implementation of MPI, uses recursive doubling for small message sizes when the number of processes is a power of 2. However, if the number of processes is not a power of 2, the dissemination algorithm is used. MPICH uses the ring algorithm for large message sizes.

Current popular topology-aware allgather scheduling strategies for grids divide the network into network hierarchies. The nodes are divided into clusters and a coordinator node is assigned to each cluster. MagPIe [12] proposes a three-phase algorithm for allgather: gather data at the coordinators, allgather among the coordinators, and broadcast of the data by the coordinators. In the second phase, the coordinators perform allgather using the spreading simple algorithm. At this stage all coordinators have all the required data. In the third phase, the coordinators broadcast the data to the processes in their clusters using binomial broadcast. MPICH-G2 [18] implements a similar algorithm with 2 phases. In the first phase, data is gathered up the hierarchy using recursive doubling. Next, data is broadcast downwards using binomial broadcast. The major drawback of these approaches is that data transmission is sequentialized at the coordinators. This results in low usage of the available bandwidths at the higher layers of the hierarchy (e.g. the WAN links). Also, in these approaches, the network hierarchy for a given grid setup is formed on the basis of information about WAN and LAN links. Hence the network hierarchy is static and the same hierarchy is used for all allgather operations. Our algorithm allows multiple simultaneous data transfers between 2 clusters, resulting in increased use of the available bandwidths on wide-area networks and hence improved performance of allgather on grids. Also, our algorithm forms the network hierarchy on the basis of transient network characteristics, namely available bandwidths. Thus our algorithm is adaptive to grid load dynamics, since the network hierarchy can change with changes in the transient network characteristics.

3. DESIGN PRINCIPLES
The following are the design principles used in the construction of our allgather algorithm.

1. Multi-level collective communication algorithms are more suitable for grids than single-level algorithms. Collective communication algorithms for homogeneous systems [8, 17] and some recent heuristics for distributed systems [16] follow a single-level strategy for building communication schedules. The recent efforts for grids [9, 11, 12, 14] divide the given set of nodes into clusters/pools, form hierarchies between the clusters, and follow different strategies for the different levels of the hierarchies. The intra-cluster links (LAN links, high performance networks) are relatively faster than the inter-cluster links (WAN, internet, campus networks). Such clustering of nodes helps in designing algorithms that carefully avoid transmitting the same data multiple times on a slow link connecting two clusters. This is essential for ensuring wide-area optimality of the collective communication algorithms [14]. While some strategies divide the nodes or network based on static network topologies [11, 12, 14], we divide the nodes based on transient network characteristics, similar to our earlier approach for broadcasts [9]. A sketch of this kind of bandwidth-based clustering is given below; the exact procedure we use is described in Section 5.1.
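The following rough Python sketch illustrates threshold-based pool formation (ours, not the paper's code); the greedy seeding of pools and the averaging rule are assumptions made for this sketch, while the host-joins-if-average-bandwidth-meets-threshold criterion mirrors the pool definition given later in Section 5.1.

```python
# Rough sketch of threshold-based pool formation: a host joins a pool if
# its average bandwidth to the current members is at least the threshold.

def form_pools(hosts, bw, threshold):
    """bw[(a, b)] is the measured bandwidth from host a to host b in Mbps."""
    pools = []
    for h in hosts:
        placed = False
        for pool in pools:
            avg = sum(bw[(h, o)] for o in pool) / len(pool)
            if avg >= threshold:
                pool.append(h)
                placed = True
                break
        if not placed:
            pools.append([h])          # start a new pool with this host
    return pools

# Example: two 100 Mbps LANs joined by a ~4 Mbps WAN link.
hosts = ["a1", "a2", "b1", "b2"]
bw = {(x, y): (100 if x[0] == y[0] else 4)
      for x in hosts for y in hosts if x != y}
print(form_pools(hosts, bw, threshold=50))   # -> [['a1', 'a2'], ['b1', 'b2']]
```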
2. WAN links can sustain many simultaneous transfers without performance degradation. In the current popular algorithms for allgather on grids [12, 18], a single coordinator node is chosen in every cluster and only the coordinators participate in inter-cluster data transfers across WAN links. However, it is difficult to utilize the total bandwidth available on a WAN link with a single data transfer. This is because of TCP behavior, where a host sends some packets and then waits for acknowledgments before sending the next packets. High Round Trip Times (RTT) between nodes separated by WAN links cause late arrival of the acknowledgment packets, resulting in delays in the transmission of packets and hence lower bandwidth utilization. It has been found that many simultaneous transfers on these links can help in effective utilization of the underlying bandwidth [5].

Moreover, the concept of choosing coordinator nodes for data transfers on WAN links is beneficial for broadcast communications, where only a single message has to propagate to all nodes. In this case, choosing multiple nodes in a cluster to send messages to nodes in another cluster would lead to poor performance of the broadcast operations. However, in an allgather operation, different nodes in one cluster have distinct messages to send to nodes in another cluster. Making the nodes send these distinct messages through the coordinator nodes leads to severe sequentialization of messages and acts as a bottleneck in communications. In our algorithm, we allow multiple simultaneous inter-cluster transfers between any two clusters separated by a WAN link.

3. Scheduling of current data transfers should be based on previously scheduled data transfers. In an allgather operation, a single node may be involved in multiple data transfers corresponding to different messages from different sources. The data transfers in allgather must be scheduled such that there is enough parallelization in the work performed by different nodes in different time steps. This helps in avoiding the node-level bottlenecks that can be caused when a node processes multiple messages in a single time step. Care must also be taken that different nodes perform equivalent amounts of work in different time steps. For example, Figure 1 depicts a situation where a cluster has to broadcast two distinct messages obtained from remote sites during an allgather operation. Two possible methods of broadcast are shown. The first method uses the same tree for both broadcasts, whereas the second uses two different trees. Assuming a homogeneous, fully connected network, unit transfer time for data, single-port full-duplex connections (i.e. a host can do at most 1 send and 1 receive simultaneously) and the condition that a host can send data only if its previous send is complete, the first method completes the broadcasts in 4 time units and the second method completes them in 2 time units. Note that the numbers along the data transfers (arrows) indicate the time of completion of each data transfer. It is easy to observe that the second method carefully avoids sequentialization of communication operations within the nodes. Thus the communication schedules for allgather have to be incrementally built by taking into account the already scheduled data transfers, to avoid node-level bottlenecks.

[Figure 1: Broadcast Trees within a Cluster. Two data flows from remote sites (site 0 and site 2) are broadcast within a cluster either over the same tree or over two different trees; the numbers on the arrows give the time steps at which the transfers complete.]

4. COMMUNICATION MODELING
To estimate the time taken for an allgather operation and for scheduling the next data transfer, the individual data transfers need to be modeled. For modeling data transfers, we use 4 network parameters corresponding to a message size m: latency (L), bandwidth (b), send overhead (os(m)) and gap (g(m)). The gap parameter is defined as the minimum time interval between consecutive message transmissions or receptions [7]. We use the parametrized-LogP benchmark [13] to measure os(m) and g(m), and our own communication benchmark program to measure L and b.

The next aspect of modeling relates to host-specific parameters. In an allgather operation, a single host may be processing multiple messages from different source nodes. Hence it is essential to determine the time at which a host will be available for the next send and/or receive (recv). There are two popular models to determine the host available times: single-port half-duplex [10] and single-port full-duplex [2]. In the single-port half-duplex model, a host can either perform 1 send or 1 recv in a single time step. In the single-port full-duplex model, the host can simultaneously perform a send and a receive in a time step. Equations 1-3 show the calculations for the times when a host s will be available for the next send and receive after sending a message m to a host r. Equation 2 shows the calculations assuming the half-duplex model and Equation 3 shows the calculations assuming the full-duplex model. The following terms are used in the equations:

· TotalTime[m] represents the time at which r receives the message.
· TransferTime[m,s,r] represents the time duration for the transfer of a message of size m from a host s to a host r.
· StartTime[m,s,r] represents the time corresponding to the start of the transfer.
· CommLinkAvailTime[s,r] represents the time at which the communication link from host s to host r is available for the transfer of the next message. Note that this time is independent of the time at which the communication link from host r to host s is available for message transfer.
· HostAvailTime[h] represents the time at which host h will be available for the next send or recv. This parameter is used only in the half-duplex model.
· HostAvailTimeForSend[s,h] represents the time at which the host s is ready to send a message to the host h. This parameter is used only in the full-duplex model.
· HostAvailTimeForRecv[h] represents the time at which the host h is ready to recv a message. This parameter is used only in the full-duplex model.

TotalTime[m] = StartTime[m,s,r] + TransferTime[m,s,r]
TransferTime[m,s,r] = Latency(s,r) + messageSize(m) / Bandwidth(s,r)        (1)

Single-port Half-duplex Model:
StartTime[m,s,r] = max( CommLinkAvailTime[s,r], HostAvailTime[s], HostAvailTime[r] )
CommLinkAvailTime[s,r] = StartTime[m,s,r] + g(m,s,r)
HostAvailTime[s] = StartTime[m,s,r] + os(m,s,r)
HostAvailTime[r] = TotalTime[m]        (2)

Single-port Full-duplex Model:
StartTime[m,s,r] = max( CommLinkAvailTime[s,r], HostAvailTimeForRecv[r], HostAvailTimeForSend[s,r] )
CommLinkAvailTime[s,r] = StartTime[m,s,r] + g(m,s,r)
HostAvailTimeForSend[s,r] = StartTime[m,s,r] + os(m,s,r)
HostAvailTimeForRecv[r] = TotalTime[m]        (3)
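To make the bookkeeping behind these equations concrete, the following is a small Python sketch of Equations 1-3 (ours, not the paper's code); the dictionary-based state, the example parameter values and the function name are illustrative assumptions.

```python
# Illustrative sketch of the transfer-time model of Equations 1-3.
# All names and the dictionary-based bookkeeping are assumptions made
# for this sketch; the paper only defines the equations themselves.

def schedule_transfer(s, r, msg_size, net, state, mode="full-duplex"):
    """Return the completion time of sending msg_size bytes from s to r,
    and update the availability times used by Equations 2 and 3."""
    L, B, os_, g = net[(s, r)]          # latency, bandwidth, send overhead, gap

    if mode == "half-duplex":           # Equation 2: one send OR one recv at a time
        start = max(state["link"].get((s, r), 0.0),
                    state["host"].get(s, 0.0),
                    state["host"].get(r, 0.0))
    else:                               # Equation 3: a send and a recv may overlap
        start = max(state["link"].get((s, r), 0.0),
                    state["send"].get((s, r), 0.0),
                    state["recv"].get(r, 0.0))

    total = start + L + msg_size / B    # Equation 1
    state["link"][(s, r)] = start + g
    if mode == "half-duplex":
        state["host"][s] = start + os_
        state["host"][r] = total
    else:
        state["send"][(s, r)] = start + os_
        state["recv"][r] = total
    return total

# Example: two 1 MB transfers into the same receiver over a 1 Gbps LAN link.
net = {("a", "c"): (50e-6, 125e6, 10e-6, 20e-6),
       ("b", "c"): (50e-6, 125e6, 10e-6, 20e-6)}
state = {"link": {}, "host": {}, "send": {}, "recv": {}}
print(schedule_transfer("a", "c", 1 << 20, net, state))
print(schedule_transfer("b", "c", 1 << 20, net, state))  # delayed: c is still receiving
```

Running the example shows the second transfer into host c starting only after the first completes, which is exactly the host-availability constraint that the scheduling algorithm of Section 5 has to respect.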
5. MIN3-ALLGATHER ALGORITHM
We propose a heuristic algorithm, Min3-Allgather, for generating efficient schedules for allgather on grids. The algorithm clusters the hosts participating in allgather according to the bandwidth characteristics of the links between the hosts, and identifies the different levels of the network hierarchy. Data transfers are then scheduled at each level of the hierarchy using a recursive, top-to-bottom approach. The following subsections give the details of our algorithm.

5.1 Identification of Network Hierarchy or Formation of Pool Tree
As described in Section 3, we cluster the network and identify network hierarchies. A cluster (or pool) is a set of hosts such that the average of the bandwidths of the links from a host to all other hosts in the set is greater than or equal to some threshold bandwidth. Thus, the smaller the threshold bandwidth of a pool, the greater the number of hosts in the pool. The pools at the high levels of the network hierarchy (e.g. WANs) have a low threshold bandwidth, and can be split into pools with a higher threshold bandwidth. For example, a WAN can be split into its constituent LANs. Thus a pool may be split recursively into sub-pools. This can be represented in the form of a pool tree. Figure 2 shows a pool tree for a network consisting of 3 LANs A, B and C. Pool1, also referred to as the root pool, consists of all the hosts in the network and has the lowest threshold bandwidth, 3 Mbps. It is split into two pools, Pool2 and Pool3. Pool2 has a threshold bandwidth of 10 Mbps, and Pool3 has a threshold bandwidth of 100 Mbps and corresponds to LAN C. Pool2 is split into Pool4 and Pool5, which correspond to LANs A and B, respectively.

[Figure 2: Example Pool Tree. Pool1 (3 Mbps), the root pool, splits into Pool2 (10 Mbps) and Pool3 (LAN C, 100 Mbps); Pool2 splits into Pool4 (LAN A, 100 Mbps) and Pool5 (LAN B, 100 Mbps).]

To calculate the bandwidth threshold values, we first obtained the bandwidth values between all pairs of machines. Then we sorted these values and created sets such that the bandwidth values in a set are within a range of 10%. For each such set, we found the maximum bandwidth value, and used these maximum values as the thresholds for the formation of the pool tree. This procedure for the formation of pools or clusters is similar to the formation of "logical clusters" in the earlier efforts by Estefanel and Mounie [1] and Lowekamp [15]. These efforts divide the network into subnets based on link latencies and throughput. Our current work is complementary to these efforts, since our allgather algorithm can also use the subnets formed by them.

Given such a pool tree, the algorithm starts at the root node of the pool tree, the root-pool. Each host i belonging to the root-pool has data di for broadcast to the other nodes. As a host can be included in only 1 sub-pool of a pool (in this case, the sub-pools of the root-pool), only 1 sub-pool has di. We schedule data transfers such that every sub-pool of the input pool (the root-pool) will contain data di. To ensure "wide-area optimality" [14], data is transmitted exactly once to a sub-pool. This implies that only 1 host in each sub-pool of the input pool will contain di. However, the allgather operation is complete only when all hosts have the data di. To ensure this, we recursively schedule data transfers considering each sub-pool as an input pool. The recursion stops when the sub-pools of the input pool are individual hosts. Next we describe the details of this algorithm. Section 5.2 describes the scheduling algorithm within a pool, and Section 5.3 describes the complete recursive algorithm.

5.2 Finding Allgather Schedule for Sub-Pools of an Input Pool
Given an input pool, InputPool, and its sub-pools, where each host hosti is contained in one of the sub-pools of the input pool and has data di, the first step is to determine a schedule of inter-sub-pool data transfers such that data di is transferred to the other sub-pools. At the end of this step, one host in every sub-pool will have data di. We denote the set of di, where i ranges over the set of nodes participating in allgather, as DataSet. We denote the host in the input pool containing data di as InitSource[di]. At the end of this first step, each sub-pool will have the DataSet. We introduce the following terms:

· Sources[di] represents the set of hosts that contain di. Initially, this set contains only one host, InitSource[di].
· DestPools[di] represents the set of sub-pools of the input pool such that no host belonging to these sub-pools contains di. Initially this includes all sub-pools of the input pool excluding the sub-pool containing the source host.
· MinTime[di,j] is the minimum time by which di can be transmitted to a host belonging to a sub-pool j belonging to DestPools[di].
· MinDest[di,j] represents the host belonging to sub-pool j that can receive di in MinTime[di,j].
· MinSrc[di,j] is the id of a host in Sources[di] that can send di to MinDest[di,j] in MinTime[di,j].

Function InitSched, shown in Figure 3, performs the initial calculation of these terms. InitSource[di] is added to Sources[di] in line 2, and DestPools are populated in line 5. The MinTime, MinSrc and MinDest corresponding to the transfer of di to a SubPool are calculated by the function FindMin.
Algorithm: InitSched()
input : MsgSize, InputPool, DataSet, InitSource
output: Sources, DestPools, MinTime, MinDest, MinSrc
1   for di ∈ DataSet do
2       add InitSource[di] to Sources[di];
3       for SubPool ∈ InputPool do
4           if InitSource[di] ∉ SubPool then
5               add SubPool to DestPools[di];
6               /* let j be the index of SubPool in DestPools[di] */
7               (MinTime[di,j], MinSrc[di,j], MinDest[di,j]) = FindMin(di, Sources, SubPool, MsgSize);
8           end
9       end
10  end
11  return (Sources, DestPools, MinTime, MinSrc, MinDest);

Figure 3: InitSched()

Algorithm: FindMin()
input : di, Sources, SubPool, MsgSize
output: MinTime, MinSrc, MinDest
1   MinTime = ∞;
2   for s ∈ Sources[di] do
3       for host ∈ SubPool do
4           /* find TotalTime corresponding to transfer of MsgSize data from s to host */
5           if TotalTime < MinTime then
6               MinTime = TotalTime;
7               MinSrc = s;
8               MinDest = host;
9           end
10      end
11  end
12  return (MinTime, MinSrc, MinDest);

Figure 4: FindMin()

Function FindMin, shown in Figure 4, calculates the TotalTime for the transfer of MsgSize data, according to Equations 1-3, for each sender in Sources[di] and each receiver in SubPool. It then finds and returns the minimum TotalTime, MinTime, corresponding to a MinSrc in Sources[di] and a MinDest in SubPool.

After the above initial calculations, we schedule data transfers using a greedy approach, as shown in the ScheduleInputPool() function in Figure 5.

Algorithm: ScheduleInputPool()
input : MsgSize, InputPool, InitSource, DataSet, mode
output: SchedFiles, Sources
1   (Sources, DestPools, MinTime, MinSrc, MinDest) = InitSched(MsgSize, InputPool, InitSource, DataSet);
2   while TRUE do
3       SchedTime = ∞;
4       ComeOut = TRUE;
5       for di ∈ DataSet such that DestPools[di] ≠ ∅ do
6           ComeOut = FALSE;
7           MinDestPoolTime = ∞;
8           for p ∈ DestPools[di] do
9               if MinTime[di,p] < MinDestPoolTime then
10                  MinDestPoolTime = MinTime[di,p];
11                  MinDestPool = p;
12              end
13          end
14          if MinDestPoolTime < SchedTime then
15              SchedTime = MinDestPoolTime;
16              SchedDest = MinDest[di, MinDestPool];
17              SchedSrc = MinSrc[di, MinDestPool];
18              SchedPool = MinDestPool;
19              SchedData = di;
20          end
21      end
22      if ComeOut == TRUE then break;
23      ScheduleTransfer(SchedSrc, SchedDest, SchedData, SchedFiles);
24      add SchedDest to Sources[SchedData];
25      remove SchedPool from DestPools[SchedData];
26      (MinTime, MinSrc, MinDest) = UpdateAfterSched(SchedData, SchedSrc, SchedDest, DestPools, Sources, MsgSize, DataSet, mode);
27  end

Figure 5: ScheduleInputPool()

We identify data SchedData and sub-pool SchedPool such that MinTime[SchedData, SchedPool] is the minimum of all MinTime values returned by the InitSched() function. This involves applying minimization at 3 stages¹. In the first stage, we apply minimization in the FindMin() function to calculate the minimum time required for the transfer of data di to a given sub-pool, SubPool. This is denoted by MinTime[di, SubPool]. In the second stage, we find the destination sub-pool for di, MinDestPool[di], such that MinTime[di, MinDestPool[di]] is minimum over all destination sub-pools (lines 8-13), i.e. Min_{SubPool}( MinTime[di, SubPool] ). In the third and final stage, we find the data SchedData such that MinTime[SchedData, MinDestPool[SchedData]] is minimum over all values of MinTime[di, MinDestPool[di]], i.e. Min_{di ∈ DataSet}( Min_{SubPool}( MinTime[di, SubPool] ) ) (lines 5-21). We denote MinDestPool[SchedData] as SchedPool. We denote the source and destination hosts corresponding to MinTime[SchedData, SchedPool] as SchedSrc and SchedDest, respectively.

¹ Hence the name Min3-Allgather for the algorithm.
We then schedule the data transfer of SchedData between SchedSrc and SchedDest (line 23). After the data transfer, SchedPool is removed from DestPools[SchedData], and SchedDest is added to Sources[SchedData] (lines 24, 25). We also update the host model parameters for SchedSrc and SchedDest using Equation 2 or 3, for the half-duplex or full-duplex model, respectively (line 26). The function for updating the model parameters is shown in Figure 6. These updates lead to the invalidation of MinTime[di, pool] if any of the following conditions is true:

· di is equal to SchedData (lines 2-6). SchedDest is now an additional source of SchedData and can become MinSrc[SchedData, p] for some pool p; hence MinTime, MinSrc and MinDest are recomputed for all remaining destination pools of SchedData.
· MinSrc[di, pool] is equal to SchedSrc, or MinDest[di, pool] is equal to SchedDest (line 10). As the parameters for SchedSrc and SchedDest have been updated, MinTime, MinSrc and MinDest for di and pool need to be updated.
· MinSrc[di, pool] is equal to SchedDest, or MinDest[di, pool] is equal to SchedSrc, and the model used is single-port half-duplex (line 10). In this model, sends and receives are sequentialized at a host; hence the times at which SchedSrc and SchedDest can receive and send messages, respectively, are also updated. Thus MinTime, MinSrc and MinDest for di and pool need to be updated.

We check the above conditions for all di ∈ DataSet and the corresponding pools ∈ DestPools[di]. If a condition is true, the corresponding values of MinTime, MinSrc and MinDest are recomputed by calling the FindMin function. Scheduling of data transfers and updates are repeated till DestPools[di] is ∅ for all di.

Algorithm: UpdateAfterSched()
input : SchedData, SchedSrc, SchedDest, DestPools, Sources, MsgSize, DataSet, Mode
output: MinTime, MinSrc, MinDest
1   /* update host parameters for SchedSrc and SchedDest according to Mode */
2   if DestPools[SchedData] ≠ ∅ then
3       for p ∈ DestPools[SchedData] do
4           (MinTime[SchedData,p], MinSrc[SchedData,p], MinDest[SchedData,p]) = FindMin(SchedData, Sources, p, MsgSize);
5       end
6   end
7   for di ∈ DataSet do
8       if di == SchedData then continue;
9       for p ∈ DestPools[di] do
10          if MinSrc[di,p] == SchedSrc ∨ MinDest[di,p] == SchedDest ∨ (Mode == HalfDuplex ∧ (MinSrc[di,p] == SchedDest ∨ MinDest[di,p] == SchedSrc)) then
11              (MinTime[di,p], MinSrc[di,p], MinDest[di,p]) = FindMin(di, Sources, p, MsgSize);
12          end
13      end
14  end
15  return (MinTime, MinSrc, MinDest);

Figure 6: UpdateAfterSched()

5.3 Finding Allgather Schedule for Pool Tree
For calculating the allgather schedule for the entire pool tree, we first calculate the schedule for the sub-pools of the root pool in the pool tree using the ScheduleInputPool algorithm described above. Next, for each sub-pool, we identify the source host of each di within the sub-pool. We then call Min3-Allgather recursively, treating each sub-pool as an input pool. The recursion stops when the sub-pools of the input pool are individual hosts. The recursive algorithm is shown in Figure 7.

Algorithm: Min3-Allgather
input : MsgSize, Pool, DataSet, InitSource, mode
output: SchedFiles
1   (SchedFiles, Sources) = ScheduleInputPool(MsgSize, Pool, InitSource, DataSet, mode);
2   if NoSubPools(Pool) ≠ NoHosts(Pool) then
3       for SubPool ∈ Pool do
4           for di ∈ DataSet do
5               for s ∈ Sources[di] do
6                   if s ∈ SubPool then
7                       SubPoolSources[di] = s;
8                       break;
9                   end
10              end
11          end
12          Min3-Allgather(MsgSize, SubPool, DataSet, SubPoolSources, mode);
13      end
14  end

Figure 7: Min3-Allgather
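For readers who want to experiment with the heuristic, the following compact Python sketch mirrors the greedy 3-stage minimization and the recursive descent of Figures 3-7. It is our illustration, not the paper's implementation; the pool-tree encoding, the transfer_time callback standing in for Equations 1-3, and all names are assumptions.

```python
# Illustrative sketch of the greedy 3-stage minimization (Figures 3-7).
# A pool-tree node is either a host id (int) or a list of child nodes;
# transfer_time(s, r, size, commit) stands in for Equations 1-3 and is
# assumed to update its own host/link availability state when commit=True.

def hosts(node):
    return [node] if isinstance(node, int) else [h for c in node for h in hosts(c)]

def find_min(d, sources, subpool_hosts, size, transfer_time):
    # Stage 1: cheapest (source, destination) pair moving data d into the sub-pool.
    return min((transfer_time(s, r, size, commit=False), s, r)
               for s in sources[d] for r in subpool_hosts)

def schedule_input_pool(size, children, sources, transfer_time, schedule):
    members = [set(hosts(c)) for c in children]
    dest_pools = {d: {i for i, m in enumerate(members) if not (srcs & m)}
                  for d, srcs in sources.items()}
    while any(dest_pools.values()):
        best = None                                   # (time, data, pool, src, dst)
        for d, pools in dest_pools.items():           # stage 3: over data segments
            for p in pools:                           # stage 2: over destination sub-pools
                t, s, r = find_min(d, sources, members[p], size, transfer_time)
                if best is None or t < best[0]:
                    best = (t, d, p, s, r)
        t, d, p, s, r = best
        transfer_time(s, r, size, commit=True)        # fix the transfer in the model state
        schedule.append((s, r, d))                    # one entry of the allgather schedule
        sources[d].add(r)
        dest_pools[d].discard(p)

def min3_allgather(size, node, sources, transfer_time, schedule):
    if isinstance(node, int):
        return                                        # a single host: nothing left to do
    schedule_input_pool(size, node, sources, transfer_time, schedule)
    for child in node:                                # recurse with per-sub-pool sources
        m = set(hosts(child))
        child_sources = {d: {next(iter(srcs & m))}
                         for d, srcs in sources.items() if srcs & m}
        min3_allgather(size, child, child_sources, transfer_time, schedule)
```

The transfer_time callback is where the model of Section 4 would plug in: it is evaluated without committing for candidate transfers (Stage 1) and committed once a transfer is selected, so that subsequent choices see the updated host and link availability times.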
6. EXPERIMENTS AND RESULTS
In this section, we compare the performance of our Min3-Allgather strategy with various existing strategies. These existing strategies include generic allgather algorithms for homogeneous networks, namely the MPICH algorithm [17] and Spreading Simple [8], the network topology-aware methods for grids of MagPIe [14] and MPICH-G2 [18], and strategies that obtain an allgather schedule for N nodes by combining N broadcast trees corresponding to N distinct root nodes. For obtaining the allgather schedule from individual broadcasts, we use broadcast trees generated by the Mateescu [16] and clusteredSA [9] algorithms. We first describe our methodology for obtaining an allgather schedule for N nodes given N broadcast trees with N distinct roots. We then describe the real Grid setup involving 3 sites that we used for our allgather experiments. We then compare our Min3-Allgather algorithm with the other strategies.

6.1 Allgather from Broadcast Trees
In this strategy, broadcast trees ti corresponding to the broadcast of data di by each process pi participating in allgather are generated. To implement allgather, each process sends and receives data on the basis of the generated broadcast trees. For constructing an allgather schedule of N nodes from N broadcast trees, a process pi can first post N-1 non-blocking receives. pi can then send its data di to the set of processes that are direct descendants of pi in broadcast tree ti. We denote this set of direct descendants as DirectDescendants(ti, pi). pi, on receiving data dk corresponding to a posted non-blocking receive, can send dk to the processes in DirectDescendants(tk, pi). However, this implementation results in poor performance on the ch_p4 device of MPICH. This is because the underlying behavior of MPICH gives better performance if sends and receives are posted in the order of their actual occurrence. This is corroborated in the work by Benson et al. [3]. To achieve better performance for allgather constructed from broadcast trees, we use a min-based algorithm for determining the order of sends at each process. The algorithm produces as output a set of files containing the order of communications of the different data segments for each process. During allgather, the processes read from these files and perform the communications in the specified order. For calculating the transfer times used in the algorithm, we utilized either the single-port half-duplex or the single-port full-duplex model for communications.
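As an illustration of the tree-driven forwarding rule just described, here is a hedged mpi4py sketch (ours, not the authors' implementation); the encoding of each broadcast tree as a children map, the fixed block size, and the tagging scheme are assumptions.

```python
# Sketch of allgather driven by per-root broadcast trees (Section 6.1).
# trees[k] maps a rank to the list of its children in the broadcast tree
# rooted at process k. This is NOT the authors' implementation; it is the
# straightforward on-the-fly variant described at the start of Section 6.1.
from mpi4py import MPI
import numpy as np

def tree_allgather(comm, trees, my_block, block_len):
    rank, size = comm.Get_rank(), comm.Get_size()
    blocks = [np.empty(block_len, dtype='b') for _ in range(size)]
    blocks[rank][:] = my_block

    # Post all non-blocking receives up front (one per foreign block).
    reqs, owners = [], []
    for k in range(size):
        if k != rank:
            reqs.append(comm.Irecv(blocks[k], source=MPI.ANY_SOURCE, tag=k))
            owners.append(k)

    # Forward the local block to its direct descendants in its own tree.
    for child in trees[rank].get(rank, []):
        comm.Send(blocks[rank], dest=child, tag=rank)

    # As each block k arrives, forward it along tree k.
    for _ in range(len(reqs)):
        idx = MPI.Request.Waitany(reqs)
        k = owners[idx]
        for child in trees[k].get(rank, []):
            comm.Send(blocks[k], dest=child, tag=k)
    return blocks
```

This sketch shows only the naive variant; the paper's actual approach instead precomputes the send order offline with the min-based algorithm and has each process replay the order from a file, which avoids the posting-order problem observed on the ch_p4 device.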
6.2 3-Site Grid
In order to evaluate the efficiency of the various strategies for allgather on grids, we utilized a grid consisting of 3 sites: 1. University of Tennessee (UT), Tennessee, USA, 2. Queen's University, Belfast, UK, and 3. Vrije Universiteit, Netherlands. Details of the machine specifications at each site are provided in Table 1. The bandwidths and latencies of the links between these 3 sites were measured offline and are shown in Table 2.

Table 1: 3-Site Grid Setup
Location                                        | Number of machines | Specifications
Torc cluster, University of Tennessee (UT), USA | 8                  | GNU/Linux 2.6.8, Dual PIII 933 MHz, 512 MB RAM, 40 GB hard drive, 100 Mbps Ethernet
Queen's University, Belfast, UK                 | 4                  | GNU/Linux 2.4.20, AMD Athlon 1532 MHz, 1 GB RAM, 30 GB hard drive, Gigabit Ethernet
DAS-2, Vrije Universiteit, Netherlands          | 8                  | GNU/Linux 2.4.21, Dual PIII 996 MHz, 1 GB RAM, 20 GB hard drive, 100 Mbps Fast Ethernet

Table 2: Inter-Site Bandwidths (Mbps) and Latencies (Seconds)
      | UT              | UK               | NTH
UT    | 85.19, 0.00006  | 1.44, 0.05       | 1.25, 0.05
UK    | 1.28, 0.05      | 289.33, 0.000006 | 4.75, 0.076
NTH   | 1.16, 0.05      | 4.75, 0.076      | 81.86, 0.000006

This grid allowed us to test the performance of the algorithms under 3 different setups.
Setup 1. WAN links across continents: for these experiments, we utilized machines from all sites.
Setup 2. WAN links within a continent: for these experiments, we utilized machines from both European sites, i.e. the UK and the Netherlands.
Setup 3. LAN links within a cluster: for these experiments, we utilized machines from the Netherlands site.

6.3 Comparison of Allgather Strategies
The total overhead in the generation of Min3-Allgather schedules includes sorting the link bandwidths (sort), finding the bandwidth thresholds (thres), formation of pools or clusters (pools), and determination of the communication schedules (sched) using the Min3-Allgather algorithm shown in Figure 7. The bandwidths on the links are determined using offline measurements, and efficient mechanisms exist for the measurement and retrieval of the bandwidths [21]. Hence the bandwidth determination is not included in the Min3-Allgather total overhead. The costs of the various overhead components of our algorithm for the 3 experiment setups are shown in Table 3. The times reported for Min3-Allgather in this section were obtained by adding the corresponding total overhead costs shown in Table 3 to the time taken for performing the allgather using the generated schedules.

Table 3: Overhead Costs (usecs.) of Min3-Allgather
Setup   | sort   | thres | pools  | sched  | Total Overhead
Setup 1 | 326.86 | 15.71 | 392.29 | 282270 | 283010
Setup 2 | 50.7   | 9.3   | 85.8   | 166780 | 166920
Setup 3 | 36.75  | 9     | 19.25  | 108930 | 108990

Figure 8 shows the comparison of the allgather schedule generation strategies on the basis of the average² actual run times for allgather on the complete grid setup of 3 sites. Except for small message sizes, the performance of the Min3-Allgather strategy is better than that of all the other strategies. The average performance improvement of the Min3-Allgather algorithm over all the other algorithms is 42% for the half-duplex model and 52% for the full-duplex model. We also find that the MagPIe and MPICH-G2 strategies give higher allgather times than the allgather based on our earlier developed clusteredSA broadcasts.

[Figure 8: Grid Allgather Results for the Complete 3-Site Grid. Total times for allgather with the different algorithms (allgather using Mateescu broadcasts, allgather using clusteredSA broadcasts, MagPIe, MPICH-G2, Min3-Allgather) for message sizes from 1 KB to 512 KB: (a) single-port half-duplex model, (b) single-port full-duplex model.]

² We take the average of 4 run times.

In order to understand the performance difference between the different algorithms for the 3-site grid setup shown in Figure 8, we measured the number of simultaneous communications on the wide-area links during the execution of allgather with a particular algorithm and message size. Figure 9 shows the number of simultaneous WAN communications during the executions of allgather with 3 different algorithms, namely Min3-Allgather, Mateescu's and MagPIe's, for 2 message sizes, namely 256 KB and 512 KB. The results correspond to using the full-duplex model for communications. The x-axis represents the time progression of the allgather executions. For example, Min3-Allgather has the lowest time to completion of the three algorithms, as already seen in Figure 8. We measured the number of simultaneous WAN communications by observing the start and end times of the individual sends and receives relative to the start of the allgather. The overlap in the ranges of these start and end times gives an estimate of the number of simultaneous communications³.

[Figure 9: Number of Simultaneous WAN Communications in Different Algorithms (allgather by Mateescu broadcasts, Min3-Allgather, MagPIe) plotted against time progression: (a) simultaneous WAN communications for 256 KB, (b) simultaneous WAN communications for 512 KB.]

³ Since MPICH-G2's allgather is implemented with MPI_Sendrecv operations, we were not able to time the individual sends and receives, and hence were not able to obtain the number of simultaneous WAN communications. However, the behavior of MPICH-G2's allgather is similar to MagPIe's allgather, and hence our general analysis of the WAN communications in MagPIe's allgather is also applicable to MPICH-G2's allgather.

As shown in Figure 9, both Min3-Allgather and Mateescu's strategy perform a larger number of simultaneous WAN communications than MagPIe's. Although the allgather using Mateescu broadcasts performs more simultaneous WAN communications than our Min3-Allgather, Mateescu's strategy is not wide-area optimal, since it can send a data segment multiple times over the same WAN link. Hence Mateescu's strategy has large execution times, as shown by the large y-axis values in Figure 8 and the large x-axis values in Figure 9. Our algorithm is able to effectively exploit the available bandwidths on WAN links and at the same time ensure wide-area optimality. We also find that the WAN communications in MagPIe's allgather strategy start later than in Min3-Allgather and Mateescu's strategy.
This is because in the initial stages of MagPIe's allgather, local-area communications are performed to collect data at the coordinator nodes of the local clusters. Only in the later stages is data transferred between the coordinator nodes, resulting in WAN communications. Thus, for large fractions of the executions of MagPIe and MPICH-G2, the wide-area networks are not utilized, resulting in large execution times.

To evaluate the performance of our strategy on sub-parts of the 3-site Grid, we conducted experiments on a partial grid consisting of the European machines (the UK and Netherlands sites) and on a homogeneous cluster (the Netherlands cluster). In both these experiments we included the popular homogeneous-network allgather algorithms, namely MPICH and Spreading Simple. Figure 10 shows the comparison of the different strategies for the European sites. We can observe that our algorithm gives better performance than the other algorithms only for message sizes greater than 64 KB. Moreover, when compared to the results for 3 sites, the results for the 2 European sites show a smaller percentage performance improvement for Min3-Allgather over the other algorithms. This is because, in the 2-site setup, Min3-Allgather finds fewer opportunities for the formation of clusters or pools based on bandwidths and for multiple simultaneous transfers on WAN links.

[Figure 10: Allgather Results for the European Sites using the Single-Port Full-Duplex Model. Total times for MagPIe, MPICH-G2, Spreading Simple, MPICH and Min3-Allgather for message sizes from 1 KB to 512 KB.]

Figure 11 shows the comparison of the allgather strategies for a homogeneous network. Note that for this setting we have not included the MagPIe and MPICH-G2 strategies, as they are defined only for grids. For this setting, the algorithms developed for homogeneous networks perform much better. This may be attributed to the fact that, though the Min3-Allgather strategy is able to utilize more bandwidth on WAN links, it does not generate the best possible schedules for LAN links. Hence our Min3-Allgather strategy is applicable only to Grid systems where the network contains a significant number of WAN links.

[Figure 11: Allgather Results for the Netherlands Cluster using the Single-Port Full-Duplex Model. Total times for Spreading Simple, MPICH and Min3-Allgather for message sizes from 1 KB to 512 KB.]

7. CONCLUSIONS
We have developed an algorithm for efficient allgather, a popular many-to-many collective communication operation, on grids. Our Min3-Allgather algorithm is both network topology aware and network load adaptive. The algorithm follows the design principles of clustering of nodes based on transient network characteristics, parallel communications on wide-area links, and incremental construction of communication schedules. Our algorithm achieves better performance on grids, as it exploits more of the available bandwidth on WAN links than other popular approaches like MPICH-G2 [18] and MagPIe [12], and it is also wide-area optimal. Experiments indicate that we achieve an average performance improvement of 52% over existing strategies.

8. FUTURE WORK
We plan to build a service-oriented architecture that constructs application-oriented allgather communication schedules, similar to our previous work for broadcasts [9]. We also plan to cache popular allgather communication schedules. As network loading patterns on a Grid may be repetitive, schedules from the cache could be reused, thus saving the re-computation of communication schedules. We also plan to generalize our strategy for allgather to the alltoall collective communication, which involves the transmission of different data to different processes. This presents unique challenges in scheduling decisions, as routing of data through intermediate nodes may not be beneficial.

9. REFERENCES
[1] L. B.-Estefanel and G. Mounie. Identifying Logical Homogeneous Clusters for Efficient Wide-Area Communication.
In Proceedings of Euro PVM/MPI 2004, LNCS vol. 3241, pages 319-326, 2004.
[2] O. Beaumont, V. Boudet, and Y. Robert. A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In Proceedings of the 11th Heterogeneous Computing Workshop, 2002.
[3] G. Benson, C.-W. Chu, Q. Huang, and S. Caglar. A Comparison of MPICH Allgather Algorithms on Switched Networks. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting, LNCS vol. 2840, pages 335-343. Springer Berlin/Heidelberg, September 2003.
[4] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143-1156, November 1997.
[5] H. Casanova. Network Modeling Issues for Grid Application Scheduling. International Journal of Foundations of Computer Science (IJFCS), 16(2):145-162, 2005.
[6] E. Chan, R. van de Geijn, W. Gropp, and R. Thakur. Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 2-11, 2006.
[7] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 1-12, 1993.
[8] A. Faraj and X. Yuan. Automatic Generation and Tuning of MPI Collective Communication Routines. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 393-402, 2005.
[9] R. Gupta and S. Vadhiyar. Application-Oriented Adaptive MPI Bcast for Grids. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'06), Rhodes Island, Greece, 2006.
[10] L. Hollermann, T.-S. Hsu, D. Lopez, and K. Vertanen. Scheduling Problems in a Practical Allocation Model. Journal of Combinatorial Optimization, 1(2):129-149, 1997.
[11] N. Karonis, B. de Supinski, I. Foster, and W. Gropp. Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance. In IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, pages 377-386, 2000.
[12] T. Kielmann, H. Bal, S. Gorlatch, K. Verstoep, and R. Hofman. Network Performance-aware Collective Communication for Clustered Wide-area Systems. Parallel Computing, 27(11):1431-1456, 2001.
[13] T. Kielmann, H. Bal, and K. Verstoep. Fast Measurement of LogP Parameters for Message Passing Platforms. In IPDPS Workshops, LNCS vol. 1800, pages 1176-1183, Cancun, Mexico, 2000.
[14] T. Kielmann, R. Hofman, H. Bal, A. Plaat, and R. Bhoedjang. MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems. In PPoPP '99: Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 131-140, 1999.
[15] B. Lowekamp. Discovery and Application of Network Information. PhD Thesis CMU-CS-00-147, Carnegie Mellon University, 2000.
[16] G. Mateescu. A Method for MPI Broadcast in Computational Grids. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 13, page 251, Colorado, USA, 2005.
[17] MPICH2 home page. http://www-unix.mcs.anl.gov/mpi/mpich2.
[18] MPICH-G2. http://www3.niu.edu/mpi.
[19] K. Park, H. Lee, Y. Lee, O. Kwon, S. Park, and S. K. H. W. Park. An Efficient Collective Communication Method for Grid Scale Networks. In Proceedings of the International Conference on Computational Science, pages 819-828, Melbourne, Australia and St. Petersburg, Russia, June 2003.
[20] H. Saito, K. Taura, and T. Chikayama. Collective Operations for Wide-Area Message Passing Systems using Adaptive Spanning Trees. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, 2005.
[21] M. Swany and R. Wolski. Building Performance Topologies for Computational Grids. International Journal of High Performance Computing Applications, 18(2):255-265, 2004.
[22] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of Collective Communication Operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49-66, Spring 2005.