Partial Content Distribution on High Performance Networks

Eric Weigle and Andrew A. Chien
Computer Science and Engineering and Center for Networked Systems
University of California, San Diego
{eweigle,achien}@ucsd.edu

ABSTRACT

We present and analyze techniques to efficiently solve the partial content distribution problem: distributing a logical data set to receivers which individually desire only subsets of the total data. This is a more general and fundamentally different problem than traditional whole-file content distribution, providing new challenges and new optimization opportunities. It supports a wider variety of use models, e.g., striped file transfer, scatter/gather, or distributed editing. This work develops new metadata management and transfer scheduling techniques providing good results on high performance networks. Distributed applications in such systems tend to have data requirements more complicated than just total overlap at every node: the transfers desired differ dramatically from whole-file content distribution. Traditional approaches perform poorly in such cases. We provide empirical data exhibiting these limitations, evaluate a new BitTorrent-based implementation of our ideas, and show order of magnitude improvements in bandwidth and latency.

1. INTRODUCTION

The content distribution problem is traditionally to "get identical copies of some data to every machine in the system," and there is a large body of work focusing on such whole-file replication. However, in many situations each machine only requires small parts of a given data set; downloading the entire file is undesirable.

The partial content distribution problem generalizes traditional content distribution. It captures new situations: where nodes desire disjoint portions of a logical data set, e.g. a very large transfer striped across nodes for speed; where there is limited overlap, e.g. distributed collaborative editing of large files; all the way to where there is whole-file sharing, e.g. off-site replication. In general, the question is: how do we efficiently satisfy arbitrary constraints on which nodes have/want which data, so as to globally maximize system performance? The traditional problem is a special case at one end of this design space. The more general problem provides new challenges, such as arbitrary data constraints, as well as new optimization opportunities, such as constraint structure that can be exploited.

Traditional approaches cannot provide the desired solution semantics: when whole-file transfer is applied in low sharing environments, nodes are forced to download useless extra data. The lower the proportion of data actually desired, the worse the overhead. The distinguishing problem feature is limited overlap between receivers' demands and hence limited opportunities to exploit peer-to-peer sharing. To achieve good performance it is critical to efficiently (1) discover and exploit any such overlap and (2) share bandwidth from overloaded source nodes. Systems built for high-sharing environments do not optimize for these features.

We address the problem by adding (1) metadata management features traditionally provided by external applications and (2) enhanced algorithms for scheduling communication. Together these capture the new problem semantics, enhance system expressiveness, and increase global system performance. We build off our prior work with the Composite Endpoint Protocol (CEP) [23], which focused solely on low sharing environments.
This paper extends those techniques to work more effectively with arbitrary constraints and completely re-implements the protocol. To evaluate the approach, we test a `stock' version of the popular BitTorrent content distribution protocol and a hybrid BitTorrent/CEP implementation of these ideas. With some BitTorrent clients now supporting selective file download (inducing partial sharing), this work is particularly relevant. Our results show that this approach is general, working for arbitrary levels of sharing from disjoint to whole-file transfers; efficient, achieving low latency and high capacity; and scalable to thousands of nodes. It provides good results on high performance networks.

Categories and Subject Descriptors: C.4 [Computer-Communication Networks]: Miscellaneous

General Terms: Algorithms, Design, Experimentation, Measurement, Performance

Keywords: Content Distribution, High Performance Networks, CEP

Supported in part by the National Science Foundation under awards NSF Cooperative Agreement ANI-0225642 (OptIPuter), NSF CCR-0331645 (VGrADS), NSF ACI-0305390, and NSF Research Infrastructure Grant EIA-0303622. Support from the UCSD Center for Networked Systems, BigBangwidth, and Fujitsu is also gratefully acknowledged.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC'07, June 25-29, 2007, Monterey, California, USA. Copyright 2007 ACM 978-1-59593-673-8/07/0006 ...$5.00.

The remainder of the paper is structured as follows: first we explain the problem in more detail. Next, Section 3 explains our approach and its implementation. Section 4 outlines our evaluation process and the tests we have run, while Section 5 gives the results and analysis of our experiments. We briefly discuss related work in Section 6 before concluding in Section 7.

2. PROBLEM

Consider a motivating example: a simple striped transfer from several servers to several clients; each client/server pair transfers disjoint parts of a file. This is useful to get around TCP problems in the wide area, to distribute data for analysis on a cluster of nodes, or to logically collect otherwise disparate transfers for management purposes.

Figure 1: A Simple Striped Transfer

This is the simplest case. Consider if clients wanted arbitrary sections of the file rather than just disjoint subsets, e.g. for tiled simulations which require some overlapping data at the edges. Clients must now locate the server providing their data. Clients desiring the same data, i.e. sharing on the receiver side, provide a new optimization path: peer-to-peer transfers. Next, consider servers which provide some arbitrary portion of the file, due to caching, load balancing, etc. Clients must determine from which server(s) to download different subsets of their data. Replica selection and load balancing provide other optimization paths. Finally, consider that the client/server distinction is arbitrary. Any node may both provide and desire data. Balancing client and server tasks is another optimization path.

This is the general form of the partial content distribution problem. The structure in the constraints can and must be exploited to optimize for better performance; these challenges and opportunities are unavailable to systems providing only whole-file replication. For comparison, other current research (see Section 6) has focused on different issues, such as decentralization, scalability, or robustness. Figure 2 (left) gives a graphical view of some current approaches. For lower sharing (right) the problem space is sparsely explored.

Figure 2: Coordination, Sharing, and Performance

To quantify these axes, one can think of sharing as follows: for each node, calculate the average proportion of blocks shared with its peers; then average that across all nodes to get the sharing measure. Explicitly:

Sharing = avg_i( avg_j( |{b_i} ∩ {b_j}| / |{b_i}| ) )

For each node n_i and each reachable peer n_j, this is the overlapping proportion of blocks that n_j has or wants and that n_i wants, aggregated over all nodes. {b_i} and {b_j} are the sets of blocks that nodes n_i and n_j have, and |·| takes the size of the set. Effectively this is a measure of set similarity. When all nodes want the same blocks, this is 1. When all nodes want different blocks, it is 0. When half want one set of blocks and half another, it is 1/2.
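To make the measure concrete, here is a short Python sketch (ours, not part of the paper's implementation) that computes it when each node's desired blocks are given explicitly as a set of block indices:

def sharing(wants):
    """Average pairwise overlap of desired block sets, per the formula above.

    wants: list of sets; wants[i] is the set of blocks node i wants.
    Returns 1.0 when every node wants the same blocks, 0.0 when all
    requests are disjoint.
    """
    if len(wants) < 2:
        return 1.0
    per_node = []
    for i, bi in enumerate(wants):
        if not bi:
            continue
        overlaps = [len(bi & bj) / float(len(bi))
                    for j, bj in enumerate(wants) if j != i]
        per_node.append(sum(overlaps) / len(overlaps))
    return sum(per_node) / len(per_node)

# Ten nodes, half wanting one half of a 100-block file and half the other:
half_a, half_b = set(range(0, 50)), set(range(50, 100))
print(sharing([half_a] * 5 + [half_b] * 5))  # ~0.44; tends toward 1/2 as the node count grows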
Our measure of coordination is the global ratio of metadata to data transferred, regardless of whether the system is centralized or decentralized. Many distributed systems, such as DHTs, require significant amounts of communication to maintain their structure; usually this cost is hidden. In particular, CEP and the highly popular erasure/network coding techniques are at opposite ends of this spectrum: the difference is in the environmental features each approach can exploit. For details on the issues involved see Appendix B.

In summary, our explicit problem is: given one or more senders, each of which has an arbitrary set of blocks, provide each receiver the arbitrary set of blocks it desires. A good solution provides high global bandwidth, low request-response latency, and high inter-node fairness. Discussion of other traits, such as robustness under failure or performance on arbitrary networks, is beyond the scope of this paper.

3. APPROACH

Our approach targets two central features: first, creating measurement and metadata infrastructure to capture state for the current (network latency, capacity, block location) and desired (block location) configuration; second, creating transfer scheduling algorithms which determine the best way to get to the desired state.

Efficient detection and representation of metadata is necessary to enable transfers with complicated data requirements. Without it, we cannot capture application semantics or perform transfer scheduling. Whole-file approaches used in low-sharing configurations perform poorly for exactly this reason. For example, in BitTorrent, peer groups become partitioned and rare blocks do not "trickle through" other nodes. Maintaining a separate torrent for each subset of desired data is infeasible. Our integrated system supporting partial sharing makes this unnecessary.

Efficient transfer schedules and algorithms are the core of this work. Without them, we cannot provide good transfer performance. This is an optimization problem which can be attacked via linear programming, greedy heuristics, implicit block scheduling, or other techniques. We spend the rest of this section discussing the options.
Implementing these algorithms depends on the metadata above, and is simplified by use of a central scheduling node with some global knowledge. Our preferred algorithm (the greedy one) also has a corresponding distributed version. Every algorithm assumes peer metadata is correct. This system is designed for a single user or application to move data in a cooperative manner maximizing global performance. Without incentives for good behavior, which are outside the scope of this work, a malicious node could exploit the system to achieve unfair performance.

3.1 Transfer Scheduling in CEP

A transfer schedule determines when nodes communicate and what data they send. Schedulers fall into two broad categories: explicit schedulers plan ahead and build a schedule which nodes follow, while implicit schedulers use heuristics or forwarding semantics on-the-fly. CEP began as an explicit scheduler but evolved into a hybrid mechanism.

Most distributed systems use implicit schedulers, which makes sense for highly dynamic, heterogeneous, "black box" networks. Long-term plans in such environments are inappropriate. High performance networks tend to be less heterogeneous, less dynamic (some include reservation features), and network structure is often known. We can improve performance on such networks via explicit planning, even with only limited knowledge. Our scheduler works as follows:

1. A scheduling node (i.e. the BitTorrent tracker, which already keeps track of peers) determines current and desired state as specified by clients. That node runs a scheduling algorithm on the data to compute a schedule. We discuss this further in the following sections.

2. Each node retrieves its part of the schedule: a list of peers, blocks, and rates. It tries to implement the plan but may diverge by optimistically attempting to download data faster or from other nodes, so long as it does not renege on the upload constraints provided by the scheduler.

3. Periodically everything is rescheduled to limit divergence and better exploit new system state.

In general, the algorithms take as input data constraint graphs like Figure 1 or Figure 15 (Appendix B). If known, network constraints are also included. The output is a weight for each edge of the graph, indicating the rate at which a node should upload or download blocks, and over which links. Maximizing the sum of egress edges (weights of edges to clients) minimizes the total termination time. This differs from "network flow" problems [4, 6], which maximize instantaneous performance.

The Linear Programming (LP) algorithm is used as our baseline, as it is known to generate optimal results. Table 1 gives the mapping between the graph and equations which can be input to an LP solver. These equations produce a schedule in which all nodes terminate at the same time, and no node could terminate earlier without making another node terminate later. This gives a global minimum aggregate termination time. Unfortunately, we will see in Section 5.5 that the LP algorithm is very expensive.

The original algorithm as specified in [23] was designed for low-sharing environments and performed poorly when sharing was high. To address this, we have extended the algorithm with a "High Sharing" optimization similar to the two-phase technique used in Bullet [14] or the Super Seed mode in some BitTorrent clients. Our algorithm extends these to a more general case: it supports arbitrary sharing and data constraints. Figure 3 lists the new algorithm.
# Determine proportional demand
For each block b:
    demand[b] = Σ_r down_rate[r][b]
total_demand = Σ_b demand[b]
capacity = Σ_r up_rate[r]
For each receiver r:
    allocation = (up_rate[r] / capacity) * total_demand
    # Supply proportional to demand
    For each block b:
        If demand[b] <= allocation and demand[b] > 0 and down_rate[r][b] > 0:
            seed[r][b] = True
            allocation -= demand[b]
            demand[b] = 0

Figure 3: CEP High-Sharing Optimization

At a high level, it works by breaking the transfer into an initial `seeding' phase where blocks are distributed across the network, and a `feeding' phase where remaining nodes download those blocks directly. Both phases apply the original CEP algorithm to schedule the transfer, but in the seeding phase constraints are specially constructed to prepare for an efficient feeding phase. The goal is to distribute blocks such that demand (load) on a given node is proportional to its supply (capacity).

The input in Figure 3 is down_rate[r][b], the speed at which receiver r wants to download block b, estimated as its total capacity divided by the number of blocks desired, and up_rate[r], the total upload capacity of receiver r. The output is seed[receiver][block], which is True if the receiver should get the block in the seeding phase. This result is passed as the output constraints for the seeding phase and the input constraints for the feeding phase. Thinking of this as acting on constraint graphs, it effectively divides the graph such that nodes get fair "slices." This distribution works well provided network bottlenecks are known; otherwise, the capacity estimates may place inappropriate load on poorly-connected nodes.

When used in a steady state, rescheduling as nodes come and go, one feature of this scheduling technique is that the priority of flows is inversely proportional to their age. New flows have the most remaining to transfer, become global bottlenecks, and are immediately allocated high capacity. In this way they can, e.g., build up a buffer of frames for video playback quickly. Flows near completion do not determine global performance and are low priority; the optimistic mechanisms above let them complete without starvation. This also provides moderate incentives for nodes with large amounts of data to stay in the system.
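The pseudocode of Figure 3 translates almost directly into Python. The sketch below is our own minimal, runnable rendering under simplified assumptions (down_rate as a dict of dicts, up_rate as a dict, and the function name plan_seeding is ours); it is not the CEP implementation itself:

def plan_seeding(down_rate, up_rate):
    """Decide which receiver seeds which block, following Figure 3.

    down_rate[r][b]: rate at which receiver r wants block b.
    up_rate[r]:      total upload capacity of receiver r.
    Returns seed[r] = set of blocks r should fetch in the seeding phase.
    """
    # Proportional demand per block, aggregated over receivers.
    demand = {}
    for r, blocks in down_rate.items():
        for b, rate in blocks.items():
            demand[b] = demand.get(b, 0.0) + rate
    total_demand = sum(demand.values())
    capacity = sum(up_rate.values())

    seed = {r: set() for r in down_rate}
    for r in down_rate:
        # Each receiver seeds a share of total demand proportional to the
        # upload capacity it brings to the swarm.
        allocation = (up_rate[r] / capacity) * total_demand
        for b in down_rate[r]:
            if 0 < demand[b] <= allocation:
                seed[r].add(b)
                allocation -= demand[b]
                demand[b] = 0.0
    return seed

# Two receivers with equal capacity that both want blocks 0-3:
down = {"r1": {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0},
        "r2": {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}}
print(plan_seeding(down, {"r1": 10.0, "r2": 10.0}))

In this example the seeding work splits evenly: r1 seeds blocks {0, 1} and r2 seeds blocks {2, 3}, after which each can feed the other.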
3.2 Greedy Algorithm

Our primary "Greedy" algorithm is much faster than the LP approach. It attempts to globally balance supply and demand by appropriately dividing estimated capacity among senders and receivers. At a high level, this optimizes for the last nodes to finish, the current global bottlenecks. Other nodes can utilize spare capacity optimistically, as described above.

Table 1: Linear Programming Constraint Equations

Data demand constraints (all receivers get their data): r_i > 0 for each link to a block or block range.
Capacity constraints (nodes have limited speed): 0 ≤ Σ s_i ≤ 1 where the s_i are all links from one server; 0 ≤ Σ r_i ≤ 1 where the r_i are all links supplying some block or block range.
Supply and demand constraints (supply matches demand): Σ egress capacity = Σ ingress capacity (i.e., weights ≤ node speed).
Fair flow constraints (nobody finishes early): Σ ingress capacity_i / data downloaded_i = Σ ingress capacity_0 / data downloaded_0 for every node i.
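To make Table 1 concrete, the toy instance below feeds such constraints to SciPy's linprog (our choice of solver; the paper does not name the LP library it used). One source with 1000Mbps of upload capacity serves two receivers that still need 600 and 300 data units and can download at most 800 and 400Mbps; the variables are the two edge rates, and the fair-flow constraint forces equal finish times:

from scipy.optimize import linprog

U = 1000.0                 # source upload capacity (Mbps)
demand = [600.0, 300.0]    # data each receiver still needs
down_cap = [800.0, 400.0]  # receiver download capacities (Mbps)

# Maximize x1 + x2 (linprog minimizes, so negate the objective).
c = [-1.0, -1.0]
# Capacity constraint: x1 + x2 <= U.
A_ub, b_ub = [[1.0, 1.0]], [U]
# Fair-flow constraint: x1/d1 - x2/d2 = 0, i.e. equal finish times.
A_eq, b_eq = [[1.0 / demand[0], -1.0 / demand[1]]], [0.0]
bounds = [(0.0, down_cap[0]), (0.0, down_cap[1])]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x1, x2 = res.x
print("rates:", x1, x2, "finish time:", demand[0] / x1)

The solver returns rates of roughly 667 and 333Mbps, so both receivers finish after 0.9 time units. In the full problem there is one variable per edge of the constraint graph, which is part of why Section 5.5 finds the LP approach so expensive.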
3.3 Implicit Scheduling Algorithms

The CEP approach above is exactly opposite to the BitTorrent approach. BitTorrent peers slowly ramp up (tit-for-tat when one has no blocks means one can download nothing) and use more wasteful techniques at the end of their life (to avoid problems with stragglers). The difference between the approaches is the trust model. BitTorrent tries to enforce sharing explicitly in its algorithm, with perhaps limited success [18]. CEP assumes this is done by an external agent via, e.g., performance evaluation and blacklisting.

Many distributed block transfer systems, including BitTorrent, use Random Rarest First (RRF) to select blocks, falling back to linear download only if the sender/receiver ratio is high. This maximizes redundancy in the network, but means that files must typically be downloaded entirely before they can be used. CEP uses linear downloading, providing partial data to the application immediately.

Other implicit mechanisms have nodes determine their location in a structure (such as a tree [8], mesh [14], or ring [21]) and forward along the structure. This implicitly determines the schedule based on the location and timing of block reception. Other work has shown how to do this efficiently, but it does not translate to low-sharing situations.

4. EVALUATION

Our evaluation methodology is to implement these ideas and then perform experiments on a real cluster infrastructure. We focus on comparing the performance of our code, a stock BitTorrent client, and the theoretical maximum performance. "Performance" here refers to bandwidth (the aggregate bandwidth achieved among all nodes, or the CDF of nodes' completion), latency to retrieve a given block, or fairness between nodes for bandwidth, completion time, or blocks uploaded. Tests are designed to exhibit the features and drawbacks of each technique.

4.1 Implementation

Half the implementation task was enhancing a BitTorrent client and tracker with mechanisms to support partial-sharing transfers and a transfer scheduling framework. The other half was implementing the transfer scheduling algorithms, extending CEP, and gluing everything together.

We chose BitTorrent as it is the most popular tool for content distribution, accounting for as much as 80% of the background traffic on the Internet [17] with approximately 50 implementations [24]. Of these, we selected the BitTornado client [12] for our tests. While not the most popular client, it has source code available, reasonable baseline performance, and the features we wanted. BitTornado is written in Python, making exploratory modifications simple.

BitTornado supports selective download in a limited way: one may specify priorities for files in a set or disable their download. Unfortunately, these priorities are merely suggestions. Blocks from disabled files may be downloaded anyway (because they are needed to checksum desired files), and low priority files may be downloaded first (because higher priority files' blocks are unavailable). This code was expanded and rewritten to allow specification of specific block requests, enabling arbitrary partial-file downloads. It was also changed to enforce hard limits on block downloads.

Next, an API was added to allow use of the modified client as a library. This is important for practical purposes, namely test automation, as well as our interest in further integration with applications. We used the simplest interface possible: one specifies a global torrent identifier (usually a filename for the .torrent file) and a list of blocks desired (e.g. "1-4,910"). More efficient representations are used internally.
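Parsing such a block list is straightforward; the helper below is purely illustrative (our code, with a hypothetical parse_blocks function) and assumes comma-separated block numbers with inclusive dash-separated ranges:

def parse_blocks(spec):
    """Parse a block request such as "1-4,7,9-10" into a sorted list of block indices."""
    wanted = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            wanted.update(range(int(lo), int(hi) + 1))
        else:
            wanted.add(int(part))
    return sorted(wanted)

print(parse_blocks("1-4,7,9-10"))   # [1, 2, 3, 4, 7, 9, 10]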
The tracker metadata management was extended and partially replaced with code from CEP [23] to determine locally optimal data transfers between nodes. Code to collect from and reply to nodes with more useful information was also added (e.g. collecting network bandwidth measurements, and ensuring replies to nodes contained peers which could supply their requests).

Lastly, we also attempted to tune default BitTorrent parameters (e.g. timeouts for peer requests, optimistic unchoke, etc.) and behavior (e.g. tit-for-tat) for high-performance environments. Generally this involved reducing values which were set by default on the order of minutes for typical Internet applications; this is inappropriate for high bandwidth environments with stronger latency requirements.

4.2 Experiments

Our experiments are run on a typical "High Performance" cluster. This refers to 1Gbps or faster links in the LAN and 10Gbps in the WAN. The cluster has approximately 300 nodes, a mixture of HP and Dell servers accessed through a batch scheduling interface (which limits the number of nodes we can utilize to 128). Each node has a 1.6GHz or faster CPU, 2GB or more memory, and a Gigabit Ethernet NIC. We use Dummynet [19] on a smaller 24-node cluster to model higher-latency networks with slower nodes; this approach is due to problems with wide-area infrastructure and lack of administrative access to the larger cluster.

Our first tests evaluate system scalability. We perform a traditional one-to-many full-file transfer varying either file size or number of nodes to provide a baseline for comparison. These tests are done on the cluster above. We benchmark system components to determine bottlenecks and scaling limits beyond that point. The results also show the appropriate ranges under which to do future testing.

The next set of tests exercises the new sub-file distribution code. For a fixed number of peers p and file size s we vary the data layout from totally disjoint up to totally shared. This is done as follows: first we equally distribute each peer p_i along a virtual line of length s. Each peer's data request is now centered at c = (i + 1/2) · (s/p). We then vary the amount of data d each node will download from 1 to 2s; for a given d each node p_i will download from max(0, c - d/2) to min(c + d/2, s - 1). For d < s/p these requests are totally disjoint; for larger d, nodes' requests overlap increasingly with their peers'. Results are analyzed using the sharing measure described in Section 2.
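The layout is easy to reproduce; the following helper (ours) returns each peer's requested byte range for a given file size s, peer count p, and per-peer download amount d, following the formulas above:

def virtual_line_layout(s, p, d):
    """Requested (start, end) range for each of p peers on a file of size s.

    Each peer i is centered at c = (i + 1/2) * (s / p) and asks for d bytes
    around that center, clipped to the file; d < s/p gives disjoint requests,
    larger d gives increasing overlap with neighbours.
    """
    ranges = []
    for i in range(p):
        c = (i + 0.5) * (s / p)
        start = max(0, int(c - d / 2))
        end = min(int(c + d / 2), s - 1)
        ranges.append((start, end))
    return ranges

# 4 peers on a 100-byte file: d=20 is disjoint, d=50 overlaps neighbours.
print(virtual_line_layout(100, 4, 20))
print(virtual_line_layout(100, 4, 50))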
Subsequent tests quantify the effect of the plethora of BitTorrent tuning parameters and other changes we have tested (e.g. disabling tit-for-tat). This provides a more direct means of comparing BitTorrent with CEP, which is already tuned for high performance in these environments. By comparing performance with different features enabled, we discover which provide the most benefit and which are insignificant.

The last set of tests evaluates the hidden costs of the various techniques, particularly the CPU cost involved in computing a transfer schedule. We benchmark algorithm runtime and the resulting transfer time for common workloads to determine whether the techniques are appropriate in, e.g., compute-bound cluster environments versus network-bound residential machines.

All tests are performed with cold caches (no data cached anywhere) to allow fair comparison across tests. Short transfers are ultimately limited by disk bandwidth. Longer transfers can theoretically exploit caching, particularly in high-sharing configurations, but BitTorrent's RRF block selection interacts poorly with caching algorithms. These per-node features are not immediately obvious as our experiments focus on aggregate performance.

5. RESULTS

This section presents the results from the experiments we outlined in the Evaluation section. We begin with simple baseline comparisons, continue with scalability and partial-file distribution tests, compare the effects of different transfer mechanisms and tuning, and finally look at the costs involved in transfer scheduling. In general, the results achieved follow our expected performance.

5.1 Baseline Performance

Our first tests are of system scalability: performance as a function of the number of nodes. Figure 4 shows a graph of the completion time of full-file transfers for various numbers of nodes with stock BitTorrent. In this graph, for traditional content distribution, a node is complete upon receiving a copy of the entire file. For reference, the performance of a 128-node multicast tree is given as a baseline (this is without pipelining; more intelligent multicast trees or multicast meshes can do even better). These results help place the values for our later modifications into context, and show several features of the BitTorrent system that are relevant to later discussion. The completion time is roughly logarithmic in the number of nodes, which is desirable. BitTorrent scales well with high receiver sharing. However, the total completion time is longer than necessary: a naive multicast tree would get data to all nodes much more quickly. Also, there is a large startup time: for most of a transfer few nodes complete, then suddenly the majority of nodes finish with a few stragglers. The primary reason for these results is the source's behavior in BitTorrent: it has to verify the entire file before it can be shared, and tries to distribute blocks across the whole network rather than just to carefully selected nodes.

Figure 4: BitTorrent Nodes Complete vs. Time (percent of peers complete vs. time in seconds, for 4-128 peers and a multicast baseline)

The next figure compares the aggregate completion time of BitTorrent and CEP, both for the original "default" greedy algorithm and with the "high sharing" optimization. The linear programming algorithm's performance is equivalent to the default greedy algorithm's and is not shown. For small numbers of nodes, the stock CEP algorithm is best: these are effectively low-sharing environments and the source node can supply all nodes at high speed. As the number of nodes grows, however, the source is saturated and the completion time grows (the source transfer rate is consistently around 800Mbps for all tests). BitTorrent scales as in the prior graph. The High Sharing optimization initially performs between the other two; with few nodes the two-phase transfer induces higher overhead than the default one-phase CEP transfer. As the number of nodes grows, however, it outperforms BitTorrent by 25-40%. Though we were unable to physically test larger numbers of nodes, ns-2 [22] simulations have shown system scalability to 10,000 nodes. Real-life BitTorrent swarms on the order of several thousand nodes are common and work well. Lastly, the internal algorithms and system design work well for large problem sizes, as shown in Section 5.5.

Figure 5: Aggregate Completion Time (seconds vs. total peers, for CEP default, BitTorrent, and CEP with the high-sharing optimization)

Also of interest is the distribution of completion times for nodes, a measure of the effectiveness of fairness mechanisms in BitTorrent. Figure 6 shows the actual and sorted node completion time distributions for another run of the 128-node case above. We evaluate fairness via Jain's measure from [13]; the farther this value is from 1.0, the less fair the system. With t_i as the termination time for node i, fairness is given by: fairness = (Σ_i t_i)² / (n · Σ_i t_i²).
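For reference, Jain's index as defined above is a one-line computation; the helper below is our own:

def jain_fairness(times):
    """Jain's fairness index over per-node termination times: (sum t)^2 / (n * sum t^2)."""
    n = len(times)
    return sum(times) ** 2 / (n * sum(t * t for t in times))

print(jain_fairness([50.0, 60.0, 70.0, 90.0]))  # ~0.95; 1.0 means identical times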
Figure 6: Distribution of Completion Time (per-peer and cumulative completion times in seconds vs. peer number, for BitTorrent and CEP)

The spread of the completion times is quite large: the first node receives a full copy in 50 seconds while the slowest takes almost 90 seconds, 80% longer. While we would like to see more consistent results, the majority of nodes fall into a uniform distribution about the mean quite nicely. We get fairness values of 0.992 for CEP and 0.988 for BitTorrent, effectively the same.

The next test evaluates the effect of block size and transfer size in BitTorrent. We either fix the block size at 1KB and vary the number of blocks, or fix the number of blocks (32) and vary the file size. The transfer uses 1 seed and 32 peers. The goal is to find where the mechanisms used start to fail.

Figure 7: Performance Effects of Block Size (bandwidth achieved in Mbps vs. file size, for a variable number of 1KB blocks and for 32 variably sized blocks)

Figure 7 shows the average bandwidth achieved per peer. Initially performance is poor as overhead dominates; as file size grows, performance increases. Eventually, the internal mechanisms start to have problems. Transfers of 256MB would not complete with these parameters: 32 blocks of 8MB were causing problems with timeouts, and the system could not efficiently manage 262,000 1KB blocks.

This figure demonstrates three things for this environment. First, transfers of 1MB are necessary and sufficient to amortize global overhead such as system initialization, file read time, network connection time, etc. Second, BitTorrent works well with block sizes from 1KB to 2MB, but has issues with blocks 4MB or larger. Third, metadata management can handle up to 100,000 blocks but does not scale well beyond that. To achieve the same performance, higher latency networks require proportionally larger transfers and parallelism, primarily to work around issues using stock TCP for data transport. Similarly, larger blocks are desirable on such networks to let TCP congestion windows fully open; this requires further tuning of internal BitTorrent parameters and changing buffer management schemes (see Section 5.4 for more information). Lastly, with fewer than 32 blocks, the scheduling techniques in the system cannot be exploited effectively and performance also suffers. Given the above limits and appropriate selection of block size, the system theoretically scales to handling transfers between 1MB and 200GB with high performance.

The default configuration of BitTornado uses the following table to select block size as a function of total torrent size:

Torrent Size    | Piece Size | Pieces
8GB or more     | 2MB        | 4096+
2GB to <8GB     | 1MB        | 2048-8192
512MB to <2GB   | 512KB      | 1024-4096
64MB to <512MB  | 256KB      | 256-1024
16MB to <64MB   | 128KB      | 128-512
4MB to <16MB    | 64KB       | 64-256
<4MB            | 32KB       | 1-128

This approach keeps the total number of blocks on the order of hundreds for small torrents and on the order of a few thousand for larger torrents. The values fall in the desirable range for all file sizes up to the 200GB limit. Thus, for files above 1MB in size, BitTorrent performance is dictated not by file size, but by block size. CEP performance is not directly comparable here; it is range-based internally, so block size is meaningless. However, in general the larger the transfer the better the performance; transfers of 1MB or smaller files are dominated by network delay and other overheads.
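A selector that mirrors the BitTornado defaults above can be written as follows; this is a paraphrase of the behavior described, not BitTornado's actual source, and the function name is ours:

def default_piece_size(torrent_size):
    """Piece size in bytes as a function of total torrent size, per the table above."""
    GB, MB, KB = 2 ** 30, 2 ** 20, 2 ** 10
    thresholds = [        # (minimum torrent size, piece size)
        (8 * GB,   2 * MB),
        (2 * GB,   1 * MB),
        (512 * MB, 512 * KB),
        (64 * MB,  256 * KB),
        (16 * MB,  128 * KB),
        (4 * MB,   64 * KB),
    ]
    for minimum, piece in thresholds:
        if torrent_size >= minimum:
            return piece
    return 32 * KB

print(default_piece_size(256 * 2 ** 20) // 2 ** 10, "KB")  # 256MB torrent -> 256KB pieces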
5.2 Partial Content Distribution

The first test examines performance based on sharing. This is the third permutation of axes from Figure 2 and the central feature distinguishing different systems. Using the `virtual line' data layout as described in the Evaluation section, we vary the sharing factor from 0 to 1. We use a fixed number of nodes (32) and file size (256MB); these were selected to be in a range where all systems performed reasonably. We plot the mean with error bars to show the standard deviation. With no sharing among peers, stock CEP is able to most efficiently use the source's upload capacity to satisfy node demands. The bandwidth achieved is invariant under the level of sharing: it is limited by network link capacity and stack performance. BitTorrent performs poorly for very low sharing, but soon begins to perform better: a sharing ratio of 0.3 means each node shares approximately 1/3 of its data with every other node (although different subsets), meaning that there are a large number of possible sources for desired blocks.

Figure 8: Bandwidth vs. Sharing (aggregate bandwidth in Mbps vs. sharing factor, for CEP with high sharing, BitTorrent, and stock CEP)

CEP with the high sharing optimization is roughly equivalent to stock TCP for low sharing; for intermediate sharing the inefficiency of a two-phase transfer mechanism keeps performance below that of BitTorrent. For higher sharing, however, the initial seeding phase allows for very efficient transfers. The main problem we see here is high variability due to the well-known `straggler' problem: the last node to finish determines the system termination time, and hence aggregate bandwidth. In the future, we are looking at ways to minimize the impact of such nodes.

Our next test focuses on the lowest-sharing case, when data is striped across a cluster (this is how GridFTP [9] transfers typically work). Each node must locate and retrieve a small 1KB block of data from the single source node. Metadata management rather than transfer performance is being exercised. Figure 9 shows the completion time for increasing numbers of nodes. Error bars are shown for the min and max over 5 trials.

Figure 9: Completion Time vs. Total Nodes (Partial Transfers) (completion time in seconds, log scale, vs. total peers, for CEP and BitTorrent)

As the number of nodes grows, the first node to complete its transfer tends to terminate in about the same amount of time. However, the distribution of times continually grows, with the last node to detect and download its block taking 10 times longer for 25 nodes than for 2 nodes (note the log scale). The stock management schemes in BitTorrent rely heavily on local peers being able to provide copies of blocks. When that assumption fails, the system cannot efficiently satisfy complex transfer requirements. In comparison, CEP has an order of magnitude lower latency and smaller result deviation; it was designed for environments where latency is a concern.

The next graph extends the ideas above, showing the effects of parallelized downloads. With a single sender we increase the number of nodes requesting disjoint blocks out of a 128MB file and report the bandwidth per node averaged over 5 tests. Error bars are shown to one standard deviation.

Figure 10: Partial Transfer Rate and Failures (transfer rate and actual/expected failure percentage vs. total peers)

As before, there is a large amount of variation in the results; while the general pattern is what we expect, it is unpredictable. Furthermore, around 60 peers (the default number of nodes the tracker replies with when a node makes a metadata request) failures start occurring: some nodes do not get the peer that has their block in the response set, and do not find it before the 20 minute test is over. The flat metadata model in BitTorrent cannot cope with this type of disjoint data. Ironically enough, this is tied to a transitory performance increase; BitTorrent is empirically known to perform best with 40-60 peers (hence the default) and this is seen in the data. Yet as the number of nodes continues to grow, the difficulty in finding a peer with a desired block begins to dominate, and performance falls drastically.

The "expected failures" line plots the expected portion of nodes that do not find a peer with their desired block. That is, the probability that the peer with their block falls outside the set of peers they know about: (p-60)/p. This is divided by the average number of peer re-request timeouts, 2 for this test.
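As a quick check of that model (our arithmetic, using only the formula just given):

def expected_failures(p, reply_size=60, retries=2):
    """Expected fraction of peers that never find their block's owner: ((p - reply_size)/p) / retries."""
    if p <= reply_size:
        return 0.0
    return ((p - reply_size) / p) / retries

print(expected_failures(100))  # 0.2, i.e. about 20% expected failures at 100 peers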
The final graph of this section (Figure 11) shows the same data as the prior graph but includes stock CEP performance. Ideally, increased parallelism should increase performance up to the sender's capacity and remain consistent until overhead begins to dominate. We see exactly that: CEP performance peaks around 64 nodes (when each node is downloading 4MB in parallel) and falls beyond that point.

Figure 11: Partial Transfer Rate with CEP (aggregate bandwidth in Mbps vs. total peers, for stock CEP and BitTorrent with scheduling)

5.3 Performance on Larger Networks

The prior results have focused on local transfers to avoid the confounding factor of TCP performance in the wide area, which has well known problems. While a naive Python client/server application achieves 1.7Gbps over the loopback interface, we can achieve at most 250Mbps through Dummynet, which decreases steadily as delay increases (Figure 12).

Figure 12: Stock TCP performance in the WAN (bandwidth in Mbps vs. delay in ms, for 100Mb and 1Gb links)

See Appendix A for more information on Python performance. Given this behavior, the comparison between BitTorrent and CEP does not qualitatively change in the wide area: the issues with TCP affect both approaches equally. Both BitTorrent and CEP implicitly address the issue via parallelism in block downloads, while CEP also has some TCP buffer tuning mechanisms built in. The metadata management schemes in both approaches are delay insensitive so long as the block transfer time is significantly longer than the round-trip time.

5.4 Effects of BitTorrent Tuning

BitTorrent has a variety of parameters accessible via the command line, and others that can be tuned by modifying the source code. We look at the time for a 256MB whole-file transfer with 32 nodes. "Baseline" is the default performance from CVS, "No Double Check" turns off potentially costly download verification, "More Uploads" increases the number of concurrent uploads to 20, "More Unchokes" allows more data transfers outside the tit-for-tat scheme, "Super-Seed" sends a copy of every block into the network before repeating, and "Larger Slices" increases the maximal transfer size to 1MB.

Tuning Method   | Time Mean (s) | Ratio
Baseline        | 48.1          | --
No Double Check | 48.9          | +1%
More Uploads    | 52.3          | +8%
More Unchokes   | 52.5          | +9%
Super-Seed      | 90.6          | +88%
Larger Slices   | 154.9         | +222%

Without exception, all of this "tuning" hurt performance. In fact all variations of parameters we tried, other than the defaults, hurt performance: the system is surprisingly well tuned even though it was designed for Internet topologies rather than high performance networks. Super seed in general performs poorly as it artificially limits transfer options in favor of block diversity. Larger slices perform poorly due to problems with blocking and string management overhead in Python (see Appendix A).

5.5 Technique Cost

More complex transfer scheduling produces better transfer performance, but at increased computational cost. Figure 13 shows the run time of the algorithms for increasing problem sizes, with best-fit curves where testing became infeasible. Tests were done on a 3.2GHz Pentium 4 processor. The input problem is simple striping: p peers have data and p peers want that data. An optimal solution is any set of 1-1 node pairings (or similar solutions at the block level). As CEP scheduling time is basically invariant under the data layout, this layout favors the LP and BT schedulers: it is trivial to solve by hand and should be easy for the LP library to solve.

Figure 13: Algorithm Computation Time (time to produce a schedule vs. total peers, for linear programming, BitTorrent, and greedy schedulers, with best-fit curves)

The figure shows that the simple greedy algorithm performs well, taking less than 100ms to schedule a transfer for up to 10,000 nodes, with linear growth for larger problem sizes. For comparison, this scheduling time is less than one round-trip time in the current OptIPuter WAN (160ms San Diego-Amsterdam). The linear programming algorithm is significantly slower. Solving an 8192-node problem took several minutes, which is unacceptable for any but the longest transfers. This is particularly true as the solution quality is no better than the greedy algorithm's: both, when implemented, result in equivalent data transfer times. Run time is quadratic in the number of nodes. BitTorrent scheduling resolves to selecting random peers (at the tracker) and maintaining sorted lists of communication partners (at the peer). Currently, these operations are efficient up to 50,000 nodes. Beyond that, the overhead of Python's automatically sized and type-tagged data structures uses too much memory, peaking at over 1.2GB for the 50,000-node case and causing the machine to go into swap. Minor implementation changes could easily avoid this.

6. RELATED WORK

There is a vast body of work on content distribution using assorted mechanisms, some of which we have touched upon in discussion already.
Traditional CDNs usually work via multicast: explicitly by constructing a tree [8] or mesh [14] and forwarding data along it, or implicitly by the action of peer-to-peer block selection algorithms [2] or on-demand caching [1]. Such systems focus on whole-file sharing and neither capture nor optimize for the partial-file semantics we have discussed in this work.

Distributed filesystems [3, 10, 11], distributed shared memory, and some communication libraries [15, 25] can be seen as partial content distributors. However, current work in these areas solves different problems. They focus on locking, data consistency, and local-area performance for on-demand data requests; we focus on performance for primarily static data with more complicated distribution. Such systems capture insufficient data (block request streams) to perform the kinds of optimization available to a content distribution system (which has more global knowledge of peers and demand) and in the worst case fall back on making serial copies.

DHTs such as Pastry [20] or Chord [21] provide a useful abstraction for distributed data management. They do not in themselves support transfers of large amounts of data, nor provide a single point at which transfer optimization can occur. DHTs by their nature suffer problems with high lookup latency, hot spots, and data consistency, although these are all being actively studied.

Lastly, network coding is a highly related technique for whole-file content distribution; we discuss some of its features in Appendix B. Unfortunately, its performance does not translate to partial-file transfers.

7. CONCLUSIONS

This paper introduced the idea of the "partial content distribution problem" (content distribution when the demand shared between receivers is less than total) and showed how this can occur in distributed applications. Evaluation of current approaches, which are designed for environments with high sharing, has shown poor performance when applied to partial content distribution problems. This work has shown new techniques using transfer scheduling and limited global knowledge to address these issues, dramatically improving performance. Lastly, an implementation of these ideas using popular commodity software has shown efficient, scalable, high-speed transfers under a variety of conditions.

8. REFERENCES

[1] Akamai Technologies Inc. Akamai content distribution system. http://www.akamai.com/.
[2] B. Cohen. The BitTorrent file sharing protocol. http://bittorrent.com/.
[3] Cluster File Systems Inc. Lustre: scalable, secure, robust, highly-available cluster file system. http://www.lustre.org, 2006.
[4] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19:248-264, 1972.
[5] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In SIGCOMM, pages 56-67, 1998.
[6] L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[7] R. Dougherty, C. Freiling, and K. Zeger. Unachievability of network coding capacity. IEEE Transactions on Information Theory, 52(6):2365-2372, June 2006.
[8] S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking, 5(6):784-803, Dec 1997.
[9] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, and S. Tuecke.
Data management and transfer in high performance computational grid environments. In Parallel Computing: Advances and Current Issues, 2001.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In 19th ACM Symposium on Operating Systems Principles (SOSP), October 2003.
[11] Red Hat Inc. Global file system. http://www.redhat.com/software/rha/gfs/.
[12] J. Hoffman (a.k.a. TheShad0w). BitTornado. http://www.bittornado.com/.
[13] R. Jain, D. Chiu, and W. Hawe. A quantitative measure of fairness and discrimination for resource allocation in shared systems. Technical Report TR-301, Digital Equipment Corporation, Littleton, MA, 1984.
[14] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High bandwidth data dissemination using an overlay mesh. In Proceedings of the 19th ACM Symposium on Operating System Principles (SOSP 2003), volume 37, pages 282-297, October 2003.
[15] J.-Y. Lee and A. Sussman. High performance communication between parallel programs. In Proceedings of the 2005 Joint Workshop on High-Performance Grid Computing and High-Level Parallel Programming Models (HIPS-HPGC 2005). IEEE Computer Society Press, April 2005. Appears with the Proceedings of IPDPS 2005.
[16] S.-Y. Li, R. W. Yeung, and N. Cai. Linear network coding. IEEE Transactions on Information Theory, 49(2):371-381, February 2003.
[17] A. Parker. P2P in 2005. Technical report, Cache Logic, 2005. http://www.cachelogic.com/research/p2p2005.php.
[18] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and A. Venkataramani. Do incentives build robustness in BitTorrent? In USENIX Symposium on Networked Systems Design and Implementation (NSDI 2007), April 2007.
[19] L. Rizzo. Dummynet: A simple approach to the evaluation of network protocols. ACM Computer Communication Review, 27(1):31-41, 1997.
[20] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. Lecture Notes in Computer Science, 2218:329-, 2001.
[21] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of ACM SIGCOMM 2001, pages 149-160, Aug 2001.
[22] Various. ns, the network simulator, version 2.1b7. http://www.isi.edu/nsnam/ns/.
[23] E. Weigle and A. A. Chien. The composite endpoint protocol (CEP): Scalable endpoints for terabit flows. In Proceedings of the IEEE Conference on Cluster Computing and the Grid (CCGRID), May 2005.
[24] Wikipedia. Comparison of BitTorrent software. http://en.wikipedia.org/wiki/Comparison_of_BitTorrent_software.
[25] J. S. Wu and A. Sussman. Flexible control of data transfers between parallel programs. In Proceedings of the Fifth International Workshop on Grid Computing (GRID 2004), pages 226-234. IEEE Computer Society Press, November 2004.

APPENDIX

A. PERFORMANCE OF PYTHON

Python, as a partially interpreted language, may seem inappropriate for high-performance tasks. However, our results in the main text have shown that high transfer speeds can be attained: up to 1.7Gbps, comparable to native C code.

For networking, the primary problem with Python is the use of the `string' type for buffering data. Python strings are an immutable type, so common network tasks such as adding headers or reassembling data from multiple packets are potentially expensive. Figure 14 shows the bandwidth for string concatenation using an algorithm which iteratively joins buffers one by one and a "one-shot" algorithm which collects buffers into an array and joins them with one `join' call. We see that for older versions of Python this processing can be a significant but avoidable performance bottleneck.

Figure 14: Performance of Python Buffer Concatenation (bandwidth in Mbps vs. number of 8KB buffers joined, for iterative and one-shot joins under Python 2.3.4 and 2.4.4)
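The two strategies compared in Figure 14 can be re-created with a few lines of modern Python (our re-creation; absolute numbers will differ from the Python 2.3/2.4 results shown in the figure):

import time

def iterative_join(buffers):
    data = b""
    for buf in buffers:          # repeated reallocation and copying
        data += buf
    return data

def one_shot_join(buffers):
    return b"".join(buffers)     # single allocation

buffers = [b"x" * 8192 for _ in range(250)]   # 250 x 8KB buffers, as in Figure 14
for name, fn in [("iterative", iterative_join), ("one-shot", one_shot_join)]:
    start = time.perf_counter()
    for _ in range(100):
        fn(buffers)
    elapsed = time.perf_counter() - start
    mbps = (len(buffers) * 8192 * 100 * 8) / elapsed / 1e6
    print(name, round(mbps), "Mbit/s")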
B. ERASURE CODES AND SHARING

Erasure and network coding systems are popular research areas which address issues similar to this work. Erasure coding combines blocks to create redundant data, which then provides additional distribution flexibility. Consider Figure 15: the ⊕ symbols refer to erasure coding combining blocks (e.g. an XOR), while a bare block symbol represents just the block itself. Comparison between this structure and Figure 1 reveals that erasure codes add yet another level of generality to the distribution problem: nodes may provide/desire not just blocks, but combinations or functions of blocks.

Figure 15: Erasure Coding with Low Sharing

Erasure coding can dramatically improve performance: linear codes can achieve the maximal transfer rate for whole-file multicast [5, 16], but they are not a panacea. First, the scheme only works for whole-file transmissions. Second, networks exist where it is impossible to achieve the network capacity by network coding [7] (unsurprisingly, their main example is one with partial file sharing). Third, max-flow alone does not guarantee globally minimal termination time; other constraints are necessary, as we discussed in Section 3.1.

The problem with erasure coding for partial content distribution is that it combines bits from disparate locations in a file. With low sharing these are probably not all bits a given node wants, and it will have to download excess blocks to decode its data. This overwhelms any performance gains from the coding. For example: client 1 wants block 3 in the figure above. To decode it, it must download block 3⊕5, and then solve for block 5. That entails downloading block 5⊕7 and solving for block 7. This chain continues until it downloads (at a minimum) 7 blocks: it needs all but 7⊕8 to decode block 3: (3⊕5) ⊕ (5⊕7) ⊕ (7⊕6) ⊕ (6⊕4) ⊕ (4⊕2) ⊕ (2⊕1) ⊕ (1) = 3. This is unacceptable.

The problem is the highlighted "cross-constraint dependency": an encoding for a given block (3) that falls outside the set of blocks a node desires (with 5). These are easy to avoid if client sharing is known: (1) split the file into chunks such that if a client wants any part of a chunk, it wants all of it, and (2) erasure code only over those chunks. Step 1 can be done in time O(n log n) with the right data structures, with n being the number of unique ranges desired by peers. In the above example, simply switching the edges labelled `B' and `C' removes the cross-constraint dependency chain, and neither node has to download extra data. After removing cross-constraint dependencies, each chunk has total sharing among the nodes that desire it, reducing the problem to the already solved whole-file transfer problem (provided there are few chunks, and they are large).
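Step 1 of that fix amounts to cutting the file at every request boundary. A sketch (ours) over half-open byte ranges:

def chunk_boundaries(requests):
    """Split a file so that no request partially covers a chunk.

    requests: iterable of half-open (start, end) ranges desired by peers.
    Returns the sorted chunk edges; erasure coding within each resulting
    chunk can then never create a cross-constraint dependency.
    """
    edges = set()
    for start, end in requests:
        edges.add(start)
        edges.add(end)
    return sorted(edges)           # O(n log n) in the number of unique ranges

# Two overlapping requests produce three chunks: [0,50), [50,80), [80,120).
print(chunk_boundaries([(0, 80), (50, 120)]))   # [0, 50, 80, 120]

Erasure coding is then applied independently within each chunk, so no coded block ever mixes data from inside and outside a given request.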