PeerStripe: A P2P-Based Large-File Storage for Desktop Grids

Chreston Miller, Patrick Butler, Ankur Shah, and Ali R. Butt
Dept. of Computer Science, Virginia Tech, Blacksburg, VA 24061
{chmille3, pabutler, ankur77, butta}@cs.vt.edu

ABSTRACT
In desktop grids, the use of off-the-shelf shared components makes dedicated resources economically nonviable and complicates the design of the efficient storage systems needed to meet the exponentially growing storage demands of modern applications that run on these platforms. To address this challenge, we present PeerStripe, a storage system that transparently distributes files to storage space contributed by participants that have joined a peer-to-peer (p2p) network. PeerStripe uses structured p2p routing to yield a scalable, robust, reliable, and self-organizing storage system. The novelty of PeerStripe lies in its use of striping and error-coding techniques in a heterogeneous distributed environment to store very large data files. Our evaluation of PeerStripe shows that it achieves acceptable performance for applications in desktop grids.

Categories and Subject Descriptors: D.4.7 [Organization and Design]: Distributed systems
General Terms: Design, Reliability, Experimentation.
Keywords: peer-to-peer, storage, striping, error-coding, desktop grids.

Copyright is held by the author/owner(s).
HPDC'07, June 25-29, 2007, Monterey, California, USA.
ACM 978-1-59593-673-8/07/0006.

1. INTRODUCTION
In this paper, we propose PeerStripe, a p2p storage system that provides an economical and efficient storage solution for large data files. PeerStripe has a simple, elegant design that stores files on participating nodes that have joined a p2p overlay network. Our use of p2p networks ensures that PeerStripe offers scalability, self-organization, reliability, and composability for target environments of various sizes. Inspired by the local-area technique of RAID [3], a unique feature of PeerStripe is that instead of storing entire files on individual nodes, it splits files into variable-size chunks and stores these chunks separately on nodes distributed across a wide-area network. As a result, unlike previously proposed approaches such as PAST [6], the size of a file that can be stored in PeerStripe is not limited by the capacity of any individual node. Moreover, to provide fault tolerance against data loss when a chunk of a distributed file is lost, PeerStripe employs error coding at the granularity of chunks.

2. DESIGN
PeerStripe's design comprises four main elements. First, PeerStripe uses the communication substrate provided by Pastry [5] to arrange the participating nodes in a p2p overlay network and enable storage sharing.

Second, to store a file, PeerStripe splits it into chunks and stores the chunks in the storage pool. This enables PeerStripe both to store very large data files and to retrieve only the needed portions of a file rather than the entire file.

Third, PeerStripe employs erasure codes [2, 4] to provide fault tolerance against the loss of chunks. Each chunk is named filename_ChunkNo; e.g., testImageFile_2 denotes the second chunk of the file testImageFile. This naming convention makes it easy to determine which file a chunk belongs to. Each chunk is then encoded using erasure codes: the chunk is passed to an error-coding algorithm that divides it into n equal-size blocks, calculates erasure codes across the blocks, and generates m encoded blocks. The encoded blocks for chunk X are named filename_X_ECB, where ECB is the error-coded block number and ranges from 1 to m. The error-coded blocks are then stored in the p2p storage system using techniques similar to those used in PAST [6]. Due to the built-in redundancy of erasure codes, PeerStripe can retrieve the original chunk even if some of the m encoded blocks are lost to failures.

Fourth, to determine the size of a chunk, PeerStripe uses the chunk naming convention and the parameters of the erasure code in use to identify the nodes on which the encoded blocks will be stored, and queries those nodes with a getCapacity message. The response to a getCapacity message is the maximum amount of data the remote node can store; PeerStripe uses this information to set the size of the first chunk. The process repeats until all chunks of the file have been stored.

In summary, to store a file, it is first split into chunks; each chunk is then divided into n blocks and error coded to give m encoded blocks, which are stored in the PeerStripe shared storage pool as shown in Figure 1.

[Figure 1: The various steps of storing a file in PeerStripe.]

Similarly, to retrieve any portion of a file, PeerStripe first determines the number of the chunk to retrieve and the names of the required encoded blocks using the naming convention. Next, enough blocks per chunk are retrieved to allow decoding of the chunk. The decoded chunks are then assembled and returned to the user.
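To make the naming convention and store pipeline concrete, the following Java sketch mirrors the steps described above: naming a chunk, erasure-coding it into m blocks, and storing the named blocks in the shared pool. The ErasureCoder and StoragePool interfaces are hypothetical placeholders for the actual erasure-coding and Pastry/PAST-based storage layers; this is a minimal illustration of the data flow, not the PeerStripe implementation.

    /**
     * A minimal sketch of the PeerStripe store pipeline. The two
     * interfaces below are hypothetical stand-ins, not the real APIs.
     */
    public class StoreSketch {

        /** Chunks are named filename_ChunkNo, e.g., testImageFile_2. */
        static String chunkName(String fileName, int chunkNo) {
            return fileName + "_" + chunkNo;
        }

        /** Encoded blocks are named filename_X_ECB, where ECB is 1..m. */
        static String blockName(String fileName, int chunkNo, int ecb) {
            return chunkName(fileName, chunkNo) + "_" + ecb;
        }

        /**
         * Store one chunk: divide it into n equal-size blocks, erasure-code
         * them into m encoded blocks, and store each block in the p2p pool.
         */
        static void storeChunk(String fileName, int chunkNo, byte[] chunk,
                               int n, int m,
                               ErasureCoder coder, StoragePool pool) {
            byte[][] encoded = coder.encode(chunk, n, m); // n blocks -> m encoded blocks
            for (int ecb = 1; ecb <= m; ecb++) {
                pool.put(blockName(fileName, chunkNo, ecb), encoded[ecb - 1]);
            }
        }

        /** Hypothetical erasure coder (e.g., an XOR or online code). */
        interface ErasureCoder {
            byte[][] encode(byte[] chunk, int n, int m);
        }

        /** Hypothetical PAST-like put interface to the shared pool. */
        interface StoragePool {
            void put(String key, byte[] value);
        }
    }

Retrieval inverts this flow: the block names for a chunk are reconstructed from the naming convention, and any sufficient subset of the m encoded blocks is fetched and decoded.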
3. EVALUATION
We implemented PeerStripe in about 6000 lines of Java code using a freely available version of Pastry [5]. For these tests, we simulated a 500-node directly connected PeerStripe network using Pastry's built-in simulator. We also set up PAST [6] to run in our simulator environment, and we compared PeerStripe against a simpler scheme (PeerStripe.0) that uses fixed-size chunks of 4 MB. Based on a recent study of the available space on typical desktop machines [1], each simulated node was assigned a capacity drawn from a normal distribution with a mean of 8 GB and a standard deviation of 2 GB, for a total capacity of 3915 GB. We drove our simulations with a file system trace collected from our departmental machines, which contains information for about 60k files with sizes ranging from 4 MB to 15.23 GB, with a mean of 39.4 MB and a standard deviation of 247.4 MB. The total storage required to store all the files in the trace is 2311 GB.
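As an illustration of this setup, the sketch below draws per-node capacities from the stated normal distribution (mean 8 GB, standard deviation 2 GB) for 500 nodes. The fixed seed and the clamping of negative draws are our assumptions; the actual simulator code may differ.

    import java.util.Random;

    /** A sketch of the simulated capacity assignment, not the real simulator. */
    public class CapacitySketch {
        public static void main(String[] args) {
            final int NODES = 500;
            final double MEAN_GB = 8.0, STDDEV_GB = 2.0;
            Random rng = new Random(42);      // assumed seed, for repeatability

            double totalGb = 0;
            double[] capacityGb = new double[NODES];
            for (int i = 0; i < NODES; i++) {
                // Draw from N(8, 2^2); clamp at zero to avoid negative capacities.
                capacityGb[i] = Math.max(0, MEAN_GB + STDDEV_GB * rng.nextGaussian());
                totalGb += capacityGb[i];
            }
            // With these parameters the total is close to 500 * 8 GB = 4000 GB,
            // consistent with the 3915 GB reported for the experiments.
            System.out.printf("Total simulated capacity: %.0f GB%n", totalGb);
        }
    }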
In the first set of experiments, we measured the number of successful file stores as files from the trace were inserted into the system. Figure 2 shows the results for PAST, PeerStripe.0, and PeerStripe.

[Figure 2: Total number of failed file stores as files are inserted.]

We observed that as system utilization increases, the number of failed stores in PAST grows, and PAST ultimately fails to store 14.8% of the files. PeerStripe.0 performs slightly better, but still fails to store 12.4% of the files. PeerStripe remedies the shortcomings of both PAST and PeerStripe.0 and results in only 4.1% failures, an improvement by a factor of 3.6 and 3.0 over PAST and PeerStripe.0, respectively. In terms of size, PAST and PeerStripe.0 are unable to store 24.7% and 19.8% of the total data, respectively, whereas PeerStripe fails to store only 12.2%, an improvement by a factor of 2.0 and 1.6 over PAST and PeerStripe.0, respectively.

In the next experiment, we determined the number and size of chunks created under PeerStripe.0 and PeerStripe. On average, PeerStripe.0 created 2.6 times as many chunks as PeerStripe, and PeerStripe's chunks were a factor of 5 larger. The reduced number of chunks enables PeerStripe to avoid an unnecessarily large number of slow p2p look-ups, and shows the advantage of variable-size chunks over fixed-size chunks.

Next, we evaluated the effectiveness of error coding in PeerStripe by distributing the files to the nodes and counting the total number of available files in the system as 50 randomly chosen nodes fail one by one. For this experiment we used a (2,3) XOR code as well as an online code that can tolerate two simultaneous failures per chunk. We counted a file as available only if all of its chunks could be retrieved. Figure 3 shows the percentage of files that became unavailable as nodes failed under the three cases.

[Figure 3: The number of unavailable files as nodes fail.]

With 50 failed nodes, error coding resulted in 58.9% fewer unavailable files for the XOR code and 95.5% fewer for the online code. The overall number of failures for the online code was negligible (1.3%), and almost zero for up to 25 failed nodes. Hence, error coding is an effective means of ensuring fault tolerance in PeerStripe.
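For reference, the principle behind the (2,3) XOR code used above can be illustrated in a few lines of Java: two data blocks are stored together with their bitwise XOR (m = 3 stored blocks), so any one lost block is recoverable from the other two. This sketches the coding principle only; block framing and sizes in PeerStripe differ.

    /** Illustration of a (2,3) XOR parity code; not PeerStripe code. */
    public class XorCodeSketch {

        /** parity[i] = a[i] ^ b[i]; blocks must be of equal length. */
        static byte[] parity(byte[] a, byte[] b) {
            byte[] p = new byte[a.length];
            for (int i = 0; i < a.length; i++) p[i] = (byte) (a[i] ^ b[i]);
            return p;
        }

        public static void main(String[] args) {
            byte[] blockA = {1, 2, 3, 4};
            byte[] blockB = {9, 8, 7, 6};
            byte[] blockP = parity(blockA, blockB);  // stored alongside A and B

            // Suppose blockA is lost to a node failure: B ^ P recovers it,
            // because (A ^ B) ^ B == A.
            byte[] recoveredA = parity(blockB, blockP);
            System.out.println(java.util.Arrays.equals(blockA, recoveredA)); // true
        }
    }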
In the final experiment, we determined the effect of participant churn on PeerStripe by studying the amount of data regenerated from other replicas and error-coded chunks as nodes leave the system due to failure. We failed up to 20% of the participating nodes without any node recovery. On average, 3.43 GB of data was regenerated per failure, for a total of 343 GB regenerated and only 9.17 MB of data lost in this extreme case. Compared to the total data size of 2311 GB, the data recreated per failure is quite small, i.e., 0.15%. This shows that PeerStripe handles participant churn well.

4. CONCLUSION
In this paper, we have presented the design and evaluation of PeerStripe. PeerStripe uses p2p overlay networks to establish a robust, scalable, and reliable distributed storage system. It employs striping and error coding to support transparent storage of very large data files across distributed nodes, and exports a simple yet effective interface to users and applications. Our evaluation shows that the system can store files larger than the capacity of individual participants, and gives acceptable performance in a dynamic setting. The efficient and simple design of our approach means that it can be readily deployed and interfaced with different applications, and can therefore serve as a storage system for today's desktop grid environments.

5. REFERENCES
[1] A. R. Butt, T. A. Johnson, Y. Zheng, and Y. C. Hu. Kosha: A Peer-to-Peer Enhancement for the Network File System. Journal of Grid Computing, 4(3):323-341, 2006.
[2] P. Maymounkov. Online Codes. Technical Report TR2003-883, New York University, Nov. 2002.
[3] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. SIGMOD, 1988.
[4] J. S. Plank. Erasure codes for storage applications. Tutorial at USENIX FAST-2005, 2005.
[5] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. IFIP/ACM Middleware, 2001.
[6] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proc. SOSP, 2001.