SIGIR 2007 Proceedings Poster

An Analysis of Peer-to-Peer File-Sharing System Queries

Linh Thai Nguyen, Dongmei Jia, Wai Gen Yee, Ophir Frieder
Illinois Institute of Technology
Chicago, IL 60616, USA
{linhnt, jia, yee, ophir}@ir.iit.edu

ABSTRACT
Many studies focus on the Web, yet few focus on peer-to-peer file-sharing system queries, despite their massive scale in terms of Internet traffic. We analyzed several million queries collected on the Gnutella network and contrasted our findings with those reported for Web queries.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services.

General Terms: Measurement, Experimentation, Verification.

Keywords
Query log analysis, Information retrieval, Peer-to-peer.

Copyright is held by the authors/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION
Knowledge of user search patterns on a search system can be used to improve search performance. As such, many studies exist on user query logs, most of which are for the Web (e.g., [1][2]). Some recent simulations of peer-to-peer (P2P) systems, such as [3], assume that P2P queries are similar to Web queries. This is not necessarily the case. We analyze a multi-million-query P2P log and highlight the differences between it and Web query logs.

2. QUERY LOG ANALYSIS
To collect a representative query log, we used a Gnutella network crawling tool [4], which mimics a peer that can satisfy all user queries. Consequently, all "nearby" queries are routed to our peer. Our query log was collected during a month-long span starting on September 14, 2006 and includes query terms, desired file type, and timestamp. Queries were preprocessed by ignoring case differences and replacing punctuation with white space.

This and other query logs are published on our Web site (http://ir.iit.edu/~waigen/proj/pirs/). As they are collected from a public network, we encourage their use for research purposes. The tools used to collect and analyze these logs are also available on our Web site.

2.1 Overall Statistics
There are more than 23 million queries in our data set, of which 47% are distinct. We treat queries with an identical set of terms as the same query (i.e., not distinct), regardless of term order and term frequency. The total number of unique query terms is about two million. The average query length is 3.57 terms, which is about one term longer than Web queries as popularly reported (e.g., [1][5]). The statistics are summarized in Table 1.

Table 1. General Query Statistics.

File type        Total Queries   Unique Queries   Unique Terms   Avg. Len.
No constraints      15,552,645        7,811,380      1,565,763        3.86
Audio                6,000,273        3,295,648        910,950        3.18
Video                1,439,272          598,578        235,843        2.52
Image                  180,547          100,992         69,206        1.99
Document                49,196           33,976         31,248        2.22
Application            141,227           95,899         50,281        2.49
Total               23,363,160       10,762,716      2,091,464        3.57

Sixty-six percent of queries do not specify a desired file type. Of the queries that do, more than 75% are for audio soundtracks, 20% are for videos, and the other 5% are for images, programs, and documents. This differs from Web queries, where 80% of the multimedia searches on AltaVista are for images, 15% are for videos, and only 5% are for audio soundtracks [2]. The reason for this difference might be that many of the freely downloadable, possibly illegal, audio and video files shared in P2P networks are not available on the Web due to copyright issues.

The P2P query length distribution is shown in Figure 1, together with that of Web queries as reported in [5]. About 80% of the P2P queries contain from 2 to 5 terms, while only 8% are single-term. In contrast, the authors of [5] report that 76% of Web queries contain from 1 to 3 terms and 28% are single-term. That queries in P2P networks are longer than those on the Web may support the assumption that searches in P2P file-sharing systems are for known items and are therefore more precise. In many cases, the song title and the singer's name are used as the query for a particular song. One such query in our query log is "Celine Dion I am alive." More evidence of this behavior comes from the fact that audio queries are longer than those constrained to any other file type (Table 1).

Figure 1. Query Length Distribution. [Percentage of queries (0% to 40%) versus query length (1 to 10 terms), for P2P queries and Web queries.]

Figure 2. Query Volume and Query Overlap Over a Month. [Daily query volume (x10^6) with term overlap and query overlap (0% to 60%), 9/14 to 10/12.]

Figure 3. P2P Query Volume on Weekdays and Weekend. [Percentage of query traffic (0% to 20%) versus hour of day in three-hour bins, for P2P weekday, P2P weekend, and Web.]

Table 2. Queries, Terms and Correlated Terms.
Top correlated terms: hip hop 50 cent chain hang low nip tuck anatomy greys break prison next door naar op geluk weg jan smit Harry potter credit card ay papi
Top terms: love dj remix movie sex dvd qsh nude naked serial key music
Top queries: white and nerdy smack that chicken noodle soup pthc ptsc adult qsh ptsc lsm incest fansadox osprey

2.2 Changes Over Time
In Figure 2, we illustrate how queries vary over a month. Query volume varies widely and averages about 750K per day. Query and term overlaps are measured by the Jaccard coefficient¹ between each day's query set (respectively, term set) and that of the first day. There is a 10% to 20% overlap between query sets and a 40% to more than 50% overlap between term sets. Term overlap is naturally greater than query overlap, as the set of terms is more limited than the set of possible term combinations. The downward tendency of overlap reflects the gradual change in user desires.
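The overlap measurement described above can be illustrated with a minimal sketch. It assumes the preprocessing stated in Section 2 (case folding, punctuation replaced by white space) and the set-of-terms notion of a distinct query from Section 2.1. The function names (`normalize`, `jaccard`, `daily_overlap`) and the input layout (one list of raw query strings per day) are hypothetical choices made for illustration; they are not the interface of the authors' published tools.

```python
import re


def normalize(query: str) -> frozenset:
    """Lower-case the query, replace punctuation with white space,
    and return its set of terms (term order and frequency are ignored)."""
    cleaned = re.sub(r"[^\w\s]|_", " ", query.lower())
    return frozenset(cleaned.split())


def jaccard(a: set, b: set) -> float:
    """Ratio of the intersection to the union of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given zero overlap
    return len(a & b) / len(union)


def daily_overlap(days):
    """For each day, return (query overlap, term overlap) with the first day.

    `days` is a list with one entry per day; each entry is a list of raw
    query strings (a hypothetical input layout).
    """
    # A day's query set holds one frozenset of terms per distinct query.
    query_sets = [
        {terms for q in day if (terms := normalize(q))} for day in days
    ]
    # A day's term set is the union of the terms of all its queries.
    term_sets = [set().union(*qs) for qs in query_sets]
    return [
        (jaccard(qs, query_sets[0]), jaccard(ts, term_sets[0]))
        for qs, ts in zip(query_sets, term_sets)
    ]
```

With two toy days, `[["Celine Dion I am alive", "smack that"], ["celine dion - i am alive!", "white and nerdy"]]`, the second day shares one of three distinct queries and five of ten distinct terms with the first. This reproduces in miniature why term overlap tends to exceed query overlap.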
In Figure 3, we show how the percentage of query volume changes throughout an average day. We report the average percentage volume of weekends and weekdays over two weeks starting from September 14. For comparison, Figure 3 also shows the results for Web queries reported in [1]. On weekdays, P2P query traffic decreases from midnight to 9am and starts increasing after 6pm. On weekends, overall P2P query traffic volume is high during the day and only decreases after midnight. These patterns differ from those of Web analyses [1] and suggest that much of the P2P usage is for recreational purposes.

2.3 Most Frequent Queries and Terms
For each type of query, we record the most frequent queries and query terms, and the most correlated terms (identified using a chi-square test at a 95% confidence level). Due to space limits, Table 2 reports only the top three of each for four query types: audio, video, image, and document. In Table 2, one sees that most video- and image-constrained queries are pornography-related. Also shown is that some document-constrained queries are likely related to illegal activities.

3. CONCLUSIONS
This study examines a large set of queries collected from the Gnutella network, with the goal of revealing the nature of the queries and the users' searching behaviors. The statistical results show that P2P queries are longer than Web queries, and that most searches are for music and movies. We also found that the total query traffic varies in magnitude over a month, but the level of term and query overlap is relatively stable. In addition to user queries, we have used our tool to collect information on several million files shared in the Gnutella network. Due to space limits, we omit our analyses of this data set.

Our findings suggest that using Web query logs as a test set to evaluate P2P search systems may be misleading. In addition, since P2P queries change slowly over time, result caching is beneficial. Further, since different query types have very different sets of frequent terms and correlated terms, these sets can be used to classify non-constrained P2P queries (which comprise over 66% of all queries) to improve search accuracy.

Other analyses can be performed on our data. One is to identify and analyze the differences between how users annotate files and how they compose queries. This knowledge would be helpful in applying IR techniques in the P2P search domain.

4. REFERENCES
[1] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly Analysis of a Very Large Topically Categorized Web Query Log. ACM SIGIR'04, July 2004.
[2] B. J. Jansen, A. Spink, and J. Pedersen. An Analysis of Multimedia Searching on AltaVista. ACM MIR'03, Nov. 2003.
[3] J. Lu and J. Callan. Content-based retrieval in hybrid peer-to-peer networks. ACM CIKM'03, Nov. 2003.
[4] L. T. Nguyen, W. G. Yee, D. Jia, and O. Frieder. A Tool for Information Retrieval Research in Peer-to-Peer File-Sharing Systems. IEEE ICDE'07, Apr. 2007.
[5] P. Reynolds and A. Vahdat. Efficient peer-to-peer keyword searching. ACM Conf. Middleware, 2003.

¹ The ratio of the intersection to the union of the two sets.