SIGIR 2007 Proceedings Doctoral Consortium

Global Resources for Peer-to-Peer Text Retrieval

Hans Friedrich Witschel
witschel@informatik.uni-leipzig.de
University of Leipzig
P.O. Box 100920
D-04009 Leipzig

ABSTRACT
The thesis presented in this paper tackles selected issues in unstructured peer-to-peer information retrieval (P2PIR) systems, using world knowledge for solving P2PIR problems. A first part uses so-called reference corpora for estimating global term weights such as IDF instead of sampling them from the distributed collection. A second part of the work will be dedicated to the question of query routing in unstructured P2PIR systems using peer resource descriptions and world knowledge for query expansion.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Selection Process, C.2.4 [Distributed Systems]: Distributed Applications.
General Terms: Algorithms, Experimentation.
Keywords: Peer-to-peer information retrieval, term weighting, query routing.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION
My thesis aims at solving problems in P2PIR for both precision- and recall-oriented retrieval, the guiding question being to what extent global "world" knowledge can be applied in this context, i.e. data that is independent of the collection shared by peers in a particular P2PIR system; it is assumed that this knowledge can be gathered, once and for all, from sources such as the WWW and then used in many different P2PIR systems.

2. GLOBAL TERM WEIGHTS
In a first series of experiments [2], I investigated the question whether one can globally estimate collection or document frequencies of terms, independent of a given collection, well enough in order not to degrade retrieval performance seriously when using them for computing e.g. IDF. The results indicate that weights estimated from reference corpora slightly degrade retrieval results, but this is often not statistically significant. Weights can also be improved by mixing them with estimates derived from very small samples of the retrieval collection. Finally, a large fraction of infrequent terms can be pruned (i.e. treated as if they had not occurred) from the resulting term lists without any ill effects. All in all, this indicates that what is really needed for global term weighting is just an "extended stop word list" which can be robustly sampled from a reference corpus. Some collection or domain-specific "stop words", however, cannot be found that way and need to be sampled from the retrieval collection.

3. QUERY ROUTING
For further experiments, I propose a new P2PIR testbed which has a realistic distribution of content and queries (similar to [1]): Given that the data shared by nodes in a peer-to-peer network should reflect the interests of persons running peers, I propose to identify peers with authors of documents using a data set where authoring and citation information is available (e.g. the CiteSeer database). To generate queries, each peer will ask for keywords of papers that are referenced in its own papers. Since there are no relevance judgments for these queries, I propose to measure the performance of distributed retrieval algorithms using a new evaluation measure which, roughly, tells us how high the best k documents that the distributed search finds are ranked, on average, by a centralised search engine.

Using this experimental setup, I would like to address the question if there is a way to keep peers' resource descriptions very compact and still guarantee good recall when matching queries against compressed profiles. To this end, I will propose various query expansion strategies based on world knowledge, including the WWW, other large collections or thesauri, similar in spirit to what has been done in [3], but using the new evaluation framework and more data sources. Expansion using these global knowledge sources will then be compared to other expansion strategies such as local feedback on peers.

4. REFERENCES
[1] I. A. Klampanos, J. M. Jose, V. Poznanski, and P. Dickman. A Suite of Testbeds for the Realistic Evaluation of Peer-to-Peer Information Retrieval Systems. In ECIR 2005, pages 38-51, 2005.
[2] H.F. Witschel. Estimation of global term weights for distributed and ubiquitous IR. In Proc. of UKDU'06, 2006.
[3] H.F. Witschel and T. Böhme. Evaluating Profiling and Query Expansion Methods for P2P Information Retrieval. In Proc. of the P2PIR Workshop at CIKM, 2005.
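
As an illustration of the term-weight estimation summarised in the section on global term weights, the following is a minimal sketch of how IDF could be taken from a reference corpus, pruned of infrequent terms, and mixed with a small sample of the retrieval collection. The paper does not give the formulas used in [2]; the smoothed IDF, the linear interpolation with weight `lam`, the threshold `min_df`, the toy data and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term at most once
    return df


def prune(df, min_df):
    """Drop terms with df < min_df; they are later treated as unseen."""
    return {term: n for term, n in df.items() if n >= min_df}


def idf(term, df, n_docs):
    """Add-one smoothed IDF (assumed formula); unseen terms get the maximum."""
    return math.log((n_docs + 1) / (df.get(term, 0) + 1))


def mixed_idf(term, ref_df, ref_n, sample_df, sample_n, lam=0.8):
    """Linear interpolation of reference and sample IDF (assumed scheme)."""
    return (lam * idf(term, ref_df, ref_n)
            + (1 - lam) * idf(term, sample_df, sample_n))


# Toy reference corpus and a very small sample of the retrieval collection.
reference = [
    ["the", "peer", "network"],
    ["the", "query", "routing"],
    ["the", "term", "weight"],
    ["the", "peer", "query"],
]
sample = [["peer", "the"], ["peer", "stop"]]

ref_df = prune(document_frequencies(reference), min_df=2)
sample_df = document_frequencies(sample)

for term in ["the", "peer", "network"]:
    weight = mixed_idf(term, ref_df, len(reference), sample_df, len(sample))
    print(term, round(weight, 3))
```

In this sketch the pruned list keeps only the frequent terms, which is the sense in which it acts as an "extended stop word list": any term missing from it falls back to the maximum weight, and the sample component lowers the weight of terms that are frequent only in the retrieval collection (here `peer`).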