Term Proximity Scoring for Ad-Hoc Retrieval on Very Large Text Collections

Stefan Büttcher, Charles L. A. Clarke, Brad Lushman
School of Computer Science, University of Waterloo, Canada

ABSTRACT

We propose an integration of term proximity scoring into Okapi BM25. The relative retrieval effectiveness of our retrieval method, compared to pure BM25, varies from collection to collection. We present an experimental evaluation of our method and show that the gains achieved over BM25 grow as the size of the underlying text collection increases. We also show that for stemmed queries the impact of term proximity scoring is larger than for unstemmed queries.

Categories and Subject Descriptors
H.2.4 [Systems]: Textual databases; H.3.4 [Systems and Software]: Performance evaluation

General Terms
Experimentation, Performance

Keywords
Information Retrieval, Query Processing, Term Proximity

1. INTRODUCTION

Document retrieval functions based on the vector space model, such as Okapi BM25 [2, 3], have been shown to be highly effective in ad-hoc information retrieval tasks. One of their shortcomings is that their bag-of-words approach does not take the proximity of query terms within a document into account and consequently gives the same score to a document regardless of whether the query terms appear close to each other within that document or far apart. This contradicts the intuitive understanding that, in a relevant document, query terms appear relatively close to each other, not in completely unrelated parts of the document.

Rasolofo and Savoy [1] were able to show that integrating term proximity into existing vector space retrieval methods can improve the quality of the search results significantly. While trying to reproduce their results on different text collections and to find new ways of integrating term proximity into vector-space-based retrieval functions, we found that term proximity improves the quality of the search results only on some text collections, while on others it leaves the search system's retrieval effectiveness unaffected or even causes a slight deterioration.

Two of the text collections we used in our experiments were the TREC45-CR collection (TREC disks 4&5, without the Congressional Record), consisting of 528,000 documents with an average length of 561 tokens, and the GOV2 collection used in the TREC Terabyte track, consisting of 25.2 million documents with an average length of 1,721 tokens. We found that, while the use of term proximity improved retrieval effectiveness significantly for GOV2 (paired t-test: p < 0.02 for P@10; p < 0.01 for P@20), the smaller TREC45-CR collection remained unaffected by our efforts and did not yield more relevant documents under our term-proximity-based retrieval method than under plain Okapi BM25 (details in Table 1).

                               TREC45-CR     GOV2
  Collection size (#docs)      528,155       25,205,204
  Avg. doc. length (terms)     561           1,721
  P@10 (BM25/BM25TP)           0.382/0.381   0.529/0.600
  P@20 (BM25/BM25TP)           0.308/0.309   0.494/0.561

Table 1: Collection characteristics and retrieval effectiveness for TREC45-CR and GOV2. Effectiveness for BM25 and BM25TP (BM25 + proximity) was evaluated using 100 topics from the TREC 2003 Robust track (TREC45-CR) and 50 topics from the TREC 2004 Terabyte track (GOV2), respectively.

Based on the characteristics of the two collections, this led us to two hypotheses:

1. Term proximity is more important when the search engine is dealing with longer documents.
2. Term proximity becomes more important as the size of the text collection increases.

We performed additional experiments with the goal of validating (or refuting) these two theories. We present an experimental evaluation that supports the second hypothesis. For the first hypothesis, no such support could be found. We also show that term proximity is more important for stemmed queries than for unstemmed queries, an aspect that we had ignored in our initial experiments.
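Precision at rank k (P@10, P@20), the effectiveness measure reported in Table 1, can be sketched as follows. This is an illustrative implementation; the function name and the example documents are our own, not part of the paper.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k

# Hypothetical ranking of 10 documents, 4 of which are relevant.
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d2", "d4"}
print(precision_at_k(ranking, relevant, 10))  # -> 0.4
```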
Copyright is held by the author/owner. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

2. COMBINING PROXIMITY AND BM25

Given a query containing the terms T_1, T_2, ..., T_n, the BM25 relevance score of a document D is

  Score_BM25(D) = sum_{i=1..n} w_{T_i} * (f_{D,T_i} * (k1 + 1)) / (f_{D,T_i} + K),

  K = k1 * (1 - b + b * |D| / avgdl),

where f_{D,T_i} is the number of occurrences of the term T_i within D, |D| is the length of D (number of terms), avgdl is the average document length in the collection, and w_{T_i} is T_i's IDF weight: w_{T_i} = log(N / N_{T_i}), where N is the total number of documents in the text collection and N_{T_i} is the number of documents containing T_i.

[Figure 1: Precision after 10 and after 20 documents, for plain Okapi BM25 and the proximity-enhanced BM25TP, with stemming either turned on or off. Four panels: (a) P@10 (absolute), (b) P@10 (relative, BM25TP vs. BM25), (c) P@20 (absolute), (d) P@20 (relative); x-axis: relative corpus size, 10%-90%.]

Our integration of term proximity into the BM25 scoring function is very similar to that presented by Rasolofo and Savoy [1].
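As a concrete illustration, the BM25 formula above can be sketched in a few lines of code. The parameter defaults (k1 = 1.2, b = 0.75) and the example statistics are common choices assumed for illustration, not values stated in this paper.

```python
import math

def bm25_score(tf_by_term, doc_len, avgdl, N, df_by_term, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query.

    tf_by_term: occurrences f_{D,T_i} of each query term in the document
    df_by_term: number of documents N_{T_i} containing each term
    """
    K = k1 * (1 - b + b * doc_len / avgdl)
    score = 0.0
    for term, tf in tf_by_term.items():
        idf = math.log(N / df_by_term[term])  # w_{T_i} = log(N / N_{T_i})
        score += idf * (tf * (k1 + 1)) / (tf + K)
    return score

# Hypothetical statistics for a two-term query against a GOV2-sized collection.
score = bm25_score({"term1": 3, "term2": 1}, doc_len=1721, avgdl=1721,
                   N=25_205_204, df_by_term={"term1": 50_000, "term2": 1_000_000})
```

Note the saturation behaviour: because of the (tf + K) denominator, each additional occurrence of a term contributes less than the previous one.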
For the presentation in this paper, we chose to use our own method instead of theirs, since it produced slightly better results in almost all of our experiments.

Suppose a user submits the query Q = {T_1, ..., T_n}. Our implementation of BM25 fetches the posting lists for all query terms from the index and arranges them in a priority queue. It then starts consuming postings from all posting lists, one posting at a time, in ascending order, to find matching documents and simultaneously compute the relevance scores of all matching documents found (document-at-a-time approach). If an index with full positional information is used, term proximity can be integrated into this process without much effort. With every query term, we associate an accumulator that contains that term's proximity score within the current document. Whenever the search system encounters a posting that belongs to the query term T_j, it looks at the previous posting, belonging to the query term T_k, and determines the distance (number of postings) between the current posting and the previous one. If T_j != T_k, then both terms' accumulators are incremented:

  acc(T_j) := acc(T_j) + w_{T_k} * dist(T_j, T_k)^-2,
  acc(T_k) := acc(T_k) + w_{T_j} * dist(T_j, T_k)^-2,

where w_{T_i} is T_i's IDF weight (cf. the BM25 equation above). For T_j = T_k, the accumulators remain unchanged. When the end of the current document is reached, the document's score is computed, and all proximity accumulators are reset to zero. The score of a document D is

  Score_BM25TP(D) = Score_BM25(D) + sum_{T in Q} min{1, w_T} * (acc(T) * (k1 + 1)) / (acc(T) + K),

where k1 and K are the same as in the original Okapi equation. The difference from the strategy followed by Rasolofo and Savoy is that in our approach only neighboring query terms can affect each other's accumulators, and that the impact of term proximity on the document score is smaller because the term weight w_T is limited to 1.

3. EXPERIMENTAL RESULTS

For our experiments, we split the GOV2 collection into 100 random chunks, containing 252,000 documents each. From these chunks, we built subcollections containing 10%, 20%, ..., 90% of the documents in the whole collection. For each size, we constructed 20 such subcollections by randomly picking an appropriate number of chunks and combining them. We then fed 100 queries from the 2004 and 2005 TREC Terabyte ad-hoc retrieval tasks into our system and executed them using different system configurations (BM25 with or without term proximity; stemming enabled or disabled). Note that this is different from our initial experiments, where we used only the 50 topics from 2004. The results depicted in Figure 1 represent the mean precision values over all subcollections of the respective size.

The figure shows that the relative gain achieved by BM25TP, compared to the original BM25, increases as the underlying text collection grows. This observation holds for both P@10 and P@20. It also shows that the relative gain is greater for stemmed queries than for unstemmed queries. We surmise that this is because term proximity helps distinguish between stem-equivalent terms that represent the same semantic concept and stem-equivalent terms that do not.

We also performed experiments with subcollections of GOV2 containing documents of different average length. However, the results obtained did not indicate any correlation between average document length and the effectiveness of term proximity scoring.

Our explanation of the fact that term proximity is more important for bigger text collections is that, for larger collections, the likelihood of finding non-relevant documents that contain the query terms merely by chance is greater than for smaller collections. Term proximity, as an additional feature, helps distinguish between these documents and documents that are actually relevant. An important implication of this finding is that, although using a document-level index instead of a positional index can greatly reduce both time and space requirements when dealing with very large text collections, it is advisable to keep full positional information, as this can significantly increase the system's retrieval effectiveness.

4. REFERENCES

[1] Y. Rasolofo and J. Savoy. Term Proximity Scoring for Keyword-Based Retrieval Systems. In Proceedings of the 25th European Conference on IR Research (ECIR 2003), pages 207-218, April 2003.
[2] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference, Gaithersburg, USA, November 1998.
[3] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference, Gaithersburg, USA, November 1994.
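The proximity accumulation of Section 2 can be sketched as follows. This is a simplified, single-document illustration under assumed parameter values (k1 and K fixed at 1.2; real K depends on document length), not the authors' actual document-at-a-time implementation.

```python
def bm25tp_proximity_part(occurrences, idf, k1=1.2, K=1.2):
    """Additive proximity component of BM25TP for one document.

    occurrences: sorted list of (position, term) pairs for all query-term
                 occurrences in the document
    idf: IDF weight w_T for each query term
    """
    acc = {t: 0.0 for t in idf}  # one proximity accumulator per query term
    for (pos_prev, t_prev), (pos_cur, t_cur) in zip(occurrences, occurrences[1:]):
        if t_prev == t_cur:
            continue  # equal neighbouring terms leave the accumulators unchanged
        d = pos_cur - pos_prev  # distance between neighbouring postings
        acc[t_cur] += idf[t_prev] * d ** -2
        acc[t_prev] += idf[t_cur] * d ** -2
    # sum over query terms of min(1, w_T) * acc(T) * (k1 + 1) / (acc(T) + K)
    return sum(min(1.0, idf[t]) * a * (k1 + 1) / (a + K)
               for t, a in acc.items())

# Hypothetical document where two query terms appear adjacently once.
occurrences = [(4, "information"), (5, "retrieval"), (40, "information")]
weights = {"information": 3.0, "retrieval": 4.0}
proximity_bonus = bm25tp_proximity_part(occurrences, weights)
```

A document whose query terms are adjacent receives a larger bonus than one where they are far apart, since each pair contributes proportionally to the inverse square of its distance.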