Boosting Static Pruning of Inverted Files

Roi Blanco (rblanco@udc.es) and Alvaro Barreiro (barreiro@udc.es)
IRLab, Computer Science Department, University of A Coruña, Spain

ABSTRACT
This paper revisits the static term-based pruning technique presented in [2] for ad-hoc retrieval, addressing different issues concerning its algorithmic design that have not yet been taken into account. Although the original technique is able to retain precision when a considerable part of the inverted file is removed, we show that it is possible to improve precision in some scenarios if some key design features are properly selected.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Performance, Experimentation

Keywords
Pruning, indexing, efficiency

1. INTRODUCTION
Static pruning of inverted files is an attractive technique to reduce storage needs and query processing times. Pruning is intended to remove the pointers (term-document pairs) that are least likely to affect query performance.

The algorithm presented in [2] yields excellent results at keeping the top-k answers for a given query and scoring function (Smart's tf-idf), making this approach suitable for high-demand environments. It involves two parameters, k and ε, which guarantee that the top k documents are scored the same (within an error of ε) whether the original or the pruned inverted file is used, for any query shorter than a certain size (which depends on ε). Hence, the rankings produced are similar. In brief, the pruned inverted file is produced by discarding, for every term, the documents that score lower than a certain threshold. The cut-off value is set to ε·z_t, where z_t is the k-th highest score for the given term t.

Results presented in [1] and [2] show that the technique is excellent at maintaining precision when a large amount of the data is removed. Our belief is that, by properly addressing some key design features, the technique can go beyond faithfully preserving the document ranking and actually increase precision at some pruning levels and under certain conditions. We used a probabilistic scoring function (BM25) instead of Smart's tf-idf, addressed some features not considered in the original model, and identified such conditions through TREC-style intensive evaluation.
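To make the pruning rule explicit, the following is a minimal Python sketch of the ε·z_t cut-off described above. It is an illustration under our own assumptions (the posting-list representation, the generic `score` callback and all names are invented for the example), not the implementation of [2] nor the one used in our experiments.

from typing import Callable, Dict, List, Tuple

Posting = Tuple[int, int]                 # (doc_id, term frequency)
InvertedFile = Dict[str, List[Posting]]

def prune_index(index: InvertedFile,
                score: Callable[[str, int, int], float],
                k: int = 10,
                epsilon: float = 0.1) -> InvertedFile:
    """Term-based static pruning: for every term, keep only the postings
    whose score exceeds epsilon * z_t, where z_t is the k-th highest
    score in that term's posting list."""
    pruned: InvertedFile = {}
    for term, postings in index.items():
        scores = sorted((score(term, doc, tf) for doc, tf in postings),
                        reverse=True)
        if len(scores) <= k:              # nothing to cut in short lists
            pruned[term] = list(postings)
            continue
        cutoff = epsilon * scores[k - 1]  # epsilon * z_t
        pruned[term] = [(doc, tf) for doc, tf in postings
                        if score(term, doc, tf) > cutoff]
    return pruned

Varying k and epsilon changes only how many pointers are removed; as discussed below, we treat them purely as knobs for the pruning percentage.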
2. EXPERIMENTS AND RESULTS
We experimented with three different document/topic sets, described in Table 1, and measured P@10 and MAP. Terms were stemmed using Porter's algorithm. We also considered three types of queries: short (title only), medium (title + description) and long (title + description + narrative).

Table 1: Collections and topics.
  Collection        Size    Topics              # queries
  TREC disks 4&5    2 GB    301-450 + 601-700   250
  WT2g              2 GB    401-450             50
  WT10g             10 GB   451-550             100

The k and ε values in the original description of the algorithm must be selected carefully, as they determine important properties regarding ranking similarity before and after pruning; in particular, ε should be low. In this paper we consider k and ε simply as parameters that determine the number of pointers removed (the percentage of pruning), without focusing on theoretical guarantees. Choosing different k values yields highly correlated results, although higher values (k = 30) are preferable for obtaining better MAP, whereas lower values (k = 10) seem more adequate if good behaviour at P@10 is desired, which is consistent with the top-k preserving construction of the algorithm. Some of the other considerations we took into account are summarised next.

Terms with document frequency df > N/2 can be discarded from the inverted file, where N is the total number of documents. This value comes naturally from BM25's idf formula, $\log \frac{N - df + 0.5}{df + 0.5}$, as those terms obtain a negative score for every document. It turned out that ruling out terms with df > N/2 is always a good option. This holds true especially for long queries, where such terms are more likely to appear.

In the original work [2] it is stated that every score must be shifted, by subtracting the global minimum score, because otherwise the pruning level obtained is negligible. As we consider ε just as a parameter, we omitted the shifting step.

Following the description of the idealised pruning algorithm in [2], document lengths (dl) should be the same in both the original and the pruned inverted file. However, it makes sense to update them so that the inverted file remains coherent. Experiments demonstrated that the best pruning/precision trade-offs are obtained when the document lengths are updated.

A last design consideration was whether or not to update the average document length (avgdl) in the pruned inverted file. Correcting this value yields a new term frequency (tf) normalisation factor. Updating avgdl performs slightly better than not doing so for long queries, whereas for short queries not updating it is preferable. Therefore, the new term frequency normalisation introduced by the algorithm seems adequate for a short-queries scenario.

To be more concrete, we present next some of the precision vs. pruning results using the standard MAP and P@10 measures. To illustrate the effects of some of these design considerations, Figure 1 shows three precision curves (baseline, c1 and c2) obtained using the WT2g collection and short queries.

[Figure 1: MAP and P@10 for short queries at different pruning levels (0-70%), for the baseline and two other settings, c1 and c2 (WT2g collection).]

Our baseline is the result of using BM25 without updating the dl's, without updating avgdl and without removing terms with df > N/2 (the setting that best retains the original ranking). The other two curves are obtained by removing those terms and updating the dl's (the best setting found empirically); c2 updates avgdl while c1 does not. With the baseline settings the method is better at maintaining the top-k results, which is easier for shorter queries, but the precision curves are decreasing. On the other hand, a different parameter selection improves significantly over the original precision values, up to a 50% pruning level for MAP and 70% for P@10. Overall, with a suitable combination of parameters, pruning is able to improve both MAP and P@10.
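As an illustration of how these design choices interact with the scoring function, the sketch below computes a BM25 term weight over a pruned index with three switches corresponding to the decisions above (discarding terms with df > N/2, updating dl, updating avgdl). It assumes the standard Okapi BM25 formulation with k1 = 1.2 and b = 0.75; the function and parameter names are ours and are not taken from the systems evaluated in this paper.

import math

def bm25_weight(tf: int, df: int, N: int, dl: float, avgdl: float,
                k1: float = 1.2, b: float = 0.75) -> float:
    """Standard BM25 weight for one (term, document) pair."""
    idf = math.log((N - df + 0.5) / (df + 0.5))   # negative once df > N/2
    nf = (1.0 - b) + b * dl / avgdl               # tf normalisation factor
    return idf * tf * (k1 + 1.0) / (tf + k1 * nf)

def pruned_bm25_weight(tf: int, df: int, N: int,
                       dl: float, removed: float,
                       avgdl_orig: float, avgdl_pruned: float,
                       drop_frequent: bool = True,
                       update_dl: bool = True,
                       update_avgdl: bool = False) -> float:
    """Hypothetical wrapper exposing the three design choices: discard
    terms with df > N/2, score with the pruned document length
    (dl - removed), and/or score with the recomputed average length."""
    if drop_frequent and df > N / 2:
        return 0.0                                # term dropped at pruning time
    eff_dl = dl - removed if update_dl else dl
    eff_avgdl = avgdl_pruned if update_avgdl else avgdl_orig
    return bm25_weight(tf, df, N, eff_dl, eff_avgdl)

With these switches, the baseline of Figure 1 corresponds to all three flags set to False, c1 to drop_frequent and update_dl set to True, and c2 additionally sets update_avgdl to True.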
Considering the precision values obtained with the original, non-pruned inverted file as our baseline, precision can be retained up to a 35-50% pruning level in most cases, although the method tends to favour short queries and P@10. In that case, MAP (P@10) stays above the baseline up to a 30-50% (60-70%) pruning level. Maximum improvements reach 12% for MAP and 10% for P@10. These values are collection-dependent and lower with long queries. The behaviour is better on the web collections than on disks 4&5, where MAP improvements are very small, even though the original precision value can still be maintained up to a 40-60% pruning level.

Parameter selection is crucial, and different algorithm settings lead to totally different performance. P@10 shows good behaviour with any query size, whereas MAP behaves better with short queries. Some parameter combinations (document length update, selective term removal) are consistently better than the rest, although the improvements are more noticeable on the web collections (WT2g especially). Finally, the curves are not always monotonically decreasing, and high pruning values may increase precision, probably because the scoring function ranks some bad document-term pointers highly, and those pointers are removed at high pruning levels.

We end this section by showing that the effect of pruning with the dl's updated but avgdl left unchanged is to alter the tf contributions in BM25's score in a particular fashion. The tf normalisation in BM25 is $tfn = tf / nf$, where $nf = (1 - b) + b \cdot dl / avgdl$ and $b \in [0, 1]$ is a constant (typically 0.75). If the pruning algorithm removes $\delta$ term occurrences from a document, the new normalisation factor is $nf' = nf - b \cdot \delta / avgdl$. This has the global effect of softening the dl contribution, and it can be compared with selecting a lower b value: taking $b' = b - \lambda$ yields a factor $nf'' = nf - \lambda \cdot (dl / avgdl - 1)$, and hence both normalisation factors have the same analytical form for long documents ($dl > avgdl$) and differ otherwise. (A small numeric check of this equivalence is sketched at the end of the paper.)

3. CONCLUSIONS
In this work we outlined and experimented with several new variants of the best-known inverted file pruning algorithm [2], and showed how the technique can be tweaked so that precision improves when some information is removed selectively. As a main conclusion, discarding pointers from an inverted file off-line and independently of the queries can deliver satisfactory results, both efficiency- and effectiveness-wise, for ad-hoc retrieval. In addition, carefully addressing the algorithmic issues involved brings some new intuitions regarding weighting schemes and term frequency normalisation.

Acknowledgements. The work reported here was co-funded by SEUI and FEDER under project MEC TIN2005-08521C02 and by "Xunta de Galicia" under project PGIDIT06PXIC10501PN.

4. REFERENCES
[1] D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer. Juru at TREC 10: Experiments with index pruning, 2001.
[2] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proc. of SIGIR 2001, pages 43-50.
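Below is a small numeric check of the normalisation equivalence sketched in Section 2. The concrete values of b, dl, avgdl and the number of removed occurrences are arbitrary assumptions chosen only to exercise the formulas; they are not taken from the experiments.

# Updating dl (but not avgdl) after pruning changes the BM25 normalisation
# factor in the same way as lowering b, whenever dl > avgdl.
b, dl, avgdl, delta = 0.75, 1500.0, 1000.0, 200.0   # arbitrary example values

nf = (1 - b) + b * dl / avgdl                     # original factor
nf_pruned = (1 - b) + b * (dl - delta) / avgdl    # dl updated, avgdl kept
# Equivalent lower-b setting: solve nf - lam * (dl/avgdl - 1) = nf_pruned
lam = b * delta / (dl - avgdl)                    # positive only when dl > avgdl
nf_lower_b = (1 - (b - lam)) + (b - lam) * dl / avgdl

print(nf, nf_pruned, nf_lower_b)                  # 1.375 1.225 1.225
assert abs(nf_pruned - nf_lower_b) < 1e-9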