Towards Robust Query Expansion: Model Selection In The Language Modeling Framework

Mattan Winaver, Oren Kurland and Carmel Domshlak
Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology
mattanw@tx.technion.ac.il, kurland@ie.technion.ac.il, dcarmel@ie.technion.ac.il

ABSTRACT

We propose a language-model-based approach for addressing the performance robustness problem -- with respect to free-parameters' values -- of pseudo-feedback-based query-expansion methods. Given a query, we create a set of language models representing different forms of its expansion by varying the parameters' values of some expansion method; then, we select a single model using criteria originally proposed for evaluating the performance of using the original query, or for deciding whether to employ expansion at all. Experimental results show that these criteria are highly effective in selecting relevance language models that are not only significantly more effective than poor-performing ones, but that also yield performance that is almost indistinguishable from that of manually optimized relevance models.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval Models
General Terms: Algorithms, Experimentation
Keywords: query expansion, language models, query clarity, model selection, query drift

1. INTRODUCTION

Automatic query expansion based on pseudo feedback [12] is an effective approach for addressing issues arising from the use of short queries. The idea is to construct a query model using terms from both the original query and documents that are the highest-ranked ones by some initial search. However, since typically the documents from the initial search are not all relevant, and do not necessarily exhibit all query aspects [4], the constructed query model is often far from being "optimal". Indeed, the resultant retrieval performance is often highly sensitive to the choice of the number of documents and the number of terms used for defining the model. Furthermore, for some queries, retrieval based on the original query results in performance superior to that resulting from the use of an expansion model.
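As an illustration of the free parameters involved, the following is a minimal sketch, not taken from the paper, of a basic pseudo-feedback expansion model in the spirit of relevance models [9]: an expanded unigram query model is estimated from the top-ranked documents of an initial search and truncated to the highest-probability terms. The uniform document weighting, the function name, and the truncate-then-renormalize step are simplifying assumptions for exposition; relevance models weight documents by their query likelihood.

```python
from collections import Counter, defaultdict

def expansion_model(initial_ranking, num_docs=50, num_terms=100):
    # Build an expanded unigram query model from the top-ranked documents of an
    # initial search (initial_ranking: documents as term lists, best first).
    # num_docs and num_terms are the free parameters to which retrieval
    # performance is highly sensitive. Documents are weighted uniformly here,
    # a simplification of the relevance-model estimate [9].
    top_docs = initial_ranking[:num_docs]
    expanded = defaultdict(float)
    for doc_terms in top_docs:
        counts = Counter(doc_terms)
        doc_len = sum(counts.values())
        for w, c in counts.items():
            expanded[w] += (c / doc_len) / len(top_docs)
    # Keep only the num_terms highest-probability terms and renormalize.
    top_terms = sorted(expanded.items(), key=lambda item: -item[1])[:num_terms]
    norm = sum(p for _, p in top_terms)
    return {w: p / norm for w, p in top_terms}
```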
Thus, recent research on improving the robustness of expansion methods has focused on either predicting whether a given expansion will be more effective for retrieval than the original query [2, 7], or on improving the performance robustness of specific expansion methods [10, 13].

Inspired by work on combining multiple, mainly boolean-based, query representations [3], we propose a new approach for improving the robustness of automatic expansion methods. Given a query, we (i) create a set of models that represent different forms of its expansion, (ii) estimate which model will yield the best retrieval performance, and (iii) use the chosen model for retrieval. The set of models considered for a query can be derived, for instance, by using different expansion methods, and/or by varying the values of the free parameters that these methods incorporate. Our estimates of which model will perform best utilize measures that were proposed to either predict the resultant performance of using the original query [6], or to decide whether the query should be expanded at all [7]; specifically, we use the language modeling framework [5] and adapt the query clarity [6] and distance between retrieved document lists [7] measures, respectively. Our experimental results show that using these selection criteria over sets of relevance models [9] results in performance that is significantly better than that of the poor-performing relevance models, and is in many cases almost as good as that of a manually optimized one.

2. MODEL-SELECTION CRITERIA

Throughout this section we assume that the following have been fixed: a corpus C of documents, a query q, and a set of language models M = {M1, ..., Mn} that represent "expanded forms" of q.

Our first model-selection criterion, denoted clr, is based on the query clarity measure [6] -- we choose the model in M that maximizes the KL divergence D(p(·|Mi) || p(·|C)) = Σ_w p(w|Mi) log( p(w|Mi) / p(w|C) ), where p(w|C) is the maximum likelihood estimate of term w with respect to C. Note that in the original proposal of query clarity [6], the divergence is calculated for a single expansion model, and is shown to correlate with the retrieval performance of using the original query q. In contrast, here we assume that the divergence is correlated with the performance of using the expansion model itself.

The selection criterion just described does not, however, handle the query drift problem [11] -- an expansion model can be "clear" (i.e., distant from the corpus model) and yet represent topic(s) different from those exhibited by q. We therefore experiment with the ndrift selection criterion, which utilizes an approach originally suggested for deciding whether to expand a query at all [7]. Specifically, for each (expanded) query model Mi, we construct a (unigram) language model M'i by (i) constructing the unsmoothed unigram language models of the top-100 documents retrieved in a search performed over C using the KL-retrieval method [8] with Mi as the query model, and (ii) smoothing the uniform interpolation of these 100 language models with the Jelinek-Mercer approach, wherein the smoothing parameter is set to 0.2. We then choose the Mj for which D(M'q || M'j) is minimal, where Mq is an unsmoothed unigram language model constructed from q, and M'q is constructed as above using the top-100 documents retrieved from a search with Mq as the query model.

Once a query model M is selected from the set M by either of the two criteria just described, we can rank the documents in the corpus using the KL-retrieval approach [8] with M.
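To make the two criteria concrete, here is a minimal sketch of clr and ndrift selection; it is ours, not the authors' implementation. It assumes a helper retrieve(query_model, k) that returns the term sequences of the top-k documents under KL-retrieval [8]; that helper, the function names, and the Jelinek-Mercer convention of placing weight 0.2 on the corpus model are illustrative assumptions. Candidate models and the corpus model are plain term-to-probability dictionaries.

```python
import math
from collections import Counter, defaultdict

def kl(p, q, floor=1e-10):
    # D(p || q), summed over the support of p; q is floored to avoid log(0).
    return sum(pw * math.log(pw / max(q.get(w, 0.0), floor))
               for w, pw in p.items() if pw > 0)

def ml_model(terms):
    # Unsmoothed (maximum-likelihood) unigram language model of a term sequence.
    counts = Counter(terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clr(candidate, corpus_lm):
    # Clarity criterion: divergence of the candidate model from the corpus model.
    return kl(candidate, corpus_lm)

def list_model(query_model, corpus_lm, retrieve, k=100, lam=0.2):
    # The list-based model (M' in Section 2): uniform mixture of the ML models of
    # the top-k retrieved documents, Jelinek-Mercer smoothed with the corpus model
    # (assumed convention: weight lam on the corpus model).
    docs = retrieve(query_model, k)
    mix = defaultdict(float)
    for doc_terms in docs:
        for w, p in ml_model(doc_terms).items():
            mix[w] += p / len(docs)
    vocab = set(mix) | set(corpus_lm)
    return {w: (1 - lam) * mix.get(w, 0.0) + lam * corpus_lm.get(w, 0.0) for w in vocab}

def select_model(query_terms, candidates, corpus_lm, retrieve, criterion="ndrift"):
    # Pick a single expansion model from the candidate set with either criterion.
    if criterion == "clr":
        return max(candidates, key=lambda m: clr(m, corpus_lm))
    # ndrift: prefer the candidate whose retrieved-list model is closest to the
    # list model induced by the original query (i.e., minimal query drift).
    q_list = list_model(ml_model(query_terms), corpus_lm, retrieve)
    return min(candidates, key=lambda m: kl(q_list, list_model(m, corpus_lm, retrieve)))
```

Note that, as the paper observes in Section 3, ndrift is the more expensive criterion: it issues one retrieval per candidate model in order to build its list model, whereas clr requires no retrieval beyond the one performed with the finally selected model.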
3. EXPERIMENTS

To derive the set of query models M, we use either a relevance model [9] or an interpolated relevance model [1]; the latter interpolates the former with the original query model. We set the number of top-retrieved documents and the number of terms, for constructing the (interpolated) relevance models, to values in {5, 10, 20, 30, 40, 50, 75, 100} and {50, 100, 500, 1000} respectively¹. Thus, for each query we get two sets of 32 (interpolated) relevance models. For each set we identify the best and worst performing models in terms of MAP as measured over all tested queries².

We ran our experiments on TREC corpora (details in Figure 1) using the Lemur toolkit (www.lemurproject.org). We used topics' titles as queries, applied the Porter stemmer and removed INQUERY stopwords. Statistical significance was determined using the two-sided Wilcoxon test at the 95% confidence level.

Corpus     Queries    Disks
AP89       1-50       1
AP88+89    101-150    1,2
WSJ        101-150    1,2
SJMN       51-150     3
TREC8      401-450    4,5

           Relevance Model                           Interpolated Relevance Model
           worst    best      clr        ndrift      worst    best      clr        ndrift
AP89       17.5     24.36^w   22.56^wb   23.44^wb    19.7     24.76^w   23.24^wb   23.89^wb
AP88+89    20.26    29.35^w   27.65^w    28.62^w     26.61    30.39^w   29.44^w    30.36^w
WSJ        15.21    28.2^w    26.57^wb   27.65^w     23.25    29.63^w   28.11^wb   29.11^w
SJMN       17.22    23.96^w   21.97^wb   23.5^w      19.78    24.58^w   22.78^w    24.16^wb
TREC8      17.6     24.22^w   22.19^wb   24.01^w     24.23    27.17^w   26.05^w    27.07^wb

Figure 1: The upper table presents the details of the TREC corpora used for the experiments. The lower table depicts the MAP performance numbers of the (interpolated) relevance model that was chosen from a set of candidates either manually (worst, best) or by using a model-selection criterion (clr, ndrift). Statistically significant differences with worst and best are marked with w and b, respectively.

Figure 1 depicts the MAP performance results.³ Our first observation is that, indeed, a poor selection of a relevance model results in performance that is much inferior to that of a manually optimized one. (Note that best outperforms worst to a statistically significant degree in all cases.) We also see in Figure 1 that both selection criteria result in performance that is always significantly better than that of worst. Furthermore, the performance of ndrift is in most cases statistically indistinguishable from that of best. While the resultant performance of using ndrift is always better than that resulting from using clr, we note that the former requires running all the different query models while the latter only requires running the selected model.

¹ For constructing the relevance models, the value of the smoothing parameter of the top-retrieved documents' language models is set to 0.2.
² The interpolation parameter of all interpolated relevance models, which controls the reliance on the original query model, is fixed to the (manually optimized) value chosen for the best model.
³ Similar performance patterns are obtained with respect to precision@10. The actual performance numbers are omitted due to space considerations.
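For concreteness, the candidate sets used in these experiments can be generated along the following lines. This sketch reuses the illustrative expansion_model function from the Introduction; the interpolate helper (for the interpolated relevance model [1]) and the parameter name alpha, denoting the reliance on the original query, are assumptions consistent with the footnotes above rather than the authors' code.

```python
from itertools import product

NUM_DOCS = [5, 10, 20, 30, 40, 50, 75, 100]
NUM_TERMS = [50, 100, 500, 1000]

def interpolate(query_lm, expansion_lm, alpha):
    # Interpolated relevance model [1]: mix the original query model back into
    # the expansion model; alpha controls the reliance on the original query.
    vocab = set(query_lm) | set(expansion_lm)
    return {w: alpha * query_lm.get(w, 0.0) + (1 - alpha) * expansion_lm.get(w, 0.0)
            for w in vocab}

def candidate_set(initial_ranking, query_lm=None, alpha=None):
    # One candidate per (num_docs, num_terms) pair: 8 x 4 = 32 models per query.
    candidates = []
    for k, t in product(NUM_DOCS, NUM_TERMS):
        model = expansion_model(initial_ranking, num_docs=k, num_terms=t)
        if query_lm is not None and alpha is not None:
            model = interpolate(query_lm, model, alpha)
        candidates.append(model)
    return candidates
```

A selection criterion (select_model in the sketch of Section 2) is then applied to the resulting 32 candidates for each query.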
4. CONCLUSION

We showed that using language-model-based selection criteria for choosing a query model from a set of candidates results in performance that is significantly better than that of a poor-performing query model and is almost (statistically speaking) as good as that of a manually optimized one.

Acknowledgments

We thank the anonymous reviewers for their comments. The paper is based upon work done in part while the second author was at Cornell University and upon work supported in part by the National Science Foundation under grant no. IIS-0329064. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of any sponsoring institutions or the U.S. government.

5. REFERENCES

[1] N. Abdul-Jaleel, J. Allan, W. B. Croft, F. Diaz, L. Larkey, X. Li, M. D. Smucker, and C. Wade. UMASS at TREC 2004 -- novelty and hard. In Proceedings of TREC-13, 2004.
[2] G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness, and selective application of query expansion. In Proceedings of ECIR, pages 127-137, 2004.
[3] N. J. Belkin, C. Cook, W. B. Croft, and J. P. Callan. The effect of multiple query representations on information retrieval system performance. In Proceedings of SIGIR, pages 339-346, 1993.
[4] C. Buckley. Why current IR engines fail. In Proceedings of SIGIR, pages 584-585, 2004. Poster.
[5] W. B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Number 13 in Information Retrieval Book Series. Kluwer, 2003.
[6] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of SIGIR, pages 299-306, 2002.
[7] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. A language modeling framework for selective query expansion. Technical Report IR-338, Center for Intelligent Information Retrieval, University of Massachusetts, 2004.
[8] J. D. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR, pages 111-119, 2001.
[9] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of SIGIR, pages 120-127, 2001.
[10] X. Li and W. B. Croft. Improving the robustness of relevance-based language models. Technical Report IR-401, Center for Intelligent Information Retrieval, University of Massachusetts, 2005.
[11] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proceedings of SIGIR, pages 206-214, 1998.
[12] I. Ruthven and M. Lalmas. A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review, 18(2):95-145, 2003.
[13] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. In Proceedings of SIGIR, pages 162-169, 2006.