Improving Weak Ad-Hoc Queries using Wikipedia as External Corpus

Y. Li, R.W.P. Luk, E.K.S. Ho and F.L. Chung
Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR
{csyhli, csrluk, csksho, cskchung}@comp.polyu.edu.hk

ABSTRACT
In an ad-hoc retrieval task, the query is usually short and the user expects to find the relevant documents in the first several result pages. We explored the possibility of using Wikipedia's articles as an external corpus to expand ad-hoc queries. Results show promising improvements on measures that emphasize weak queries.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Relevance Feedback, Retrieval Models

General Terms
Algorithms, Experimentation

Keywords
pseudo-relevance feedback, external corpus, Wikipedia

1. INTRODUCTION
In web retrieval tasks, the number of terms in a query is usually small (two to three on average) [1]. According to [2], if the terms cannot provide enough information about the user's need, the retrieval result may be poor. These are known as weak queries [3]. Also, the relevant documents are likely to be scattered along the retrieval list. In this case, the user may give up after inspecting the first one or two result pages without finding a relevant document.

Pseudo Relevance Feedback (PRF) is a well-known method for improving retrieval effectiveness. However, it is based on the assumption that the top retrieved documents are relevant, and it may therefore harm performance when the initial retrieval's top-ranked documents are irrelevant.

Some previous work has addressed the issue of weak queries in ad-hoc retrieval. The web assistance and data fusion method [3] probes a web search engine (e.g. Google) to form new queries, and then combines the corresponding retrieval lists. Our experiments, however, use a local repository of Wikipedia [4] articles as the external corpus. New queries are formed by analyzing Wikipedia articles, and a second retrieval on the target corpus is then performed. Results show that retrieval effectiveness, especially for weak queries, is improved.

2. WEAK QUERIES
The TREC Robust Track [5] was started in 2003 to focus on poor performing queries. Several new measures were introduced to evaluate effectiveness on weak queries. Among them, `#0p5', `#0p10', `#0p15', and `#0p20' are the numbers of queries that have zero precision at the top 5, 10, 15, and 20 retrieved documents, respectively. `Area' is a weighted sum of the average precisions of the 25% worst performing queries (e.g. 12 out of 50); the query at rank r receives weight $\sum_{k=r}^{x} \frac{1}{k}$, where r is the query rank (the weakest query has r = 1) and x is the size of the worst set. `Area' thus measures the overall performance of the weakest queries in a set and weights the weaker ones more heavily. Since 2004, another measure, Geometric MAP (GMAP) [6], was introduced as an alternative to the mean average precision (MAP). GMAP takes the geometric mean of the average precisions of all the queries instead of their arithmetic mean, in order to emphasize scores close to 0.

The following table compares our initial results (BM term weights [7]) with PRF results on the TREC Robust 2005 Track, where 40 terms are picked from the top 20 documents for PRF query expansion. As can be seen from the figures, although PRF improves MAP significantly, most of the other measures, which favor weak queries, become worse.

           init    PRF
  #0p5       11     15
  #0p10       6     10
  #0p15       5      8
  #0p20       4      6
  MAP      .2106  .2733
  GMAP     .1366  .1436
  Area     .0315  .0297

The performance of certain queries is significantly harmed by PRF. Since PRF is based on the assumption that the top N documents are relevant, when the initial retrieval is unsatisfactory, the top N documents are likely to be irrelevant and may produce ineffective PRF terms.
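As a concrete illustration of the two less common measures above, the following is a minimal Python sketch of how GMAP and `Area' could be computed from a list of per-query average precision scores. The function names and the epsilon smoothing before taking logarithms are assumptions made for illustration only, not the official trec_eval implementation.

import math

def gmap(aps, eps=1e-5):
    # Geometric mean of the average precisions.
    # eps is an assumed smoothing term to avoid log(0); the official
    # evaluation tool may handle zero AP values differently.
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))

def area(aps, fraction=0.25):
    # Weighted sum of the APs of the worst `fraction` of the queries.
    # The query at rank r (weakest has r = 1) gets weight sum_{k=r}^{x} 1/k,
    # where x is the number of queries in the worst-performing set.
    worst = sorted(aps)[:max(1, int(len(aps) * fraction))]
    x = len(worst)
    return sum(ap * sum(1.0 / k for k in range(r, x + 1))
               for r, ap in enumerate(worst, start=1))

# With 50 queries, the worst set has x = 12, matching the "12 out of 50" above.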
3. WIKIPEDIA AS EXTERNAL CORPUS
Wikipedia is a multilingual, web-based, free-content encyclopedia. Its articles are contributed by users worldwide over the Internet. Therefore, unlike traditional encyclopedias, the articles in Wikipedia are constantly being updated and new entries are created every day. Wikipedia also has detailed guidelines governing its content quality. These characteristics make it an ideal repository of background knowledge.

All the articles in Wikipedia are available for download through its weekly data dumps. We keep a local copy of this data for faster access. We indexed the articles using the Indri search engine [8]. Retrieval is performed using the Markov Random Field model for term dependencies [9], following the query formulation in [10]. 100 articles are retrieved for each query.

The Wikipedia articles are stored in their own markup language, called Wikitext, which preserves features such as categories, hyperlinks, subsections, tables, pictures, etc. We utilized the categorical information in each article to re-rank the retrieval list. Each category c is given a weight $W_c$, which equals the number of articles in the retrieval list that belong to c. Each article d is then given a weight $W_d = \sum_{c \in d} W_c$. A new document score $S'_d$ is then computed as

$$S'_d = \lambda \cdot \frac{S_d - \min S_d}{\max S_d - \min S_d} + (1 - \lambda) \cdot \frac{W_d - \min W_d}{\max W_d - \min W_d}$$

where $S_d$ is the original score and $\lambda$ is a constant equal to 0.8.
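To make the re-ranking step concrete, here is a minimal Python sketch under the assumption that each retrieved Wikipedia article is represented as a pair of its original retrieval score and its set of category names; the helper names and input format are hypothetical, and lam defaults to the 0.8 used above.

from collections import Counter

def rerank(articles, lam=0.8):
    # `articles` is a list of (score, categories) pairs (assumed input format).
    # W_c: number of retrieved articles that belong to category c.
    cat_weight = Counter(c for _, cats in articles for c in cats)
    # W_d: sum of the weights of the categories of article d.
    w = [sum(cat_weight[c] for c in cats) for _, cats in articles]
    s = [score for score, _ in articles]

    def norm(x, xs):
        # Min-max normalization; falls back to 0 when all values are equal.
        lo, hi = min(xs), max(xs)
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    combined = [lam * norm(si, s) + (1 - lam) * norm(wi, w)
                for si, wi in zip(s, w)]
    # Return the articles ordered by the combined score S'_d, best first.
    ranked = sorted(zip(combined, articles), key=lambda t: t[0], reverse=True)
    return [article for _, article in ranked]

In the experiments reported below, the top 20 articles of this re-ranked list supply the 40 expansion terms.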
The 100 articles are re-ranked according to $S'_d$, and 40 terms are picked from the top 20 articles to expand the query. A new retrieval is then performed, shown as Wiki below.

            init     PRF      Wiki
  P@5      0.4840   0.4800   0.5640
  P@10     0.4720   0.4860   0.5200
  P@15     0.4320   0.4707   0.4947
  P@20     0.4030   0.4600   0.4720
  #0p5         11       15        4
  #0p10         6       10        4
  #0p15         5        8        4
  #0p20         4        6        3
  Area     0.0315   0.0297   0.0465
  MAP      0.2106   0.2733   0.2555
  GMAP     0.1366   0.1436   0.1735

Although PRF outperforms Wiki in MAP, Wiki beats PRF in most of the other measures, which favor weak queries. A look at the per-query results shows that, among the 15 queries hurt by PRF, 14 are improved by Wiki. For the other 35 queries, PRF and Wiki both produce better results than the initial retrieval, though Wiki is not as effective as PRF. This is probably because of the differences in language and context between Wikipedia (an up-to-date, general-knowledge encyclopedia) and the target corpus (a news archive within a fixed time range). Also, since the size of Wikipedia is limited, some topics may not be covered well.

            15 PRF- queries          35 PRF+ queries
            init    PRF    Wiki      init    PRF    Wiki
  MAP      .1436  .1166  .1813     .2396  .3413  .2868
  GMAP     .0761  .0256  .1042     .1756  .2645  .2157

For the 15 queries hurt by PRF, a Wilcoxon signed-ranks test with p < 0.05 shows significant increases of Wiki over PRF. For the remaining queries, the test shows that both methods give significant increases over the initial retrieval. Similarly, there are also 15 queries hurt by Wiki. The difference from those hurt by PRF, however, is that these queries performed well in the initial retrieval, with MAP = 0.2094 and GMAP = 0.1279. Manually inspecting the Wikipedia articles reveals that some of these queries are not well covered, while for the others, term selection failed to pick a good set of terms for expansion.

The above results suggest two possible ways for improvement. The first is to identify topics that are not well covered in Wikipedia, which could be done by analyzing the top N retrieved Wiki articles. The second is to integrate the Wiki and PRF methods, hopefully to eliminate the queries hurt by either.

4. RELATED WORKS
Many other works have used an external corpus to improve retrieval. However, many of them use a large corpus, based on the belief that the likelihood of finding good expansion terms increases as the database size increases; some probe the web, which has billions of documents. Diaz and Metzler [11] stated that the quality of the corpus also matters, as they were able to significantly improve on the TREC Robust 2005 track [10] with the BIGNEWS corpus (6.4 million documents), which consists of similar content (both are news articles). Wikipedia, with 1.6 million documents at the moment, is much smaller than BIGNEWS. However, with all the policies and guidelines regulating Wikipedia, as well as the constant addition of articles, we believe it can be a high-quality corpus for general-purpose term expansion. Also, with most Wikitext markup features (e.g. links, sections, references) remaining unexplored, it certainly has a lot of potential. The challenge is how to make use of such features to find information accurately and select terms more effectively.

5. CONCLUSION
We explored Wikipedia as an external corpus to expand queries. Wikipedia is especially useful for improving weak queries that PRF is unable to improve. Despite its comparatively smaller size, we believe its content quality and evolving nature make it a good resource. In the future, we will look for more effective ways to find information in Wikipedia and to integrate it with existing systems.

6. REFERENCES
[1] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. From E-Sex to E-Commerce: Web search changes. IEEE Computer, 35(3):107-109, 2002.
[2] C. Buckley. Why current IR engines fail. In SIGIR, pages 584-585, 2004.
[3] K.L. Kwok, L. Grunfeld and P. Deng. Improving weak ad-hoc retrieval by web assistance and data fusion. In AIRS, pages 17-30, 2005.
[4] Wikipedia, the free encyclopedia. http://www.wikipedia.org.
[5] E.M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.
[6] S.E. Robertson. On GMAP: and other transformations. In CIKM, pages 78-83, 2006.
[7] S.E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232-241, 1994.
[8] T. Strohman, D. Metzler, H. Turtle and W.B. Croft. Indri: A language-model based search engine for complex queries. In International Conference on Intelligence Analysis, 2005.
[9] D. Metzler and W.B. Croft. A Markov random field model for term dependencies. In SIGIR, pages 472-479, 2005.
[10] D. Metzler, F. Diaz, T. Strohman, and W.B. Croft. UMass Robust 2005: Using mixture of relevance models for query expansion, 2005.
[11] F. Diaz and D. Metzler. Improving the estimation of relevance models using large external corpora. In SIGIR, pages 154-161, 2006.