Improving QA Retrieval using Document Priors

James Mayfield and Paul McNamee
The Johns Hopkins University Applied Physics Laboratory
11100 Johns Hopkins Rd., Laurel MD 20723-6099 USA
+1 (240) 228-6944 / +1 (240) 228-3816
james.mayfield@jhuapl.edu, paul.mcnamee@jhuapl.edu

ABSTRACT
We present a simple way to improve document retrieval for question answering systems. The method biases the retrieval system toward documents that contain words that have appeared in other documents containing answers to the same type of question. The method works with virtually any retrieval system, and exhibits a statistically significant performance improvement over a strong baseline.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - query formulation, search process.

General Terms
Algorithms, Experimentation.

Keywords: Document retrieval; question answering; document priors.

1. Approach
Question answering systems typically begin their processing of a question by performing document retrieval to find documents that are likely to contain an answer to the question. Retrieval results are either used directly, or forwarded to a passage retrieval system. Some systems (e.g., Cui et al. [2]) place strong emphasis on the retrieval phase, performing sophisticated processing on the question and document set, and using complex algorithms to assess matches; others use traditional approaches to retrieval, relying on subsequent processing stages to narrow the search.
We were interested in whether the probability that a document contains an answer to a question might be assessed before the question is seen. That is, we wanted to know whether certain documents have a higher prior probability of containing question answers than others. The mathematics of many language modeling approaches to information retrieval easily admits the use of document priors (see, e.g., Miller et al. [4]). However, that approach works only with certain retrieval systems, and may demand a complex training phase.
An alternative to embedding document priors into the retrieval system's similarity metric is to augment the query with terms that are likely to appear in documents that have a higher prior probability of relevance. For example, if we expect a document containing the word north to have a higher-than-average probability of being relevant to a WHERE question, we might augment any WHERE question with the word north. Because north is unlikely to be as important in relevant documents as the words from the original query, we give it a small weight relative to the weights of the original terms. With this approach, we can simulate the use of document priors using almost any retrieval system.
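To make the augmentation step concrete, the following is a minimal sketch of how a weighted query might be assembled from the original terms plus type-specific augmentation words. It is an illustration under assumptions rather than the authors' implementation: the function name augment_query and the (term, weight) representation are ours, while the 25:1 down-weighting ratio and the WHERE word list are taken from Section 3 and Table 1 below.

```python
# Illustrative sketch only; not the paper's actual code.

# Augmentation terms learned per question type (WHERE values from Table 1).
AUGMENTATION_TERMS = {
    "WHERE": ["war", "north", "ago", "rock", "west",
              "near", "museum", "across", "began", "miles"],
    # ... one list per question type (HOW, HOW_MANY, WHAT_IS, WHEN, WHO, OTHER)
}

def augment_query(query_terms, question_type, ratio=25.0):
    """Return (term, weight) pairs: original query terms at full weight,
    type-specific augmentation terms down-weighted by `ratio`."""
    weights = {term: 1.0 for term in query_terms}
    for term in AUGMENTATION_TERMS.get(question_type, []):
        # Never overwrite an original query term with a smaller weight.
        weights.setdefault(term, 1.0 / ratio)
    return sorted(weights.items(), key=lambda kv: -kv[1])

# A WHERE question picks up low-weight, location-flavored terms.
print(augment_query(["where", "is", "belize", "located"], "WHERE"))
```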
2. Implementation
Three questions must be answered to complete this approach to augmenting queries:
1. What question types will we use? Augmenting all questions with the same words is not likely to lead to large improvements (north is unlikely to help with WHO questions). Conversely, using too many question types would necessitate large training sets.
2. How should the set of augmentation words for a given type of question be selected?
3. How should augmentation words be weighted relative to original query terms?
We used the TREC-2002, -2003, and -2004 questions and judgments as training data [5]. For our experiments, we used a taxonomy of seven question types: HOW, HOW_MANY, WHAT_IS, WHEN, WHERE, WHO and OTHER. This provided categories that might reasonably benefit from different augmentation words, while leaving a reasonable amount of training data for each category. To automatically assign a question to its question type, we first parse the question using the Charniak parser [1]. We then use a simple pattern-matching approach to map linearized parse trees onto the first six question types. Any question that does not match a pattern is assigned to the OTHER category.
Once the training questions are partitioned into question types, we want to use those assignments to identify augmentation words. However, the straightforward approach of identifying important terms in relevant documents would undoubtedly identify content words particular to the training questions, rather than more general words relevant to a broad range of questions. The relevance judgments (qrels) from TRECs 2002 through 2004 are not restricted to relevant documents, though; they also list documents that were put forward by one or more systems but judged not relevant by the assessors. We therefore use the full range of relevance judgments from the training data to select augmentation words. For each question type, we identify two document sets: those listed as relevant for questions of that type, and those that appeared in the judgments but were judged not relevant. The TREC QA judgments are not binary, but have four relevance values (descriptions taken from the track guidelines):
1. incorrect: the answer-string does not contain a correct answer or the answer is not responsive;
2. unsupported: the answer-string contains a correct answer but the document returned does not support that answer;
3. non-exact: the answer-string contains a correct answer and the document supports that answer, but the string contains more than just the answer (or is missing bits of the answer); and
4. correct: the answer-string consists of exactly a correct answer and that answer is supported by the document returned.
For our purposes, we treat correct and non-exact as relevant, and all others as non-relevant.
Given these two document sets (judged relevant and judged non-relevant), we want to extract terms that are prominent in the relevant documents but not in the non-relevant documents. To do so, we first consider each document set separately, extracting words that appear frequently in that set but relatively infrequently in the collection as a whole. We use a home-brew 'affinity' statistic for this purpose [3], but other measures, such as mutual information or the Dice coefficient, might work as well. The result is an ordered list of scored terms for each set. We then score each term by the difference of its scores in the two sets, and select the top-scoring terms as the expansion terms; a sketch of this selection step follows Table 1.
We generated augmentation terms for each question type. Sample augmentation terms are shown in Table 1. Some are clearly sensible, such as north, near and miles for WHERE questions. Others, such as rock and ago, make less intuitive sense. Using retrieved but non-relevant documents to identify terms that should not appear in the augmentation lists eliminates most question-specific terms. However, a fair amount of noise remains in these augmentation term lists. Nonetheless, we used the unaltered lists in our experiments.

Table 1. Top ten augmentation words for three categories.

    HOW          WHEN       WHERE
    built        singer     war
    park         death      north
    scientists   thought    ago
    orbit        America    rock
    became       war        west
    NASA         space      near
    sun          history    museum
    water        king       across
    found        William    began
    thought      II         miles
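The 'affinity' statistic itself is not spelled out in this paper, so the sketch below substitutes a simple smoothed frequency-lift score (relative frequency in the document set divided by relative frequency in the collection); the function names, the smoothing constant, and the data layout are illustrative assumptions. What the sketch does preserve is the structure of the procedure described above: score terms within the judged-relevant and judged-non-relevant sets separately against the collection, take the difference, and keep the top-scoring terms.

```python
# Illustrative stand-in for the term-selection step; the 'affinity'
# statistic of [3] is replaced here by a simple smoothed frequency lift.
from collections import Counter

def lift_scores(doc_set, collection_counts, collection_size, smooth=0.5):
    """Score each word by how much more frequent it is in `doc_set`
    (a list of tokenized documents) than in the collection overall."""
    counts = Counter(word for doc in doc_set for word in doc)
    total = sum(counts.values())
    return {
        word: ((c + smooth) / (total + smooth)) /
              ((collection_counts.get(word, 0) + smooth) / (collection_size + smooth))
        for word, c in counts.items()
    }

def augmentation_terms(relevant_docs, nonrelevant_docs,
                       collection_counts, collection_size, k=10):
    """Keep the k terms most prominent in judged-relevant documents
    after discounting their prominence in judged-non-relevant documents."""
    rel = lift_scores(relevant_docs, collection_counts, collection_size)
    non = lift_scores(nonrelevant_docs, collection_counts, collection_size)
    diff = {word: score - non.get(word, 0.0) for word, score in rel.items()}
    return [word for word, _ in sorted(diff.items(), key=lambda kv: -kv[1])[:k]]
```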
3. Evaluation
We built our augmentation sets using TREC-2002, -2003 and -2004 data, and tested the approach using data from the TREC-2005 Document Ranking subtask of the Question Answering track [6]. This task used fifty questions, all with answer documents in the collection. Assessors generated binary relevance judgments for the pool of top documents from 77 submitted runs.
To process each test question, we first assign it a question type as described above. We then build a new query comprising the terms from the original query plus the expansion terms for the selected question type. We weight original query terms at a ratio of 25:1 relative to the expansion terms. We used three-fold cross-validation on the TREC-2002, -2003 and -2004 data to measure performance at ratios of 25:1, 100:1 and 250:1, with 25:1 proving the best of the three. More careful tuning of this parameter might lead to significant additional gains from the technique. Finally, we processed the augmented query normally; we used the HAIRCUT retrieval system [3] with unstemmed words as indexing terms, a unigram language model with λ = 0.5, and no blind relevance feedback.
Our results show good improvement from the use of this technique. Mean average precision without augmentation was 0.3364. This is a reasonably strong baseline; if entered into the TREC-2005 Document Ranking task, these results would have ranked sixth. When augmentation words were added to the queries, mean average precision rose to 0.3528. Augmentation thus produced a 4.8% relative improvement, which would have ranked fourth if entered in TREC-2005. This result is statistically significant, by a Wilcoxon test, well below the 0.01 level. The percentage of questions with at least one relevant document in the top ten rose from 0.86 to 0.90. These gains were achieved with a simplistic question type assignment mechanism and with little tuning of the weighting parameter; it is reasonable to expect that further gains might be seen if these two weaknesses were addressed.
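The significance claim above rests on a Wilcoxon test over paired per-question scores. The snippet below shows the shape of such a test using SciPy's signed-rank implementation; the per-question average precision values are placeholders, since the paper reports only aggregate figures (MAP 0.3364 vs. 0.3528, p < 0.01).

```python
# Wilcoxon signed-rank test over paired per-question average precision.
# The score lists are placeholders, not the paper's actual per-question data.
from statistics import mean
from scipy.stats import wilcoxon

baseline_ap  = [0.31, 0.42, 0.05, 0.77, 0.28]   # placeholder values
augmented_ap = [0.36, 0.44, 0.08, 0.76, 0.35]   # placeholder values

stat, p_value = wilcoxon(baseline_ap, augmented_ap)
print(f"MAP baseline={mean(baseline_ap):.4f}, "
      f"augmented={mean(augmented_ap):.4f}, p={p_value:.4f}")
```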
4. Conclusions
Assigning different document priors based on question type can produce a boost in retrieval performance. Our technique of adding query terms based on the question type is a way to exploit non-uniform document priors in any retrieval system, without modifying the index. In spite of a simplistic question type assignment algorithm and a roughly tuned query term weight, we saw a statistically significant 4.8% performance boost from the technique on the TREC-2005 Document Ranking task, as well as gains in our own cross-validation experiments on the question sets from the three prior TRECs. We therefore recommend the technique as a way to boost retrieval effectiveness for question answering systems. Careful tuning of the approach would likely increase that boost.

5. REFERENCES
[1] Charniak, E. 'A maximum-entropy-inspired parser.' Proceedings of the 1st Meeting of the NAACL, Seattle, Washington, pp. 132-139. 2000.
[2] Cui, H., Sun, R., Li, K., Kan, M. and Chua, T. 'Question answering passage retrieval using dependency relations.' Proceedings of the 28th ACM SIGIR Conference, pp. 400-407. 2005.
[3] Mayfield, J. and McNamee, P. 'The HAIRCUT information retrieval system.' Johns Hopkins APL Technical Digest 26(1):2-14. 2005.
[4] Miller, D., Leek, T. and Schwartz, R. 'BBN at TREC7: Using hidden Markov models for information retrieval.' Proceedings of the Seventh Text REtrieval Conference (TREC-7), pp. 133-142. 1999.
[5] Text REtrieval Conference Web site, http://trec.nist.gov/.
[6] Voorhees, E. and Dang, H. 'Overview of the TREC 2005 Question Answering Track.' Notebook Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005).