SIGIR 2007 Proceedings, Session 14: Question Answering

A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering

Jeongwoo Ko
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213
jko@cs.cmu.edu

Luo Si
Department of Computer Science
Purdue University, Lafayette, IN 47907
lsi@cs.purdue.edu

Eric Nyberg
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213
ehn@cs.cmu.edu

ABSTRACT

Graphical models have been applied to various information retrieval and natural language processing tasks in the recent literature. In this paper, we apply a probabilistic graphical model for answer ranking in question answering. This model estimates the joint probability of correctness of all answer candidates, from which the probability of correctness of an individual candidate can be inferred. The joint prediction model can estimate both the correctness of individual answers as well as their correlations, which makes it possible to generate an accurate and comprehensive answer list. This model was compared with a logistic regression model which directly estimates the probability of correctness of each individual answer candidate. An extensive set of empirical results based on TREC questions demonstrates the effectiveness of the joint model for answer ranking. Furthermore, we combine the joint model with the logistic regression model to improve the efficiency and accuracy of answer ranking.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms

Keywords
Answer ranking, probabilistic graphical model, question answering

1. INTRODUCTION

Question Answering (QA) aims at finding exact answers to natural language questions in a large collection of documents. Most QA systems combine document retrieval with question analysis and extraction techniques to identify a set of likely candidates, from which the final answer(s) are selected [21, 4, 9]. Since question analysis, document retrieval and/or answer extraction may produce erroneous results, the selection process can be very challenging, as it often entails identifying relevant answer(s) amongst many irrelevant ones. For example, given the question "Which city in China has the largest number of foreign financial companies?", the answer extraction component produces a ranked list of five answer candidates: Beijing (AP880603-0268)^1, Hong Kong (WSJ920110-0013), Shanghai (FBIS3-58), Taiwan (FT9422016) and Shanghai (FBIS3-45320). Due to the imprecision of answer extraction, an incorrect answer ("Beijing") can be ranked in the first position. On the other hand, the correct answer ("Shanghai") was extracted from two different documents and ranked in the third and fifth positions. In order to rank an answer like "Shanghai" in the top position, we have to address two interesting challenges:

· Answer Relevance. How do we identify relevant answer(s) amongst irrelevant ones? This task may involve searching for facts in a knowledge base, e.g. IS-IN(Shanghai, China), IS-A(Shanghai, city).

· Answer Similarity. How do we exploit similarity among answer candidates? For example, when there are redundant answers ("Shanghai", as above) or several answers which represent a single instance (e.g. "Clinton, Bill" and "William Jefferson Clinton") in the candidate list, how much should we boost the rank of the answer candidate? Effective handling of redundancy is also important when identifying a set of novel answers for list or definition questions.

Although many QA systems address these issues separately, there has been little research on a single probabilistic framework into which arbitrary relevance and similarity features can be easily incorporated. In our previous work [14], we proposed a probabilistic answer ranking model to address these two challenges.

^1 Answer candidates are shown with the identifier of the TREC document where they were found.
The model used logistic regression to estimate the probability that an individual answer candidate is correct, given the relevance of the answer and the amount of supporting evidence provided by a set of similar answer candidates. Experimental results on TREC factoid questions show that the model significantly improves answer ranking performance for different extraction techniques. However, this model considered each answer candidate separately, and did not model correlations among the correctness of answer candidates, which is problematic for generating accurate and comprehensive answers. For example, several similar answers may be ranked high in the final answer list, while a less redundant answer may not be ranked high enough to attract the user's attention.

In this paper, we propose a new probabilistic answer ranking framework that uses an undirected graphical model to estimate the joint probability of the correctness of all answer candidates, from which the probability of the correctness of an individual candidate can be inferred. The proposed model considers both the correctness of individual answers and their correlations in order to generate an accurate and comprehensive answer list. In comparing the previous logistic regression model (which considers answers independently) with the new graphical model (which considers answers jointly), we will refer to the former as the independent prediction model and to the latter as the joint prediction model. Experimental results on TREC factoid questions [25] show that the joint prediction model significantly improves answer ranking performance over a baseline model, and produces a set of unique answers whose precision is higher than that of the set produced by the independent prediction model. Furthermore, we incorporate the independent prediction model into the joint prediction model to improve the efficiency and accuracy of answer ranking.

This paper is organized as follows: Section 2 describes related work. Section 3 presents the independent prediction model and the joint prediction model. Section 4 lists the features used by the models. Section 5 describes our experimental methodology and results. Finally, Section 6 concludes with suggestions for future research.

2. RELATED WORK

To identify relevant answers from a list of extracted candidates, several answer selection approaches have used external semantic resources. One of the most common approaches relies on WordNet, CYC and gazetteers for answer validation or answer reranking. In this approach, answer candidates are either removed or discounted if they are not found within the resource hierarchy corresponding to the expected answer type of the question [26, 17, 3, 6]. The Web has also been used for answer reranking, by exploiting search engine results produced by queries containing the answer candidate and question keywords [15]. Wikipedia's structured information has been used for answer type checking [2]. Although each of these approaches uses one or more semantic resources to independently support an answer, few have considered the potential benefits of combining resources together as evidence. There was an attempt to combine geographical databases with WordNet for type checking of location questions [23]. However, the experimental results show that the combination did not improve performance, because of the increased semantic ambiguity which accompanies broader coverage of location names. This is evidence that the method of combining potential answers may matter as much as the choice of resources.
Collecting evidence from similar answer candidates to boost the confidence of a specific answer candidate is also important for answer selection. As answer candidates are extracted from different documents, they may contain identical, similar or complementary text snippets. Some previous work [19, 11] has used heuristic methods, such as manually compiled rules, to cluster evidence from similar answer candidates. Graph-based clustering has also been used to account for non-transitivity in similarity [11]. Similarity detection is even more important for list questions, which require a set of unique answers. Many systems apply a cutoff threshold to select the top N most probable answers [8, 13], or exhaustively search for all possible candidates [27]. Although previous work has utilized evidence from similar answer candidates to support a specific answer candidate, these algorithms modeled each answer candidate separately and did not consider both answer relevance and answer correlation to prevent the biased influence of incorrect similar answers. As far as we know, no previous work has jointly modeled the correctness of all available answer candidates in a formal probabilistic framework, which is very important for generating an accurate and comprehensive answer list.

3. MODELS

The independent prediction model estimates the probability of correctness of each answer candidate individually. It considers two factors. The first factor tries to identify relevant answers by estimating the probability P(correct(A_i) | A_i, Q), where Q is a question and A_i is an answer candidate. The second factor tries to exploit answer similarity by estimating the probability P(correct(A_i) | A_i, A_j), where A_j is similar to A_i. By combining these two factors, the independent prediction model estimates the probability of an answer as P(correct(A_i) | Q, A_1, ..., A_n).

Instead of addressing each answer candidate separately, the joint prediction model estimates the joint probability of correctness of all available answer candidates. In particular, the joint model estimates the probability P(correct(A_1), ..., correct(A_n) | Q, A_1, ..., A_n), where n is the number of answer candidates under consideration. The marginal probability P(correct(A_i) | Q, A_1, ..., A_n) for each individual answer, as well as the conditional probability P(correct(A_i) | correct(A_j), Q, A_1, ..., A_n), can be naturally derived from the joint probability.

3.1 Independent Prediction Model

The independent prediction model [14] directly estimates the probability of correctness of each individual answer candidate. It was implemented with logistic regression. Figure 1 shows how logistic regression predicts the probability that an answer candidate is correct given multiple answer relevance features and answer similarity features.

P(\mathrm{correct}(A_i) \mid Q, A_1, \ldots, A_n)
  \approx P(\mathrm{correct}(A_i) \mid rel_1(A_i), \ldots, rel_{K_1}(A_i), sim_1(A_i), \ldots, sim_{K_2}(A_i))
  = \frac{\exp\bigl(\alpha_0 + \sum_{k=1}^{K_1} \alpha_k\, rel_k(A_i) + \sum_{k=1}^{K_2} \beta_k\, sim_k(A_i)\bigr)}
         {1 + \exp\bigl(\alpha_0 + \sum_{k=1}^{K_1} \alpha_k\, rel_k(A_i) + \sum_{k=1}^{K_2} \beta_k\, sim_k(A_i)\bigr)}

where sim_k(A_i) = \sum_{j=1,\, j \neq i}^{n} sim_k(A_i, A_j).

Figure 1: Independent prediction model

P(S_1, \ldots, S_n) = \frac{1}{Z} \exp\Biggl( \sum_{i=1}^{n} \Bigl[ \Bigl(\sum_{k=1}^{K_1} \alpha_k\, rel_k(A_i)\Bigr) S_i + \sum_{N(i)} \Bigl(\sum_{k=1}^{K_2} \beta_k\, sim_k(A_i, A_{N(i)})\Bigr) S_i S_{N(i)} \Bigr] \Biggr)

Figure 2: Joint prediction model

K_1 and K_2 are the numbers of features for answer relevance and answer similarity scores, respectively, and n is the number of answer candidates for a question. Each rel_k(A_i) is a feature function that produces an answer relevance score for an individual answer candidate A_i. Each sim_k(A_i, A_j) is a scoring function that calculates the similarity between answers A_i and A_j. Each sim_k(A_i) represents one similarity feature for an answer candidate A_i, obtained by summing the n-1 pairwise similarity scores that relate one answer candidate to all other candidates. The parameters alpha_0, {alpha_k} and {beta_k} are estimated from training data by maximizing the log likelihood; in particular, the Quasi-Newton algorithm [16] is used.

After applying the independent prediction model, answer candidates are reranked according to their estimated probability. For factoid questions, the top answer is selected as the final answer to the question. As logistic regression can be used for binary classification with a default threshold of 0.5, we may also use the model to identify incorrect answers: if the estimated probability of an answer candidate is lower than 0.5, it may be considered a wrong answer and filtered out of the answer list. This is useful in deciding whether or not a valid answer exists in the corpus [24].
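To make the estimation step concrete, the following minimal Python sketch implements the scoring formula of Figure 1. The feature values and the weights alpha_0, {alpha_k} and {beta_k} are toy placeholders (in practice they would be learned from training data as described above), and the function name independent_scores is ours, not part of any published implementation.

    import math

    def independent_scores(rel_feats, sim_matrix, alpha0, alphas, betas):
        # rel_feats[i][k]    : k-th relevance score rel_k(A_i) for candidate A_i
        # sim_matrix[k][i][j]: k-th pairwise similarity sim_k(A_i, A_j)
        # Returns the estimated P(correct(A_i)) for every candidate.
        n = len(rel_feats)
        probs = []
        for i in range(n):
            # sim_k(A_i) = sum of sim_k(A_i, A_j) over all other candidates j
            sim_feats = [sum(sim[i][j] for j in range(n) if j != i)
                         for sim in sim_matrix]
            z = (alpha0
                 + sum(a * r for a, r in zip(alphas, rel_feats[i]))
                 + sum(b * s for b, s in zip(betas, sim_feats)))
            probs.append(1.0 / (1.0 + math.exp(-z)))  # logistic function
        return probs

    # Toy run: three candidates, one relevance and one similarity feature.
    rel = [[0.9], [0.4], [0.8]]
    sim = [[[0.0, 0.1, 0.7],
            [0.1, 0.0, 0.1],
            [0.7, 0.1, 0.0]]]
    print(independent_scores(rel, sim, alpha0=-1.0, alphas=[2.0], betas=[1.5]))

Note that each candidate is scored on its own: the similarity features raise or lower an individual probability, but the model never reasons about the correctness of two candidates at once.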
3.2 Joint Prediction Model

A Boltzmann machine [10, 12] is a special type of undirected graphical model whose nodes S_i take binary values, either {0, 1} or {-1, 1}. The joint probability of this graph is represented in Equation 2:

P(S) = \frac{1}{Z} \exp\Bigl( \sum_{i} \sum_{j \neq i} \theta_{ij} S_i S_j + \sum_{i} \theta_{i0} S_i \Bigr)    (2)

where Z is the normalization constant (the partition function).
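For small candidate sets, the joint probability in Figure 2 can be evaluated exactly by enumerating all 2^n correctness assignments, as the following sketch illustrates. For readability it collapses the per-feature sums into a single relevance weight alpha and a single similarity weight beta; all numbers are illustrative rather than learned, and the helper joint_marginals is hypothetical.

    import math
    from itertools import product

    def joint_marginals(rel_scores, sim, alpha, beta):
        # rel_scores[i]: combined relevance score for candidate A_i
        # sim[i][j]    : similarity between candidates A_i and A_j
        # Enumerates every assignment S in {0,1}^n, mirroring the
        # potentials of Figure 2 (each symmetric pair is visited twice,
        # once from each endpoint, as in the sum over neighborhoods N(i)).
        n = len(rel_scores)
        weights = {}
        for s in product((0, 1), repeat=n):
            potential = sum(alpha * rel_scores[i] * s[i] for i in range(n))
            potential += sum(beta * sim[i][j] * s[i] * s[j]
                             for i in range(n) for j in range(n) if j != i)
            weights[s] = math.exp(potential)
        z = sum(weights.values())  # partition function Z
        # Marginal P(S_i = 1): total weight of assignments with S_i = 1.
        return [sum(w for s, w in weights.items() if s[i] == 1) / z
                for i in range(n)]

    # Toy run modeled on the example in the introduction.
    rel_scores = [0.9, 0.4, 0.7, 0.6]   # Beijing, Hong Kong, Shanghai, Shanghai
    sim = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.9],        # the two "Shanghai" candidates
           [0.0, 0.0, 0.9, 0.0]]        # strongly support each other
    print(joint_marginals(rel_scores, sim, alpha=2.0, beta=1.5))

With these toy weights, the two mutually supporting candidates receive higher marginals than the first candidate despite its higher individual relevance score, which is precisely the redundancy effect the joint model is designed to capture. Exact enumeration costs O(2^n) and is only feasible for small n; the approximate inference used in practice is discussed in the remainder of this section.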