Probabilistic Model for Definitional Question Answering

Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim
Dept. of Computer Science & Engineering, Korea University
1, 5-ga, Anam-dong, Seongbuk-gu, Seoul 136-701, Korea
{kshan, song, rim}@nlp.korea.ac.kr

ABSTRACT
This paper proposes a probabilistic model for definitional question answering (QA) that reflects the characteristics of definitional questions. The intention of a definitional question is to request the definition of the question target. An answer to a definitional question should therefore contain content relevant to the topic of the target and be expressed in a definition-style representation form. Modeling definitional QA from both the topic and the definition viewpoints, the proposed probabilistic model converts the task of answering definitional questions into that of estimating three language models: a topic language model, a definition language model, and a general language model. The proposed model systematically combines several sources of evidence in a probabilistic framework. Experimental results show that a definitional QA system based on the proposed probabilistic model is comparable to state-of-the-art systems.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Definitional question answering, probabilistic model, language model

1. INTRODUCTION
Definitional question answering (QA) is the task of answering definitional questions, which are used for finding conceptual facts or essential events about the question target. Definitional questions take the form of "What is X?" or "Who is X?", such as "What is fractals?" or "Who is Andrew Carnegie?", and X is called the question target. Several characteristics of definitional QA distinguish it from factoid QA, which handles questions such as "What country is Aswan High Dam located in?". Definitional questions do not clearly imply an expected answer type but contain only the question target (e.g., fractals and Andrew Carnegie in the example questions above), whereas factoid questions involve a narrow answer type (e.g., a country name for the example factoid question). Thus, it is difficult to determine which information is useful for answering a definitional question. Another difference is that a short passage cannot answer a definitional question, because a definition requires several essential pieces of information about the target. The answer to a definitional question therefore consists of several components called information nuggets, each naturally represented by a short noun phrase or verb phrase. A good definitional QA system has to find more answer nuggets within a shorter answer. Related work has been carried out actively, mainly driven by the TREC Question Answering track [17].
Most of the approaches rank answer candidates by combining various kinds of information heuristically, without any formal model for definitional QA. The heuristic approaches are widely used because they are easy to implement. These approaches use external resources, such as a dictionary or an encyclopedia, and definition patterns for ranking answer candidates [10, 6, 18, 8]. The definition patterns are also used for extracting phrases shorter than sentences [7, 19, 3]. However, since the heuristic approaches cannot explain the role of each kind of information in a conceptual model, it is difficult to analyze why system performance improves or deteriorates. Therefore, it is not easy to make systems based on the heuristic approaches perform better.

Prager et al. [10] proposed a definitional QA method which splits a definitional question into several auxiliary factoid questions and combines the factoid answers into a definitional answer. The auxiliary questions are generated by using patterns constructed manually according to the type (e.g., person) of the question target. However, it is difficult to set the auxiliary questions in advance, because the essential facts or events can differ between question targets even if the targets have the same type.

Berger et al. [1] and Soricut & Brill [14] applied a translation model to QA, considering the question and the answer as a target and a source language, respectively. They collected a large set of question-answer pairs, such as FAQ (frequently asked question) lists, and trained the probability that each word in the answer is translated to each word in the question. The translation model is a formal model for QA, but it is difficult to use for definitional QA: the model requires a large training corpus and complex word alignment between the question and the answer, and the word translation probability does not seem to be well estimated for definitional QA, because a definitional question is very short (often only the question target) while the answer is very long. Moreover, it is hard to use information other than the translation probability in the translation model.

We propose a formal model for definitional QA that considers the characteristics of definitional questions. A definitional question can be interpreted as a request for the definition of the question target. Therefore, the answer to the question should be in a definition-style representation form and related to the question target. We model definitional QA from these two points of view, topic and definition. The proposed model can naturally incorporate various kinds of information during topic and definition modeling. We explain our intuition and the proposed model in detail in Section 2 and describe a definitional QA system based on the proposed model in Section 3. Experimental results and discussions are presented in Section 4. Finally, we conclude our work in Section 5.

2. PROBABILISTIC MODEL FOR DEFINITIONAL QUESTION ANSWERING

2.1 Topic and Definition
A definitional question such as "What is X?" or "Who is X?" is a representation of the user information need "Find the definition about X". The definition about X (interpreted in a broad sense according to the TREC 2003 QA track guideline [16]) consists of conceptual facts or principal events that are worth being registered in a dictionary or an encyclopedia for explaining X. Provided that D is the set of definitions about all possible targets, the definition about X is the subset of D containing the definitions related to X. Since the topic of a question target is represented by facts or events related to the target, the definition about X can be considered as the definition that represents the topic of X. We can suppose that the following sentences are answer candidates for the question "What is NASA?".

S1: NASA is the agency responsible for the public space program of the USA.
S2: NASA was established in 1958.
S3: The headquarters of NASA is located in Washington, D.C.
S4: NASA announced the new annual budget.
S5: John, who works for NASA, gave a housewarming party yesterday.
S6: Ji-Sung Park is a famous football player from South Korea.

Among the candidates, the sentences representing the topic of "NASA" are {S1, S2, S3, S4}, and those representing the definition are {S1, S2, S3, S6}. Although the sentence S5 contains "NASA", it is about John's housewarming party and is of no use as a definition of "NASA". The sentence S4 represents the topic of the target, but it cannot be considered a definition sentence. On the other hand, the sentence S6 is a definition sentence, but it does not represent the topic of the target. Therefore, it is reasonable to regard {S1, S2, S3} as the answer to the question. The example shows that a definitional QA system needs to check whether each answer candidate represents both the topic related to the question target and the definition.

Given that T is the set of answer candidates describing the topic related to the question target and D is the set of answer candidates representing the definition, the answer A to a definitional question about the target is the intersection of T and D:

    A = {a | a ∈ T, a ∈ D}    (1)

We design a probabilistic model for definitional QA from this point of view.

2.2 Definitional Question Answering Model
Given a definitional question Q about the target X and a sentence S (we use a sentence for convenience, although any text segment can be valid), we have the following events:

- T: the sentence S includes content related to the topic of the target X.
- T̄: the sentence S includes no content related to the topic of the target X.
- D: the sentence S represents the definition.
- D̄: the sentence S does not represent the definition.

As mentioned above, the answer to a definitional question consists of the sentences which not only describe content related to the topic of the target X but also represent the definition. Therefore, definitional QA can be considered a task of finding the sentences which maximize the joint probability P(T, D|S), the probability that the sentence S both includes content related to the topic of the target X and represents the definition. The probability is rewritten by using the chain rule as follows:

    P(T, D|S) = P(T|S) P(D|T, S)    (2)

where P(T|S) is the probability that the sentence S includes content related to the topic of the target, and P(D|T, S) is the probability that a sentence S including content related to the topic of the target represents the definition. We assume that we can identify whether a sentence represents the definition or not, regardless of whether the sentence is related to the target topic. For example, we can understand that the sentence "Copland was born in Brooklyn." represents a definition without regard to the topic that the sentence describes; even if the target is not known in the sentence (i.e., "X was born in Brooklyn."), the sentence can be considered a definition sentence. Therefore, assuming T and D are conditionally independent given S, we can simplify the probability P(T, D|S) as follows:

    P(T, D|S) ≈ P(T|S) P(D|S)    (3)
              = [P(T) P(S|T) / P(S)] × [P(D) P(S|D) / P(S)]    (4)

Equation 3 is expanded into equation 4 by using Bayes' theorem. In order to answer a definitional question, it is necessary to select the sentences that deserve to be the answer by calculating the probability P(T, D|S). The task of deciding whether each sentence deserves to be the answer according to the probability P(T, D|S) can be regarded as the task of ranking the sentences according to that probability; the top-ranked sentences are selected as the answer. It is not necessary to calculate the exact probability in order to rank the sentences; the equation can instead be simplified by an order-preserving transformation. A scoring function used for ranking the sentences is defined as follows:

    DS(S) = [P(S|T) / P(S)] × [P(S|D) / P(S)]    (5)

As P(T) and P(D) in equation 4 do not affect the ranking when a score is assigned to each sentence for a given question, equation 4 can be transformed into equation 5. P(S|T) is the probability that the sentence S is generated from the target topic, P(S|D) is the probability that S is generated from the set of definition representations, and P(S) is the prior probability of the sentence S.

If the sentence S is a sequence of words w_1 w_2 · · · w_n, the function is rewritten as follows:

    DS(S) = P(w_{1,n}|T) × P(w_{1,n}|D) × P(w_{1,n})^{-2}    (6)

P(w_{1,n}|T) is the probability that the word sequence w_{1,n} is generated from the target topic, and we call it the topic language model. P(w_{1,n}|D) is the probability that w_{1,n} is generated from the definition representations, and we call it the definition language model. P(w_{1,n}) is the prior probability of w_{1,n}, and we call it the general language model. A score is assigned to each sentence by combining the probabilities of the three language models. Different formulations of the model for definitional QA are possible according to the way each probability is estimated.

2.3 Estimating Language Models

2.3.1 General Language Model
The probability P(w_{1,n}) is rewritten by using the chain rule as follows:

    P(w_{1,n}) = P(w_1) P(w_2|w_1) P(w_3|w_{1,2}) · · · P(w_n|w_{1,n-1})

Assuming the word occurrences are independent of one another, the probability is calculated by the following equation:

    P_uni(w_{1,n}) = Π_{i=1}^{n} P(w_i)

The prior probability P(w_i) of a word w_i is estimated by MLE (maximum likelihood estimation) on the whole collection in which the answer is searched:

    P(w_i) = C(w_i) / Σ_j C(w_j)

where C(w_i) is the occurrence count of word w_i in the whole collection. The probability of the general language model is also used for smoothing the probabilities in the other language models.

2.3.2 Topic Language Model
The probability P(w_{1,n}|T) of the topic language model is rewritten by using the chain rule and the independence assumption of word occurrence as follows:

    P_uni(w_{1,n}|T) = Π_{i=1}^{n} P(w_i|T)

If we had a set of ideal sentences describing the topic T about the question target X, we could model the target topic from those sentences. Instead, we model the target topic using the following sources of evidence:

- Top-ranked documents R retrieved from the collection by the query X: Pseudo relevance feedback techniques in information retrieval consider top-ranked documents as relevant ones and retrieve more relevant documents by modifying the original query based on them. This idea can be applied to topic modeling: the top-ranked documents retrieved by the query consisting of the question target are considered to be documents describing the target topic, and they are used for modeling the topic.

- Definitions E for the target X from external resources: There may be definitions for the target X in external resources such as an online dictionary. The definitions for the target, if any exist, are very useful for topic modeling, because they include only content about the target X without any noise. However, since such definitions cannot cover all possible question targets, it is necessary to use other evidence for topic modeling. The definitions from external resources are called external definitions from now on.

- Top-ranked web pages W retrieved from the WWW by the query X: The web pages are used to alleviate the sparseness of R and E. Considered as pages that describe the target topic, the top-ranked web pages retrieved by the target are used for topic modeling.

The topic language model is estimated by linear interpolation of the above evidence as follows:

    P_uni(w_{1,n}|T) = Π_{i=1}^{n} ( α P(w_i|R) + β P(w_i|E) + γ P(w_i|W) )    (7)

where P(w_i|R), P(w_i|E), and P(w_i|W) are the probabilities that word w_i is generated from the top-ranked documents R, the external definitions E, and the web pages W, respectively. The α, β, and γ are interpolation parameters, set empirically so that α + β + γ = 1. Each probability is estimated by Dirichlet smoothing [5], which is known to be useful in language-model-based information retrieval [4]:

    P(w_i|R) = ( C_R(w_i) + μ P(w_i) ) / ( Σ_j C_R(w_j) + μ )
    P(w_i|E) = ( C_E(w_i) + μ P(w_i) ) / ( Σ_j C_E(w_j) + μ )
    P(w_i|W) = ( C_W(w_i) + μ P(w_i) ) / ( Σ_j C_W(w_j) + μ )

where C_R(w_i), C_E(w_i), and C_W(w_i) are the occurrence counts of w_i in R, E, and W, respectively. P(w_i) is the prior probability of w_i calculated in the general language model, and μ is a parameter for Dirichlet smoothing. According to Zhai & Lafferty [4], the value of μ for high performance in information retrieval is between 500 and 10,000, and the performance is robust with μ around 2,000.
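To make the estimation in Section 2.3.2 concrete, the following is a minimal Python sketch of the interpolated, Dirichlet-smoothed topic language model. The function names, the token-list representation of each evidence source, and the default weights (taken from one of the settings explored in Section 4) are illustrative assumptions, not part of the original system.

```python
from collections import Counter

def background_lm(collection_tokens):
    """General language model: maximum-likelihood estimate over the whole collection."""
    counts = Counter(collection_tokens)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def dirichlet_lm(tokens, p_bg, mu=2000.0):
    """Dirichlet-smoothed unigram model: P(w) = (c(w) + mu * P_bg(w)) / (|tokens| + mu)."""
    counts = Counter(tokens)
    total = len(tokens)
    return lambda w: (counts[w] + mu * p_bg(w)) / (total + mu)

def topic_lm(r_tokens, e_tokens, w_tokens, p_bg,
             alpha=0.25, beta=0.60, gamma=0.15):
    """Interpolated topic language model over R (retrieved documents),
    E (external definitions), and W (web pages), as in equation (7).
    The weights are assumed to sum to 1."""
    p_r = dirichlet_lm(r_tokens, p_bg)
    p_e = dirichlet_lm(e_tokens, p_bg)
    p_w = dirichlet_lm(w_tokens, p_bg)
    return lambda w: alpha * p_r(w) + beta * p_e(w) + gamma * p_w(w)
```

The definition language model of Section 2.3.3 can be built in exactly the same way, interpolating a domain-specific definition corpus with the whole definition corpus.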
2.3.3 Definition Language Model
The probability P(w_{1,n}|D) of the definition language model is rewritten by using the chain rule and the independence assumption of word occurrence as follows:

    P_uni(w_{1,n}|D) = Π_{i=1}^{n} P(w_i|D)

A definition corpus is necessary for definition modeling, and the definitions can be collected from online dictionaries or encyclopedias. We constructed the definition corpus by collecting the definitions of arbitrary definition targets from online resources, and estimated the probability using the definition corpus. The word probability distribution can differ depending on the domain of the definition target. For example, "president", "scientist", "born", and "died" will frequently occur in definitions of a person, whereas "established", "member", "headquarters", and "branch" will frequently occur in definitions of an organization. Moreover, there may be some words that reflect the definition style itself regardless of the definition domain. Therefore, the definition language model is estimated by linear interpolation of the definitions for the domain of the question target and the definitions for all domains:

    P_uni(w_{1,n}|D) = Π_{i=1}^{n} ( λ P(w_i|D_tX) + (1 − λ) P(w_i|D_all) )    (8)

where P(w_i|D_tX) is the probability that word w_i is generated from the part of the definition corpus whose definition-target domain is equal to the domain tX of the question target, and P(w_i|D_all) is the probability that w_i is generated from the definition corpus for all domains. λ is an interpolation parameter whose value is determined empirically. We used three domains in this paper: person, organization, and term (the targets in TREC 2004 were classified into these three domains). Each probability is estimated by Dirichlet smoothing as follows:

    P(w_i|D_tX) = ( C_DtX(w_i) + μ P(w_i) ) / ( Σ_j C_DtX(w_j) + μ )
    P(w_i|D_all) = ( C_Dall(w_i) + μ P(w_i) ) / ( Σ_j C_Dall(w_j) + μ )

where C_DtX(w_i) is the occurrence count of w_i in the definition corpus whose domain is tX, and C_Dall(w_i) is that in the whole definition corpus. P(w_i) is the prior probability of w_i calculated in the general language model, and μ is a parameter for Dirichlet smoothing.

2.4 Discussion
Our proposed probabilistic model handles definitional QA in terms of topic and definition. As definitional QA is separated into two viewpoints for finding the answer, topic relevance and definition representation, the model has the advantage that each probability is easy to estimate. Moreover, the model can be easily extended to other descriptive QA. If the question is "Why · · · ?", QA for that question can be modeled in terms of topic and reason: given a binary random variable H which is 1 if a sentence represents the reason and 0 otherwise, the answer could be selected from the top-ranked sentences according to the joint probability P(T, H|S). The proposed model transforms definitional QA into the task of estimating three language models: the topic language model, the definition language model, and the general language model. The model is promising in the sense that the various language modeling techniques studied in speech recognition and natural language processing can be applied to it. As shown in equation 5, the general language model is used for normalizing the probability values; as a consequence, the proposed model prefers sentences which are more likely to occur in the target topic and the definition than in general text. During the probability estimation of each language model, the model naturally serves as a framework for combining various evidence systematically.
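Putting the three language models together, the sentence score of equation 6 can be computed in log space. A minimal sketch follows, assuming topic_model, definition_model, and general_model are word-probability functions such as those sketched after Section 2.3.2; the small eps guard against zero probabilities is an implementation convenience, not part of the paper's formulation.

```python
import math

def sentence_score(words, topic_model, definition_model, general_model, eps=1e-12):
    """Log-space form of equation (6):
    sum_i [ log P(w_i|T) + log P(w_i|D) - 2 * log P(w_i) ]."""
    score = 0.0
    for w in words:
        score += math.log(topic_model(w) + eps)       # topic language model
        score += math.log(definition_model(w) + eps)  # definition language model
        score -= 2.0 * math.log(general_model(w) + eps)  # general language model (normalization)
    return score
```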
3. DEFINITIONAL QUESTION ANSWERING SYSTEM BASED ON THE PROBABILISTIC MODEL
In this section, we explain the definitional QA system based on the probabilistic model described in the previous section. An overview of the system is shown in Figure 1.

[Figure 1: Architecture of the definitional question answering system — question analysis, document retrieval over the AQUAINT collection, answer candidate extraction using syntactic definition patterns, answer candidate ranking using external definitions, a web search engine, the definition corpus, and WordNet, and answer selection.]

3.1 Question Analysis
Given a definitional question, the question target is extracted and its type is identified. Since a definitional question is simple, it is easy to extract the question target. From a question such as "Who is Andrew Carnegie?", for example, "Andrew Carnegie" is extracted by using simple rules. The type of the target is then identified by a named entity tagger, BBN IdentiFinder [2], and it is used for estimating the probabilities of the definition language model. We classify the target into three types: person, organization, and term. If a target is not classified as a person or an organization by the named entity tagger, it is classified as a term.

[Figure 2: Question series for the target "the band Nirvana" (TREC 2004)]
  11.1 Who is the lead singer/musician in Nirvana?
  11.2 Who are the band members?
  11.3 When was the band formed?
  11.4 What is their biggest hit?
  11.5 What are their albums?
  11.6 What style of music do they play?
  11.7 Other

3.2 Document Retrieval
Since calculating the probabilities for all documents costs too much, documents relevant to the question target are retrieved, and answer candidates are extracted from the retrieved documents. The document retrieval is carried out by using the BM25 scoring function of OKAPI [15]. The query consists of the words of the question target.

3.3 Answer Candidate Extraction
All sentences in the retrieved passages are usually used as answer candidates. However, a sentence may be so long that it is likely to contain information which is not related to the question target. Thus, we try to extract target-related parts of sentences using the syntactic structure of the sentences. If such parts are extracted, they are used as answer candidates; otherwise, the whole sentences are used as the candidates. We extract noun and verb phrases from the sentences using the syntactic definition patterns [11].
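The candidate extraction step relies on syntactic definition patterns [11], which are not reproduced in this paper. As a rough illustration of the fall-back behavior described above, the sketch below uses two hypothetical surface patterns (appositive and copular) and returns the whole sentence when nothing matches; the real system operates on parse structures rather than regular expressions.

```python
import re

def extract_candidate(sentence, target):
    """Return a target-related phrase if a simple surface pattern matches,
    otherwise fall back to the whole sentence (illustrative patterns only)."""
    patterns = [
        rf"{re.escape(target)}\s*,\s*([^,]+),",                    # appositive: "X, <phrase>, ..."
        rf"{re.escape(target)}\s+(?:is|was|are|were)\s+([^.;]+)",  # copular: "X is <phrase>"
    ]
    for pattern in patterns:
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return sentence

# e.g. extract_candidate(
#     "NASA is the agency responsible for the public space program of the USA.", "NASA")
# -> "the agency responsible for the public space program of the USA"
```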
3.4 Answer Candidate Ranking
The answer candidates are ranked by using the proposed probabilistic model described in the previous section. In order to keep the probability for each candidate from becoming too small and to calculate it efficiently, we take the logarithm as follows:

    LDS(a) = Σ_{i=1}^{n} [ log( α P(w_i|R) + β P(w_i|E) + γ P(w_i|W) )
                         + log( λ P(w_i|D_tX) + (1 − λ) P(w_i|D_all) )
                         − 2 log P(w_i) ]

3.5 Answer Selection
The answer is selected from the ranked candidates, in order. The length of the final answer A is controlled by the score threshold θ_sel as follows:

    A = {a_j | LDS(a_j) > θ_sel}    (9)

If a target length for the answer is set, the answer is truncated to that length. Redundant candidates are skipped during the answer selection process. The redundancy between two candidates a_i and a_j is calculated by word overlap as follows:

    Overlap(a_i | a_j) = |a_i ∩ a_j| / |a_i|
    Overlap(a_i, a_j) = max( Overlap(a_i | a_j), Overlap(a_j | a_i) )

where |a_i ∩ a_j| is the number of common content words in the two candidates, and |a_i| is the number of content words in the candidate a_i. The word overlap between an answer candidate and the selected answer is calculated as follows:

    Overlap(a_i, A) = max_{a_j ∈ A} Overlap(a_i, a_j)

If the word overlap exceeds the upper threshold (i.e., 0.8), the candidate is considered redundant. If the word overlap lies between the upper threshold and the lower threshold (i.e., 0.5), the semantic class of the candidate is checked: if any already selected candidate shares a WordNet synset with it, the candidate is also regarded as redundant.

4. EXPERIMENTAL RESULTS

4.1 Experiments Setup
We experimented with 50 TREC 2003 topics and 64 TREC 2004 topics, and we searched for the answers in the AQUAINT corpus used for the TREC Question Answering track evaluation. The TREC answer set for the definitional QA task consists of several definition nuggets for each target, and each nugget is a short string. In TREC 2004, as shown in Figure 2, a topic consists of several factoid/list questions and one definitional question (called an "other" question in TREC 2004) given at the end. The questions are associated with a question target and are assumed to be processed in order [17]. Thus, the gold standard answer for the definitional question about a target does not include the answers to the preceding factoid/list questions about the same target. However, the answers to those questions can be considered part of the answer to the definitional question when evaluating definitional QA systems alone. Therefore, we expanded the TREC 2004 gold standard answer by adding the answer nuggets for the factoid questions of each topic and used the expanded answer to evaluate on the TREC 2004 topics. (When we compare our systems with other TREC participant systems, we use the original gold standard answer.) We skimmed the documents containing the answer string for each factoid question and composed a short phrase manually. For example, the nugget "formed in 1989", extracted from the document containing the answer string "1989" for the factoid question "When was the band formed?", is added to the answer for the definitional question about "the band Nirvana".

The evaluation of systems involves matching up the answer nuggets and the system output. Because manual evaluation such as the TREC evaluation is very costly, we evaluated our system using the automatic measure POURPRE [9]. POURPRE estimates the TREC metrics, nugget recall, precision, and F-measure, using term co-occurrences between the answer nuggets and the system output. POURPRE correlates better with the TREC official scores than another automatic measure, ROUGE [12], does; there is no rank swap between official scores and POURPRE scores except swaps attributed to error inherent in the evaluation process [9]. The F-measure is used as the official measure in the TREC evaluation, and β is set to three, favoring recall.

For topic modeling, we collected external definitions from various online sites at query time: Acronym Finder, Biography.com, Columbia Encyclopedia, Wikipedia, FOLDOC, The American Heritage Dictionary of the English Language, Online Medical Dictionary, and Google Glossary. We also used ten web pages retrieved by the Google search engine and five local documents retrieved from the AQUAINT corpus for topic modeling. For definition modeling, we constructed the definition corpus from the above sites according to the target type: 14,904 person, 994 organization, and 3,639 term entries. We processed the top 200 retrieved documents in all experiments.

4.2 Topic Modeling
The topic modeling experiments are carried out to find out which information is useful for estimating the topic language model. Table 1 and Table 2 show the performance of our definitional QA system under various parameter settings for the TREC 2003 and TREC 2004 questions, respectively. The α, β, and γ are the weights of the top-ranked documents R, the external definitions E, and the top-ranked web pages W, respectively, in equation 7 for topic modeling. λ and μ are set to 0.6 and 2,000, respectively. In order to evaluate the ranking performance, the score threshold θ_sel is not applied, and the target length is set to 2,000 bytes (measured by the number of non-whitespace characters). The answer can be shorter than the target length because of insufficient candidates.

Table 1: Performance according to topic modeling: TREC 2003
  α     β     γ     Rec     Prec    F(β=3)
  1     0     0     0.4248  0.1881  0.3469
  0     1     0     0.4540  0.1990  0.3718
  0     0     1     0.4192  0.1881  0.3427
  0     0.5   0.5   0.4360  0.1906  0.3556
  0.5   0     0.5   0.4316  0.1931  0.3533
  0.5   0.5   0     0.4497  0.1974  0.3680
  0.10  0.60  0.30  0.4306  0.1885  0.3510
  0.15  0.60  0.25  0.4353  0.1905  0.3551
  0.20  0.60  0.20  0.4375  0.1914  0.3570
  0.25  0.60  0.15  0.4379  0.1913  0.3572
  0.30  0.60  0.10  0.4398  0.1923  0.3590

Table 2: Performance according to topic modeling: TREC 2004
  α     β     γ     Rec     Prec    F(β=3)
  1     0     0     0.4023  0.2478  0.3753
  0     1     0     0.4342  0.2714  0.4062
  0     0     1     0.3980  0.2465  0.3718
  0     0.5   0.5   0.4451  0.2775  0.4161
  0.5   0     0.5   0.3986  0.2454  0.3717
  0.5   0.5   0     0.4341  0.2704  0.4057
  0.10  0.60  0.30  0.4460  0.2776  0.4168
  0.15  0.60  0.25  0.4447  0.2767  0.4156
  0.20  0.60  0.20  0.4473  0.2776  0.4177
  0.25  0.60  0.15  0.4511  0.2790  0.4211
  0.30  0.60  0.10  0.4491  0.2784  0.4194

As shown in Table 1, the system using only the external definitions outperforms the other systems on the TREC 2003 questions, probably because external definitions exist for 92% (46/50) of the TREC 2003 questions. On the other hand, external definitions exist for 86% (55/64) of the TREC 2004 questions, and the system combining all the evidence, as shown in Table 2, outperforms the others. The experimental results for topic modeling can be summarized as follows:

- The target topic is best represented by the external definitions. As the external definitions provide core information about the question target without noise, they increase system performance in terms of both recall and precision.
- Top-ranked documents and web pages complement the insufficient coverage of the external definitions.
- The confidence placed in each kind of information is determined by its degree of noise. The experimental results show that the external definitions are the most reliable, followed by the top-ranked documents and the web pages, in that order. Since the external definitions have almost no noise and news articles are generally less noisy than web pages, we conjecture that a less noisy source supplies more reliable information for topic modeling.
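The F(β=3) column in the tables combines nugget recall and precision with recall weighted heavily. A minimal sketch of that combination is shown below; note that POURPRE computes per-question recall and precision from term co-occurrence and the tables report averages over questions, so this formula alone will not reproduce the table values exactly.

```python
def nugget_f(recall, precision, beta=3.0):
    """Nugget F-measure with recall weighted by beta (beta = 3 in the TREC definitional QA evaluation)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)

# e.g. nugget_f(0.45, 0.28) is roughly 0.42
```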
4.3 Definition Modeling
The definition modeling experiments are conducted to examine how much the question target type affects the estimation of the definition language model. Table 3 and Table 4 show the performance under various parameter settings for the TREC 2003 and TREC 2004 questions, respectively. λ is the degree to which the target type affects the definition modeling; for a question such as "Who is Andrew Carnegie?", λ is the weight of the person-type definition corpus.

Table 3: Performance according to definition modeling: TREC 2003
  λ     Rec     Prec    F(β=3)
  0.0   0.4285  0.1865  0.3487
  0.1   0.4321  0.1887  0.3522
  0.2   0.4350  0.1897  0.3547
  0.3   0.4379  0.1914  0.3573
  0.4   0.4379  0.1914  0.3573
  0.5   0.4398  0.1925  0.3590
  0.6   0.4398  0.1923  0.3590
  0.7   0.4375  0.1915  0.3570
  0.8   0.4342  0.1904  0.3543
  0.9   0.4324  0.1892  0.3525
  1.0   0.4506  0.1983  0.3691

Table 4: Performance according to definition modeling: TREC 2004
  λ     Rec     Prec    F(β=3)
  0.0   0.4453  0.2759  0.4158
  0.1   0.4453  0.2762  0.4158
  0.2   0.4453  0.2761  0.4158
  0.3   0.4492  0.2777  0.4192
  0.4   0.4494  0.2777  0.4193
  0.5   0.4479  0.2776  0.4183
  0.6   0.4491  0.2784  0.4194
  0.7   0.4486  0.2780  0.4188
  0.8   0.4460  0.2769  0.4165
  0.9   0.4461  0.2768  0.4166
  1.0   0.4443  0.2761  0.4151

As shown in the tables, the systems reflecting the target type outperform the system which uses the definition corpus as a whole without considering the target type (λ = 0). The system considering the target type heavily (λ = 1.0 for TREC 2003 and λ = 0.6 for TREC 2004) performs best. In order to analyze why the question target type has such a great effect on system performance, we compared the term probability distributions of the definition corpora using the Jensen-Shannon divergence (JS divergence) [13]. The value of the JS divergence lies between zero and two, and a value closer to zero means that the two input distributions are more similar to each other. Table 5 shows the JS divergences between various distributions, where AQ is the AQUAINT corpus as a whole, and AQ_nyt and AQ_apw are the NYT and AP parts of the AQUAINT corpus, respectively. D_all is the definition corpus for all domains, and D_person, D_org, and D_term are the person, organization, and term parts of the definition corpus, respectively.

Table 5: Jensen-Shannon divergences between term probability distributions
            AQ      AQ_nyt  AQ_apw  D_all   D_person  D_org   D_term
  AQ        0.0000  0.0666  0.0707  0.3950  0.5670    0.3887  0.4273
  AQ_nyt    0.0666  0.0000  0.1622  0.4039  0.5734    0.4114  0.4389
  AQ_apw    0.0707  0.1622  0.0000  0.4487  0.6037    0.4050  0.4987
  D_all     0.3950  0.4039  0.4487  0.0000  0.1307    0.2933  0.1972
  D_person  0.5670  0.5734  0.6037  0.1307  0.0000    0.5003  0.5528
  D_org     0.3887  0.4114  0.4050  0.2933  0.5003    0.0000  0.4028
  D_term    0.4273  0.4389  0.4987  0.1972  0.5528    0.4028  0.0000

The table shows that the term distribution of news text is very different from that of definition text: the divergence (0.1622) between AQ_nyt and AQ_apw is small, but the divergence (0.4039) between AQ_nyt and D_all is much larger. On the other hand, the divergence between definition corpora of different domains is very large, although they equivalently consist of definition text. For example, the divergence between D_org and D_person is 0.5003, whereas the divergence between D_org and AQ_apw is 0.4050. The result implies that the term distribution of the definition corpus differs greatly between target domains. This large difference in term distribution explains why the system that heavily considers the definition type performs so well.
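For reference, the following is a minimal sketch of the JS divergence as used above: the variant that sums the two KL terms without a 1/2 factor and uses base-2 logarithms, so values fall in [0, 2]. The dictionary-based term distributions are an assumed representation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence: KL(p||m) + KL(q||m) with m = (p+q)/2 and base-2 logs.
    p and q are dicts mapping term -> probability (each summing to 1)."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL divergence of a from the mixture m; terms with zero probability contribute nothing
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0.0)

    return kl(p) + kl(q)
```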
4.4 Comparison with Other Systems
We compared our proposed system with the previous TREC participant systems [16, 17]. For these experiments, the raw run files of each system are offered by NIST (http://trec.nist.gov/). Table 6 and Table 7 show the POURPRE evaluation results of the top ten systems and of our proposed systems, denoted proposed(·, ·), whose target length and score threshold are set to various values. For example, proposed(1000,10) is the system whose target length and score threshold are set to 1,000 bytes and 10, respectively. Because the responses of the TREC participant systems were generated under the condition that the answers for definitional questions do not have to include the answers for other preceding questions, we evaluated the systems with the original TREC 2004 gold standard answer. Our systems may be slightly underestimated for the TREC 2004 questions because ours do not consider the other types of questions. The tables show that our system is comparable to the state-of-the-art definitional QA systems.

Table 6: Comparison with top systems: TREC 2003
  System                        Rec     Prec    F(β=3)
  Top 10 systems at TREC 2003   0.3979  0.3513  0.3644
                                0.4229  0.2009  0.3531
                                0.3939  0.2062  0.3402
                                0.3314  0.4961  0.3348
                                0.3790  0.2658  0.3321
                                0.3043  0.5124  0.3047
                                0.3490  0.1836  0.2955
                                0.3155  0.2376  0.2442
                                0.2027  0.5220  0.2126
                                0.1818  0.2726  0.1744
  proposed(1000,10)             0.3604  0.3076  0.3332
  proposed(1500,0)              0.4041  0.2290  0.3493
  proposed(2000,0)              0.4331  0.2026  0.3563

Table 7: Comparison with top systems: TREC 2004
  System                        Rec     Prec    F(β=3)
  Top 10 systems at TREC 2004   0.3468  0.1920  0.3139
                                0.3412  0.1920  0.3107
                                0.2950  0.2451  0.2803
                                0.3053  0.1682  0.2766
                                0.3174  0.1894  0.2676
                                0.3301  0.1704  0.2657
                                0.2776  0.2533  0.2603
                                0.2543  0.3876  0.2564
                                0.2418  0.4079  0.2508
                                0.3459  0.0639  0.2297
  proposed(1000,10)             0.2728  0.2764  0.2699
  proposed(1500,0)              0.3142  0.1947  0.2921
  proposed(2000,0)              0.3334  0.1583  0.2952

5. CONCLUSIONS
We proposed a probabilistic model for definitional QA, analyzing the problem into two main components, topic and definition. With the separation of the topic model and the definition model, a definitional QA system can estimate each model effectively. The experimental results show that the external definitions, which have almost no noise, are the most valuable information for topic modeling, and that the top-ranked documents and web pages complement the insufficient coverage of the external definitions. The definition corpus is used for estimating the term probabilities of the definition language model. Because of the large difference in term distribution between the definition domains, it is reasonable to estimate the probability using a definition corpus dynamically selected on the basis of the question target type. The proposed model can be easily extended to other descriptive QA by replacing the definition model with a new model customized to it. Moreover, as the task of QA is transformed into that of estimating three language models, various techniques related to language modeling can be applied to the model. For future work, we will estimate the probabilities of the language models using more context. Furthermore, we will extend our model to other question types and combine the models for each question type into a general model for descriptive QA.

6. REFERENCES
[1] Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. Bridging the lexical chasm: Statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2000), pages 192-199, 2000.
[2] D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211-231, 1999.
[3] S. Blair-Goldensohn, K. R. McKeown, and A. H. Schlaikjer. A hybrid approach for QA track definitional questions. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 185-192, 2003.
[4] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179-214, 2004.
[5] David J. C. MacKay and Linda C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1-19, 1994.
[6] A. Echihabi, U. Hermjakob, E. Hovy, D. Marcu, E. Melz, and D. Ravichandran. Multiple-engine question answering in TextMap. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 772-781, 2003.
[7] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams, and J. Bensley. Answer mining by combining extraction techniques with abductive reasoning. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 375-382, 2003.
[8] Horacio Saggion and Robert Gaizauskas. Mining on-line sources for definition knowledge. In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference (FLAIRS-2004), 2004.
[9] Jimmy Lin and Dina Demner-Fushman. Automatically evaluating answers to definition questions. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP-2005), 2005.
[10] John Prager, Jennifer Chu-Carroll, Krzysztof Czuba, Christopher Welty, Abraham Ittycheriah, and Ruchi Mahindru. IBM's PIQUANT in TREC 2003. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 283-292, 2003.
[11] Kyoung-Soo Han, Young-In Song, Sang-Bum Kim, and Hae-Chang Rim. Phrase-based definitional question answering using definition terminology. Lecture Notes in Computer Science, 3689:246-259, 2005.
[12] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, 2004.
[13] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.
[14] Radu Soricut and Eric Brill. Automatic question answering: Beyond the factoid. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL-2004), pages 57-64, 2004.
[15] S. E. Robertson, S. Walker, and M. M. Hancock-Beaulieu. Large test collection experiments on an operational, interactive system: Okapi at TREC. Information Processing & Management, 31(3):345-360, 1995.
[16] E. M. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 54-68, 2003.
[17] E. M. Voorhees. Overview of the TREC 2004 question answering track. In Proceedings of the 13th Text Retrieval Conference (TREC-2004), 2004.
[18] Wesley Hildebrandt, Boris Katz, and Jimmy Lin. Answering definition questions with multiple knowledge sources. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL-2004), pages 49-56, 2004.
[19] J. Xu, A. Licuanan, and R. Weischedel. TREC 2003 QA at BBN: Answering definitional questions. In Proceedings of the 12th Text Retrieval Conference (TREC-2003), pages 98-106, 2003.