Answering Complex Questions with Random Walk Models

Sanda Harabagiu, Finley Lacatusu and Andrew Hickl
Language Computer Corporation
1701 N. Collins Blvd., Suite 2000, Richardson, TX 75080
{sanda, finley, andy}@languagecomputer.com

ABSTRACT
We present a novel framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that operates on a Markov chain, by following a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question and subquestions derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art Question Answering (Q/A) system in order to retrieve a set of passages that can later be merged into a comprehensive answer by a Multi-Document Summarization (MDS) system. In our evaluations, we show that access to the decompositions generated using this method can significantly enhance the relevance and comprehensiveness of summary-length answers to complex questions.

Categories and Subject Descriptors
H.3.m [INFORMATION STORAGE AND RETRIEVAL]: Miscellaneous; I.2.7 [ARTIFICIAL INTELLIGENCE]: Natural Language Processing

General Terms
Algorithms, Performance, Experimentation

Keywords
Question Answering, Summarization

1. INTRODUCTION
Complex questions cannot be answered using the same techniques that apply to "factoid" questions. Complex questions refer to relations between entities or events; they refer to complex processes and model scenarios that involve deep knowledge of the topic under investigation. For example, a question like Q0: "What are the key activities in the research and development phase of creating new drugs?" looks for information on two distinct phases of creating drugs. Typically, relevant information for these kinds of questions can be found in multiple documents and needs to be fused together into a final answer. In the Document Understanding Conferences (DUC), the answer to a complex question like Q0 is considered to be a multi-sentence, multi-document summary (MDS) that meets the information need of the question.

We introduce a new paradigm for processing complex questions that relies on a combination of (a) question decompositions (of the complex question); (b) factoid question answering (Q/A) techniques (to process decomposed questions); and (c) multi-document summarization techniques (to fuse together the answers provided for each decomposed question). Central to this process is a question decomposition model that enables the selection of the textual information aggregated in the final answer. We present a novel question decomposition procedure that operates on a Markov chain model inspired by the Markov chains used for expanding language models introduced in [9].
We propose that question decomposition depends on the successive recognition (and exploitation) of the relations that exist between words and concepts extracted from topic-relevant sentences. (For the purposes of this paper, we define a relation as any semantic property that can exist between two or more entities or events in texts.) For example, if a topic relation r1 between "develop" and "drugs" is recognized in question Q0, we assume that sentences containing this particular relation will contain relevant information that can be used in the decomposition of Q0. Furthermore, we expect that sentences containing topic-relevant relations will also contain other relevant relations that should be leveraged in question decomposition. For example, if r1 is identified in the sentence s1, "The challenge for Glaxo was to develop a drug that was pleasant to swallow.", we expect that a new relation r2 between the concept COMPANY ("Glaxo") and "develop" should be extracted and used to identify still other sentences that could potentially provide relevant information. As new relations are discovered, we expect that sentences containing the most relevant relations (or combinations of relations) can be used to generate questions that represent possible decompositions of the original complex question. For example, given r1 and r2 in s1, a question like "What companies develop new drugs?" can be created, which could be used to obtain a set of answers representing a partial response to Q0. Relevant answers to each newly decomposed question can be used to discover more relevant relations that, in turn, prompt still more question decompositions. This process ends either when no new relations are discovered or when the random walk stabilizes within a threshold.

We evaluate question decompositions in three ways. First, we compare them against decompositions produced by humans. Second, we conduct several evaluations of the quality of the MDS answers they enable. Third, we use every sentence from the MDS answer and generate questions with the same procedure employed when creating question decompositions from relevant sentences. The questions that have answers in the summaries are evaluated against questions generated by human linguists. They are also used for measuring the similarity to the decomposed questions. Our studies indicate that these comparisons correlate with the relevance of the answers. We claim that this is an important finding, since current MDS evaluation methods typically rely on (a) human-produced answers or (b) human judgments. The automatic scoring of MDS answers based on comparisons of decomposed questions provides a framework in which researchers can test which Q/A or MDS techniques operate best for finding answers. The remainder of the paper is organized as follows.
Section 2 presents the framework we have designed for processing complex questions. Section 3 details the question decomposition procedure. Section 4 describes the random walk models employed for decomposing questions. Section 5 details the evaluation results, while Section 6 summarizes our conclusions.

2. PROCESSING COMPLEX QUESTIONS
In this section, we outline three methods for producing answers to complex questions based on the output of a question decomposition system. By decomposing a complex question into a set of simpler subquestions that each represent a different dimension of the complex question's information need, we expect to be able to identify answers that are both informative and responsive. In this paper, we introduce a new technique for question decomposition that uses a random walk in order to generate possible decompositions of a complex question. Figure 1 illustrates the system described in this paper.

[Figure 1: The Architecture of our Framework for Processing Complex Questions.]
Figure 1 includes two types of question decomposition modules: a syntactic question decomposition module and a random walk-based question decomposition module. With syntactic question decomposition, overtly mentioned subquestions are extracted from a complex question by separating conjoined phrases and recognizing embedded questions. While syntactic decomposition is an important part of any question decomposition algorithm, we will not discuss techniques for this type of decomposition in this paper; for more information on syntactic decomposition, see [8].

After complex questions are decomposed syntactically, as illustrated in Figure 1, keywords are extracted from each sub-question and are expanded with keyword alternations. The keywords are expanded by (1) identifying the semantic class to which they belong, and (2) using other terms from the lexicons associated with such semantic classes. To identify the semantic class, the keyword is matched against the lexicon of the class. The keyword alternations are selected from the 20 highest-scored words in the lexicon. The semantic classes are acquired off-line with a co-training method reported in [18].

research: trial, effort, step, study, work, activity, area, business, cause, field, function, issue, program, project, sector, service, site, education, information, science
drug: amphetamine, cocaine, ecstasy, epo, heroin, lsd, marijuana, medication, morphine, opium, measure, prozac, ritalin, steroid, treatment, viagra, alcohol, cost, disease, issue
Figure 2: Keyword Alternations.

Figure 2 illustrates the keyword alternations resulting for the keywords "research" and "drug" that were extracted from the sub-question "What are the key activities in the research phase of creating new drugs?". Additionally, we use two different models of topic signatures to identify the most representative terms and relations for the topic referred to by the complex question and evidenced by the document collection. The first topic signature (TS1) we have implemented was reported in [10]. TS1 is defined by a set of terms ti, where each term is highly correlated with the topic with an associated weight wi: TS1 = {topic, (t1, w1), (t2, w2), ..., (tn, wn)}. The selection of the terms for TS1, as well as the assignment of the association weights, is determined by the use of the likelihood ratio. The second topic signature (TS2) was introduced in [3]. It takes into account the fact that topics are not characterized only by terms; there are also relations between topic concepts that need to be identified. If only nouns and verbs from TS1 are selected as topic concepts, the topic signature TS2 is defined as TS2 = {topic, (r1, w1), (r2, w2), ..., (rn, wn)}, where ri is a binary relation between two topic concepts. The procedure for generating TS2 was detailed in [3]; it identifies two forms of relations: (a) syntax-based relations, and (b) salience-based context relations. The arguments of these relations may be (1) nouns or nominalizations; (2) named entity types that a Named Entity Recognizer (NER) identifies; and (3) verbs.

When topic signatures are available, each sentence from the document collection receives a score based on (a) the presence of a term from TS1; (b) the presence of a relation from TS2; and (c) the presence of any of the keywords extracted from the sub-question or their alternations. The sentence scores determine a ranking of the sentences from the collection for each sub-question. Finally, the answer is produced by selecting, for each decomposition, only the corresponding highest-ranked sentences. Redundancy is eliminated by checking that each newly added sentence does not contain any predicate-argument relation that was already present in a previously selected sentence. Predicate-argument relations are discovered by processing sentences with a semantic parser trained on PropBank [16]. Additionally, each predicate and argument is mapped into its WordNet synonyms to enable paraphrase identification. In this way, Answer Summary 1 from Figure 1 is produced.
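The following sketch illustrates the kind of sentence scoring and redundancy filtering described above for producing Answer Summary 1. The weights, the representation of relations as string pairs, and the exact-match redundancy test are illustrative assumptions; the actual system uses likelihood-ratio weights, a PropBank-trained semantic parser, and WordNet-based paraphrase matching.

```python
# Illustrative sketch of the Answer Summary 1 pipeline: score sentences by
# topic-signature terms/relations and question keywords, then drop redundant ones.
# Predicate-argument relations are simplified to (predicate, argument) string pairs.

def score_sentence(tokens, relations, ts1, ts2, keywords):
    """tokens: set of words; relations: set of (predicate, argument) pairs in the sentence."""
    score = 0.0
    score += sum(w for t, w in ts1.items() if t in tokens)        # (a) TS1 terms
    score += sum(w for r, w in ts2.items() if r in relations)     # (b) TS2 relations
    score += sum(1.0 for k in keywords if k in tokens)            # (c) keywords / alternations
    return score

def build_summary(sentences, ts1, ts2, keywords, max_sentences=10):
    """sentences: list of (text, tokens, relations). Returns non-redundant top sentences."""
    ranked = sorted(sentences,
                    key=lambda s: score_sentence(s[1], s[2], ts1, ts2, keywords),
                    reverse=True)
    summary, seen_relations = [], set()
    for text, tokens, relations in ranked:
        if relations & seen_relations:   # redundancy: a predicate-argument relation already used
            continue
        summary.append(text)
        seen_relations |= relations
        if len(summary) == max_sentences:
            break
    return summary

if __name__ == "__main__":
    ts1 = {"drug": 2.0, "develop": 1.5}
    ts2 = {("develop", "drug"): 3.0}
    keywords = {"research", "drug", "medication"}
    sentences = [
        ("The challenge for Glaxo was to develop a drug that was pleasant to swallow.",
         {"glaxo", "develop", "drug", "swallow"}, {("develop", "drug")}),
        ("Few companies admit setting out to create me-too drugs.",
         {"companies", "create", "drug"}, {("create", "drug")}),
    ]
    print(build_summary(sentences, ts1, ts2, keywords))
```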
In addition to the method described above, complex questions can also be decomposed by another method, described in Sections 3 and 4. As a result, our framework can produce two additional answers as summaries. The second form of question decomposition discovers relations relevant to the complex question and sentences in which they are present. For each such sentence, one or multiple questions are generated, representing additional question decompositions. When these decompositions are ignored and only the sentences are considered, the topic signatures can be used to score them and to produce a second answer summary (Answer Summary 2 illustrated in Figure 1). When complex questions are decomposed using random walks, subquestions are submitted to a state-of-the-art question answering (Q/A) system (described in [5]), which returns sets of ranked relevant answers for each such decomposition. All these answers are considered separate documents, which are used to produce a multi-document summary as the third answer (Answer Summary 3 illustrated in Figure 1). The MDS system that was used has been reported in [7]. Furthermore, for each sentence in the third answer, we generate one or several questions with the same technique that is used for decomposing questions with random walks. Since questions produced from the complex question and questions produced from the answer are both available, we argue that the answer is relevant if the two sets of questions are very similar. Question comparison is performed by a battery of four question similarity measures, previously reported in [4]. In Section 5 we detail the similarity measures used in the experiments. The selection of only the most similar questions improves the quality of the answer. Instead of submitting all questions generated by the random walks, only the most similar questions are processed again by the Q/A system, thus closing a feedback loop.

Using a hill-climbing technique, if the aggregate similarity of the new set of questions derived from the new answer improves significantly, the feedback loop starts again. The aggregate similarity is also described in Section 5. The feedback loop ends either when no new improvements are obtained or when the number of loops exceeds a threshold, in our case LT = 7. With this framework, we were able to study the effects of different forms of question decomposition on the quality of the answers.

3. DECOMPOSING COMPLEX QUESTIONS
In order to process complex questions like Q0: "What are the key activities in the research and development phase of creating new drugs?", current Q/A systems need to decompose the question into a series of simpler questions that can be tackled by the factoid-based techniques that have emerged from the TREC Q/A evaluations. Table 1 illustrates some of the questions that represent decompositions of Q0 and can be generated automatically by the technique we present in this section.

Q0^1: What companies develop new drugs?
Q0^2: What diseases are new drugs being developed for?
Q0^3: How long does it take to develop a new drug?
Table 1: Examples of Question Decompositions.

The main feature of the decomposed questions is the ability to easily detect their expected answer type (EAT), which represents the semantic class to which their answers should belong. For example, the EAT of Q0^1 is ORGANIZATION, the EAT of Q0^2 is DISEASE, whereas the EAT of Q0^3 is DURATION. Our main assumption is that the question decomposition model should be based on several types of relations between words or concepts used in (a) the complex question, (b) sentences that contain relevant information for the complex question, or (c) other question decompositions that have been produced before for the same complex question.

In order to produce question decompositions, we follow four steps. In the first step we process the complex question to derive the relations that are meaningful. In the second step we generate questions based on the relations selected. In the third step we enhance the set of meaningful relations with relations discovered when generating a question decomposition, and then we select a new relation based on the latest decomposition. In the fourth step, we loop back to Step 2 as long as the probability of continuing remains above a certain threshold. The detailed operations in each step are:

STEP 1: The complex question is lexically, syntactically and semantically analyzed with the goal of identifying the relationships between words that may lead to the generation of simpler questions. Three forms of knowledge are marked up in the phases of the analysis:
1.a. (lexical) The determination of the part-of-speech of each word, generated by the Brill tagger [1].
1.b. (syntactic) A full parse of the question is generated by the probabilistic parser reported in [2]. The result of the parse renders information about the syntactic constituents of the question and about their relations. For example, for the complex question Q0, we derive the following constituents: VP1: {are}; VP2: {containing}; NP1: {the key activities}; NP2: {the research}; NP3: {development phase}; NP4: {NP2 and NP3}; NP5: {new drugs}; PP1: {NP4 of NP5}; PP2: {NP1 in PP1}. (NP stands for noun phrase, VP for verb phrase, and PP for prepositional phrase.)
1.c. (lexical) For each base NP (e.g. NP1, NP2, NP3, NP5) we determine whether the head is a nominalization of some verb, by accessing the WordNet database [12]. For example, the noun "development" is a morphological derivation of the verb "develop". The NPs whose heads are nominalizations are not considered in Step 1.d.
1.d. (lexical/semantic) The generality of the head of each NP is assessed in one of two categories: abstract or concrete. The assessment is based on a large answer type taxonomy that was developed for the TREC evaluations of Q/A systems. The taxonomy, described in [17], comprises 440 synsets from WordNet (and their hyponyms) and 150 semantic classes of names that are recognized by the Named Entity Recognizers we have available. If the head of an NP is found in the answer taxonomy, it is assigned the attribute concrete; otherwise it is labeled abstract. For the question Q0, only the head of NP5 is categorized as concrete; NP1 is labeled abstract. The question processing techniques applied for factoid Q/A identify NP1 as the constituent that indicates the EAT for the question. Since no EAT can be established for Q0, it is considered a complex question.
1.e. (syntactic) Relations between concrete NPs and other constituents are sought. The syntactic relationship from the constituent PP1 indicates a prepositional attachment relation between NP5 and NP4, which is a coordination between NP2 and NP3. The syntactic decomposition of the coordination entails two relations between NP5 and the verbs related to NP2 and NP3. The output of Step 1 for Q0 is:
RELATIONS: {R1 = [develop - new drugs]; R2 = [research - new drugs]}
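As a minimal illustration of Step 1.d, the sketch below labels NP heads as concrete or abstract by looking them up in an answer-type lexicon; the tiny dictionary is a stand-in for the 440-synset and 150-name-class taxonomy used by the real system, and the function names are our own.

```python
# Minimal sketch of Step 1.d: NP heads found in an answer-type taxonomy are labeled
# "concrete", everything else "abstract"; a question whose EAT-bearing constituent
# is abstract (like NP1 in Q0) is treated as complex.
ANSWER_TYPE_TAXONOMY = {       # stand-in for the full answer type taxonomy
    "drug": "PRODUCT",
    "company": "ORGANIZATION",
    "disease": "DISEASE",
}

def head_category(np_head):
    return "concrete" if np_head.lower() in ANSWER_TYPE_TAXONOMY else "abstract"

def expected_answer_type(eat_bearing_head):
    """Return the EAT class for a concrete head, or None if no EAT can be established."""
    return ANSWER_TYPE_TAXONOMY.get(eat_bearing_head.lower())

if __name__ == "__main__":
    print(head_category("activities"), head_category("drug"))   # abstract concrete
    print(expected_answer_type("activities"))                   # None -> Q0 is complex
```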
STEP 2: For a relation discovered at Step 1 we generate questions that involve that relation. In order to generate questions automatically, we employ a method first reported in [4]: to generate a question, we first find a sentence that constitutes an answer to it. This is done by the following sub-steps:
2.a. Query Formulation. In order to find sentences in which elements from the RELATIONS list are present, we formulated two kinds of queries: (a) queries involving the lexical arguments of the relation, e.g. ["develop" AND "drug"], as well as (b) queries that involved semantic extensions. Four forms of extensions were considered: (1) extensions based on the semantic class of names that represent the nominal category (e.g. names of drugs); (2) extensions based on verbs which are semantically related to the verb in the WordNet database (e.g. develop(v) is semantically related to create(v) and produce(v)); (3) extensions that allow the nominal to be anaphoric, and therefore replaced by a pronoun, e.g. [develop - it]; and (4) extensions that allow the nominalizations, as well as the verbal conjuncts, to be considered.

- rj shares a predicate with ri
- rj shares an argument with ri
- rj specializes the predicate of ri
- rj specializes the argument of ri
- the predicates of rj and ri can be composed
Table 2: Properties Between Relations ri and rj.

2.b. Sentence Retrieval. We built an index based on the processing of relations in the text collection. (We process the text collection and discover all syntactic and salient relations when we build the topic signature TS2 described in Section 2.) A sentence is added to the inverted list of a relation ri when it may be composed with another relation rj in the same sentence and (a) relations ri and rj meet one of the conditions listed in Table 2, or the predicates of relations ri and rj may be composed with the predicate composition procedure described as a special case in 2.f; and (b) the argument of the relation rj can be mapped into one of the EAT categories of the Q/A system. Examples of such sentences are illustrated in Table 3.

S1: The challenge for Glaxo was to develop a drug that was pleasant to swallow.
S2: The remaining 60 per cent of royalties will be paid to [...] Charing Cross and Westminster Medical School which developed it.
S3: Few companies admit setting out to create me-too drugs.
S4: Cancer Research funded research and development of the drug which was originally discovered by Aston University in Birmingham.
S5: At Bristol-Myers, which he left in 1980 to join SmithKline, Crooke helped develop an array of chemotherapy drugs for cancer patients that put Bristol at the forefront of cancer treatment.
S6: Since a typical drug takes 10 years and ¥10bn to develop, only those companies large enough to absorb the costs will be able to survive.
Table 3: Sentences retrieved for relation R1: [develop - drug].

Sentence S1 is retrieved because it contains the relation ri = [develop - drugs] and also a relation rj = [develop - Glaxo] that shares the same predicate ("develop"), and "Glaxo" is mapped into the EAT = COMPANY. Similarly, sentences S2, S3, and S4 are retrieved because they contain three different expansions of ri and new relations that are compatible with it.
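The sketch below shows one way the sentence-retrieval index of Step 2.b could be organized. It is only an assumption-laden illustration: the composability test covers just the shared-predicate and shared-argument conditions of Table 2, and the EAT lookup is a stub for the real NER and answer-type hierarchy.

```python
# Illustrative sketch of the relation index used for sentence retrieval (Step 2.b).
# A sentence joins the inverted list of relation ri when it also contains a
# composable relation rj whose argument maps to an EAT class.
from collections import defaultdict

EAT_LEXICON = {"Glaxo": "COMPANY", "Aston University": "UNIVERSITY"}  # stand-in NER/EAT mapping

def composable(ri, rj):
    pred_i, arg_i = ri
    pred_j, arg_j = rj
    return pred_i == pred_j or arg_i == arg_j   # Table 2: shared predicate or shared argument

def build_index(sentences):
    """sentences: list of (text, set of (predicate, argument) relations)."""
    index = defaultdict(list)
    for text, relations in sentences:
        for ri in relations:
            for rj in relations:
                if rj != ri and composable(ri, rj) and rj[1] in EAT_LEXICON:
                    index[ri].append((text, rj, EAT_LEXICON[rj[1]]))
    return index

if __name__ == "__main__":
    s1 = ("The challenge for Glaxo was to develop a drug that was pleasant to swallow.",
          {("develop", "drug"), ("develop", "Glaxo")})
    index = build_index([s1])
    for entry in index[("develop", "drug")]:
        print(entry)   # -> (sentence, ('develop', 'Glaxo'), 'COMPANY')
```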
2.c. Question Generation. Every sentence retrieved at 2.b contains additional relations besides those expressed by the query. Among those relations, some share arguments with the queried relations and some do not. The first group of relations may serve to point to EATs that the decomposed questions should refer to. For example, in sentence S1, the new relation [Glaxo - develop] can be generalized into [ORGANIZATION - develop], in which ORGANIZATION can be selected as the EAT of the question that will be generated. Our named entity recognizer (NER) is able to distinguish between different types of organizations, tagging "Glaxo" from sentence S1 as COMPANY and "Medical School" from S2 as UNIVERSITY. When the EAT is established, the question stem associated with it is known (e.g. "what companies"), and it substitutes the name from the sentence to generate the question Q0^1 from Table 1, in which relation R1 is fully specified with all the argument adjuncts it had in the complex question Q0. Sentence S1 generates the question Q0^1, whereas sentence S4 generates the question "What universities develop drugs?". Sentence S5, illustrated in Table 3, enables the generation of Q0^2, whereas sentence S6 is used for generating Q0^3. Starting from relation R1 in RELATIONS, three new relations are discovered: R1^1 = [[R1 = develop - drug] - COMPANY], R1^2 = [[R1 = develop - drug] - DISEASE], and R1^3 = [[R1 = develop - drug] - DURATION]. Each of these new relations enables the generation of one of the decomposed questions listed in Table 1.
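A simple sketch of the stem-substitution idea in Step 2.c follows. The stem table and the surface templates are simplified assumptions, not the generation grammar used by the actual system.

```python
# Minimal sketch of Step 2.c: once the EAT of the new relation's argument is known,
# the question stem associated with that EAT replaces the named entity, yielding a
# decomposed question built around the original relation and its adjuncts.
QUESTION_STEMS = {
    "COMPANY": "What companies",
    "UNIVERSITY": "What universities",
    "DISEASE": "What diseases",
    "DURATION": "How long",
}

def generate_question(relation, eat, argument_adjuncts=""):
    """relation: (predicate, argument) such as ('develop', 'new drugs')."""
    predicate, argument = relation
    stem = QUESTION_STEMS[eat]
    if eat == "DURATION":
        return f"{stem} does it take to {predicate} {argument_adjuncts or argument}?"
    return f"{stem} {predicate} {argument_adjuncts or argument}?"

if __name__ == "__main__":
    print(generate_question(("develop", "drug"), "COMPANY", "new drugs"))
    # -> "What companies develop new drugs?"  (cf. Q0^1 in Table 1)
    print(generate_question(("develop", "drug"), "DURATION", "a new drug"))
    # -> "How long does it take to develop a new drug?"  (cf. Q0^3)
```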
Special Cases. The properties between relations ri and rj that are used in the index cover three more cases that need to be addressed by question generation. They are:
2.d. Argument Specialization. In order to inquire about the attributes of arguments, three forms of questions are generated: (i) questions that inquire about instances of entities referred to by the argument of a relation, in which the semantic class of the argument is the EAT of the question and the question becomes a list question; (ii) questions that specialize the argument of the relation by using a modifier which becomes the EAT of the question; and (iii) questions that inquire about the characteristics of the arguments by using the question stem "what types". An example of the first form of questions is Q0^6: "What new drugs have been developed?", generated from the sentence "Zinnat is a new drug which was developed because other drugs in its class needed to be injected and were therefore of little use outside the hospital environment.". An example of the second form of questions is Q0^7: "How many medicines are launched per year?", in which the EAT is NUMBER and it modifies the argument "medicines" in the sentence "The number of medicines launched during the early 1980s averaged about 60 per year.".
2.e. Predicate Specialization. There are three ways of specializing the predicates from the relations: (i) by selecting the EAT of the question as MANNER and associating the question stem "how"; (ii) by using adjuncts of the predicates in the question to produce either a specialized MANNER EAT or a YES/NO question; and (iii) by considering that the predicates represent complex events that have structure, and thus this structure can be inquired about by using special constructs of the form "what steps are included in" or "what types of activities are included in". The first form of predicate specialization is the most productive one, and it can be generated based on the recognition of MANNER relations reported in [6]. Examples of questions generated by predicate specialization are listed in Table 4.

- How are new drugs researched?
- How are drugs manufactured?
- What types of activities are included in the development of new drugs?
Table 4: Questions Based on Predicate Specialization.

2.f. Predicate Composition. Some questions need to capture relations between predicates. Such relations may be determined by the discovery of (a) causal relations, as reported in [4]; (b) temporal relations; or (c) the fact that the predicates share an argument. Table 5 illustrates such questions and the types of relations that enabled them. In our implementation, we have used a set of cue phrases and causal verbs to detect causal relations between predicates. For the temporal relations, we relied on the temporal signals annotated in TimeBank (e.g. "before", "after", "during").

CAUSAL: How do trade restrictions affect new drug development?
TEMPORAL: How many times must a drug be tested before it can be sold?
ARGUMENT SHARING: How do companies decide which new drugs to research?
Table 5: Questions Based on Relations Between Predicates.

STEP 3: The selection of a new relation is performed after newly discovered relations are added to the RELATIONS list.
3.a. Relations that specialize arguments or predicates are not added to RELATIONS, but all the other three types of relations rj are appended. For example, the relations R1^1 = [[R1 = develop - drug] - COMPANY], R1^2 = [[R1 = develop - drug] - DISEASE], and R1^3 = [[R1 = develop - drug] - DURATION] are added.
3.b. A new relation is selected so as to maximize the probability estimate that it will lead to another question decomposition of the complex question. The probability estimations are detailed in Section 4.
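A small sketch of Step 3 is given below. The probability estimate passed to the selection function is a toy stand-in for the transition probabilities that Section 4 defines over the relation index.

```python
# Sketch of STEP 3: newly discovered relations (other than specializations) are
# appended to RELATIONS, and the next relation is chosen to maximize the estimated
# probability of producing a further decomposition. p_next is an assumed estimator.
def update_relations(relations, discovered, is_specialization):
    """Append discovered relations, skipping argument/predicate specializations (3.a)."""
    for r in discovered:
        if not is_specialization(r) and r not in relations:
            relations.append(r)
    return relations

def select_next_relation(relations, p_next):
    """Pick the relation maximizing the probability of another decomposition (3.b)."""
    return max(relations, key=p_next)

if __name__ == "__main__":
    relations = [("develop", "new drugs"), ("research", "new drugs")]
    discovered = [(("develop", "drug"), "COMPANY"), (("develop", "drug"), "DISEASE")]
    relations = update_relations(relations, discovered, is_specialization=lambda r: False)
    # Toy probability estimate: later-discovered relations get higher scores.
    toy_scores = {r: i + 1 for i, r in enumerate(relations)}
    print(select_next_relation(relations, p_next=lambda r: toy_scores[r]))
```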
STEP 4: The decision to continue or stop the process of generating question decompositions depends on our formalization of the process. We have formalized the process of generating question decompositions, which leads to the discovery of new meaningful relations, as a random walk on a bipartite graph of questions and relations. For a given relation, a sentence that contains the relation is selected. That sentence is considered to be the answer to a question decomposition, which is generated by identifying a new relation, which in turn, when selected, will lead to a new question decomposition. Thus the random walk continues with a probability α, generating a new decomposition and selecting a new relation, or it stops with a probability (1 - α). Section 4 describes the formalisms that allow us to estimate the probability that the random walk ends after k steps, corresponding to k loops of Steps 2 and 3 of this procedure.

4. MARKOV CHAINS FOR QUESTION DECOMPOSITION
In this section, we describe how we employ two different types of random walks to decompose complex questions for question answering and/or multi-document summarization applications. We begin by describing how a random walk can be used to populate a network with potential decompositions of a complex question. Later, in Section 4.2, we show how another random walk can then be used to select the set of generated decomposed questions that best represents the information need of the complex question.

The question decomposition procedure detailed in Section 3 can be cast as a Markov chain (MC). A MC over a set of states S is specified by an initial distribution p0(S) over S and a set of state transition probabilities p(S_t | S_{t-1}). In the case of question decomposition, the initial state is represented by one relation r0 selected from the list RELATIONS(time = 0), which is the set of relations generated when processing the complex question. The probability of the initial state is set to 1/n, where n = |RELATIONS(time = 0)|. After selecting a relation ri at step i, the index is consulted to find sentences where ri and other relations rj having the properties listed in Table 2 are present. If the argument of relation rj can be categorized in the EAT hierarchy as an expected answer type ej, then it can enable the generation of a question decomposition QD_{i+1} with EAT = ej. The probability that a question decomposition QD_{i+1} with EAT = ej is generated from a relation ri is given by p(QD_{i+1} | ri) = p(ej | ri). The new relation rj is placed in the RELATIONS list. If the index of ri had only one sentence and only one relation rj = ri could be found in that sentence, then ri is removed from the RELATIONS list. A new relation r_{i+1} is selected from RELATIONS based on the probability p(r_{i+1} | QD_{i+1}). Since question QD_{i+1} was generated based on the EAT discovered with the help of relation rj, which led to the EAT ej, we can evaluate p(r_{i+1} | QD_{i+1}) = p(r_{i+1} | ej). In this way we have defined the transition probabilities of the Markov chain (MC) illustrated in Figure 3. The MC alternates between selecting relations from RELATIONS and generating a new question decomposition. In this way, the decomposition process is "surfing" the set of relations meaningful for the complex question, as well as the decomposed questions that are generated based on these relations. After each step there is some chance that the question decomposition process will stop. The process continues the random walk with probability α, generating a new set of question decompositions. With probability (1 - α), the walk stops after step k (after producing the question decomposition QD_{k+1}).

r0 --p(QD1|r0)--> QD1 --p(r1|QD1)--> r1 --p(QD2|r1)--> QD2 --p(r2|QD2)--> r2 --p(QD3|r2)--> QD3 ...
Figure 3: The Markov chain alternates between relations and question decompositions.
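The walk of Figure 3 can be simulated with the short sketch below. The transition distributions are passed in as toy dictionaries; in the framework described above they would be estimated from the relation index, so the values used here are purely illustrative.

```python
# Illustrative simulation of the random walk in Figure 3: at each step a relation
# generates a question decomposition with some EAT, a new relation is selected,
# and the walk continues with probability alpha.
import random

def random_walk_decomposition(relations, p_eat_given_rel, p_rel_given_eat,
                              alpha=0.85, rng=random.Random(13)):
    """Return the list of (relation, EAT) pairs visited before the walk stops."""
    r = rng.choice(relations)                              # initial state: uniform over RELATIONS
    decompositions = []
    while True:
        eats, weights = zip(*p_eat_given_rel[r].items())
        e = rng.choices(eats, weights=weights)[0]          # generate a QD with EAT = e
        decompositions.append((r, e))
        rels, weights = zip(*p_rel_given_eat[e].items())
        r = rng.choices(rels, weights=weights)[0]          # select the next relation
        if rng.random() > alpha:                           # stop with probability 1 - alpha
            return decompositions

if __name__ == "__main__":
    relations = ["develop-drug", "research-drug"]
    p_eat_given_rel = {"develop-drug": {"COMPANY": 0.5, "DISEASE": 0.3, "DURATION": 0.2},
                       "research-drug": {"ORGANIZATION": 0.7, "MONEY": 0.3}}
    p_rel_given_eat = {e: {r: 0.5 for r in relations}
                       for e in ["COMPANY", "DISEASE", "DURATION", "ORGANIZATION", "MONEY"]}
    print(random_walk_decomposition(relations, p_eat_given_rel, p_rel_given_eat))
```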
Since our goal is to estimate the probability that the MC stops after k steps, we produce a matrix formulation of the problem which is similar to the formulation reported in [9]. This formulation is described in Section 4.1. We also want to test our hypothesis that the decomposed questions are relevant for the complex question. Since these question decompositions have been generated by relations that we have discovered in the text to be associated with relations originating in the complex question, we want to test whether our assumption that they are valid decompositions can be quantified by a measure of relatedness to the complex question. For this purpose, in Section 4.2 we define a mixture model which generates a different random walk that evaluates the relevance of the decomposed questions.

4.1 Matrix Formulation
The operation of the random walk can be cleanly described using matrix notation. Let N be the size of the index. The number N corresponds to the relations that we have discovered in the text, having the property that for every relation ri there is also a relation rj sharing with ri the properties from Table 2, with the argument of rj mapped into one of the semantic categories of the EAT classes. Let M be the number of EAT classes. We consider A to be an N x M stochastic matrix with entries a_ij = p(ri | ej), representing the probability that the relation ri will be composed in a sentence with a relation rj that has an argument of semantic type ej, which will become the EAT of the question decomposition that is generated. Similarly, a stochastic matrix B of dimensions M x N can be defined, in which the elements b_ij = p(ei | rj) represent the probability that a sentence containing the relation rj can be the answer to a question with the EAT = ei. Then, the N x N stochastic matrix C is defined as C = A x B. The probability that the MC stops after k steps is given by (1 - α) α^k [C^k]_{r,e}, where r is the last relation it discovered and e is the EAT of the last question decomposition it generated. To estimate p(ei | rj) we consider

p(ei | rj) = p(rj | ei) p(ei) / Σ_k p(rj | ek) p(ek)

where p(ei) is the prior distribution of the semantic type ei in the corpus, and p(rj | ei) is given by the maximum likelihood estimate of the relation distribution in the text collection. Let J1 be the number of instances of the relation rj composed with a relation ri in the same sentence such that the argument of ri has the semantic type ei. Then, p_mle(rj | ei) = J1 / #(instances of rj).
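A numerical sketch of this matrix formulation is given below, with made-up toy probabilities; the matrices, the value of α, and the stop-probability expression follow the definitions above under the reconstruction adopted in this section.

```python
# Numerical sketch of Section 4.1: A (N x M) holds p(r_i | e_j), B (M x N) holds
# p(e_i | r_j), C = A x B, and the probability of stopping after k steps at the
# entry indexed (r, e) is (1 - alpha) * alpha**k * C^k[r, e].
import numpy as np

alpha = 0.85
A = np.array([[0.6, 0.2],       # N = 2 relations, M = 2 EAT classes
              [0.4, 0.8]])      # columns sum to 1: p(r_i | e_j)
B = np.array([[0.7, 0.1],
              [0.3, 0.9]])      # columns sum to 1: p(e_i | r_j)
C = A @ B                       # N x N stochastic matrix

def stop_probability(k, r, e):
    """Probability that the walk stops after k steps, using the paper's C^k_{r,e} indexing."""
    Ck = np.linalg.matrix_power(C, k)
    return (1 - alpha) * alpha**k * Ck[r, e]

if __name__ == "__main__":
    print(C)
    print(stop_probability(k=3, r=0, e=1))
```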
4.2 Random Walks with Mixture Models
Recently, [15] introduced a random walk model for finding answers to complex questions. This model is based on the idea that answers can be found by scoring each sentence against a complex question and selecting only the top-ranked sentences. The sentence rank is produced by a mixture model that combines an approximation of a sentence's relevance to a question with similarity measures that can be used to select answer sentences that are not similar to one another. Using the same idea, we devised a similar mixture model for measuring the relevance of a question decomposition qd_i to the complex question cq. The relevance measure is defined as:

relevance(qd_i, cq) = d * sim_a(qd_i, cq) / Σ_{qd_j} sim_a(qd_j, cq) + (1 - d) * Σ_{qd_j} [ sim_b(qd_i, qd_j) / Σ_{qd_k} sim_b(qd_k, qd_j) ]

The similarities sim_a and sim_b are selected from the four similarity measures defined in Section 5. If k is the number of question decompositions that we consider, a stochastic matrix A of dimensions k x k is considered such that a_ij = σ * relevance(qd_i, cq). In order for matrix A to be stochastic, Σ_i a_ij = 1, thus σ = 1 / Σ_{i=1}^{k} relevance(qd_i, cq). Similarly, a stochastic matrix B of dimension k x k is defined such that b_ij = σ_j * sim_b(qd_i, qd_j), where σ_j = 1 / Σ_{i=1}^{k} sim_b(qd_i, qd_j). Next, the relevance vector R for all question decompositions is defined by R = [dA + (1 - d)B]^T x R. The square matrix E = [dA + (1 - d)B] defines a MC where each element e_ij from E specifies the transition probability from state i to state j in the corresponding Markov chain. The relevance vector R is the stationary distribution of the Markov chain. With probability d, a transition is made from the current question decomposition qd_i to new question decompositions that are similar to the complex question cq. With probability (1 - d), a transition is made to question decompositions that are similar to the last question generated, qd_i. We have used several values for d in our experiments.
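The following sketch computes a relevance vector of this kind by power iteration. The token-overlap similarity is a trivial stand-in for sim_a and sim_b, and the column-stochastic normalization is one concrete choice among the conventions left implicit above, so this is an illustration rather than the authors' implementation.

```python
# Sketch of the Section 4.2 mixture model: build E = d*A + (1 - d)*B from
# column-normalized relevance and similarity scores, and take the relevance
# vector R as its stationary distribution, found by power iteration.
import numpy as np

def overlap(q1, q2):
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    return len(t1 & t2) / max(1, len(t1 | t2))

def relevance_vector(decompositions, complex_question, d=0.85, iterations=50):
    k = len(decompositions)
    rel = np.array([overlap(q, complex_question) for q in decompositions]) + 1e-9
    A = np.tile((rel / rel.sum()).reshape(k, 1), (1, k))    # a_ij ~ relevance(qd_i, cq)
    B = np.array([[overlap(qi, qj) for qj in decompositions]
                  for qi in decompositions]) + 1e-9
    B = B / B.sum(axis=0, keepdims=True)                    # column-stochastic b_ij
    E = d * A + (1 - d) * B                                 # column-stochastic mixture matrix
    R = np.full(k, 1.0 / k)
    for _ in range(iterations):                             # power iteration to the fixed point
        R = E @ R
        R = R / R.sum()
    return R

if __name__ == "__main__":
    cq = "What are the key activities in the research and development phase of creating new drugs?"
    qds = ["What companies develop new drugs?",
           "What diseases are new drugs being developed for?",
           "How long does it take to develop a new drug?"]
    print(relevance_vector(qds, cq))
```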
5. EVALUATION RESULTS
Our experiments have targeted (1) the evaluation of the decomposed questions; (2) the evaluation of the three forms of answers produced by the framework illustrated in Figure 1; and (3) the evaluation of the impact of the decomposed questions on the quality of answer summaries.

Evaluation of Decomposed Questions. The evaluation of the decomposed questions was performed in two ways. First, the decomposed questions were evaluated against decompositions created by humans. Second, question decompositions were evaluated against questions generated from the answer summaries. The second evaluation was also compared against an evaluation involving only human-generated questions, both from the complex question and from the answer summaries.

In order to evaluate the quality of the automatic question decompositions produced by our system, we generated three different types of question decompositions for 8 complex questions that were asked as part of the DUC 2005 question-directed summarization task; the questions correspond to the topics listed in Table 6. First, we had 4 human annotators perform manual question decomposition based solely on the complex questions themselves. Annotators were asked to decompose each complex question into the set of subquestions they felt needed to be answered in order to assemble a satisfactory answer to the question. (For ease of reference, we will refer to this set of question decompositions as QD_human.) The subquestions generated by the annotators were then compiled into a "pyramid" structure similar to the ones proposed in [14]. In order to create pyramids, humans first identified subquestions that sought the same information (or were reasonable paraphrases of each other) and then assigned each unique question a score equal to the number of times it appeared in the question decompositions produced by all annotators. Second, we utilized our random walk model to generate a set of question decompositions (QD_auto1) for each complex question. Third, as shown in Figure 1, the subquestions in QD_auto1 were used to generate multi-document summaries, which were in turn used to automatically generate a third set of question decompositions (QD_auto2). As with QD_human, the subquestions generated for QD_auto1 and QD_auto2 were combined into pyramid structures by human annotators. Each of these three sets of question decompositions was then compared against a set of "gold standard" decompositions created by another team of 4 human annotators from the 4 "model summaries" prepared by NIST annotators as "gold standard" answers to the 8 complex questions. Each of the three question decompositions described above (i.e. QD_human, QD_auto1, and QD_auto2) was then scored against the corresponding "model" question decomposition pyramid using the technique outlined in [14].

Topic Description   | QD_auto1 | QD_auto2 | QD_human
Falkland Islands    | 0.2012   | 0.3202   | 0.3889
Tourist Attacks     | 0.2317   | 0.3745   | 0.5000
Drug Development    | 0.3114   | 0.5195   | 0.6744
Amazon Rainforest   | 0.2500   | 0.4091   | 0.6000
Welsh Government    | 0.2931   | 0.4873   | 0.5091
Robot Technology    | 0.2268   | 0.4421   | 0.6222
U.K. Tourism        | 0.0196   | 0.3917   | 0.4035
Czechoslovakia      | 0.2301   | 0.3116   | 0.3836
AVERAGE             | 0.2205   | 0.4070   | 0.5000
Table 6: Pyramid Coverage of Question Decompositions.

Table 6 illustrates the Pyramid coverage for QD_auto1, QD_auto2, and QD_human. It is to be noted that although QD_human captured 45% of the questions contained in the "model" pyramids, the high average Pyramid score (0.5000) suggests that human question decompositions typically included questions that corresponded to the most vital information identified by the authors of the "model" summaries. Another important observation is that the coverage and the Pyramid score of QD_auto2 are almost 80% of the same measures obtained for QD_human, whereas the Pyramid score of the question decompositions QD_auto1 is only 45% of the Pyramid score and coverage obtained for QD_human. In fact, these scores vary based on the number of feedback loops allowed for Answer Summary 3 from Figure 1. Figure 5 illustrates the average Pyramid scores obtained at each step of the feedback loop for all eight questions, both for QD_auto1 and QD_auto2. The figure shows that the Pyramid scores improve for QD_auto1; the improvement for QD_auto2 is less dramatic. This means that the comparison and selection of new question decompositions at each feedback loop determines better questions and better answers.

[Figure 5: Pyramid scores at each step of the feedback loop (L = 0 to 6), for QD-auto1, QD-auto2, and QD-human.]
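A toy version of the pyramid comparison used here is sketched below; exact string matching stands in for the human paraphrase matching described above, and the example weights are invented.

```python
# Toy sketch of pyramid scoring for question decompositions (cf. [14]): each "model"
# subquestion is weighted by the number of annotators who produced it, and a system
# decomposition is scored by the weight it captures relative to the best possible
# score for the same number of questions.
def pyramid_score(system_questions, model_pyramid):
    """model_pyramid: dict mapping a canonical subquestion to its annotator count."""
    captured = sum(model_pyramid.get(q, 0) for q in system_questions)
    ideal = sum(sorted(model_pyramid.values(), reverse=True)[:len(system_questions)])
    return captured / ideal if ideal else 0.0

if __name__ == "__main__":
    model = {"What companies develop new drugs?": 4,
             "How long does it take to develop a new drug?": 3,
             "How much does drug development cost?": 2}
    system = ["What companies develop new drugs?", "What diseases are drugs developed for?"]
    print(round(pyramid_score(system, model), 4))   # 4 captured out of an ideal 4 + 3 = 7
```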
Four different similarity metrics are responsible for the comparisons; they are listed in Figure 4. Pairs of these similarity metrics were also used for defining the relevance of question decompositions to each complex question. The aggregate similarity between a question qi in QD_auto1 and the set QD_auto2 is defined as Asim(qi, QD_auto2) = (1/4) Σ_{j=1}^{4} sim_j(qi, QD_auto2). The similarity scores play an important role in the selection of questions from QD_auto1 for the next loop. In our experiments, we observed that if we take only Similarity Metric 3 we obtain the best results.

Similarity Metric 1 weights content terms in QD_auto1 and QD_auto2 using tf-idf: w_i = w(t_i) = (1 + log(tf_i)) * log(N / df_i), where N is the number of questions in QD_auto1 and QD_auto2, df_i is the number of questions containing t_i, and tf_i is the number of times t_i appears in QD_auto1 and QD_auto2. Any question in QD_auto1 and QD_auto2 can be transformed into two vectors, v_q = (w_q1, w_q2, ..., w_qm) and v_u = (w_u1, w_u2, ..., w_un). The similarity between QD_auto1 and QD_auto2 is measured as the cosine between their corresponding vectors: cos(v_q, v_u) = (Σ_i w_qi w_ui) / ((Σ_i w_qi^2)^(1/2) * (Σ_i w_ui^2)^(1/2)).
Similarity Metric 2 is based on the percentage of terms in QD_auto1 that appear in QD_auto2. It is obtained by finding the intersection of the terms in the term vectors of the two questions.
Similarity Metric 3 utilizes semantic information from WordNet. It involves (a) finding the minimum path between WordNet concepts: given two terms t_1 and t_2, with n and m WordNet senses S_1 = {s_1, ..., s_n} and S_2 = {r_1, ..., r_m}, the semantic distance between the terms is defined by the minimum of all the possible pair-wise semantic distances between S_1 and S_2, namely delta(t_1, t_2) = min_{s_i in S_1, r_j in S_2} D(s_i, r_j), where D(s_i, r_j) is the path length between s_i and r_j; and (b) computing the semantic similarity between the vector transformations v_q and v_u from QD_auto1 and QD_auto2, respectively, defined as sem(v_q, v_u) = (I(v_q, v_u) + I(v_u, v_q)) / (|v_q| + |v_u|), where I(v_x, v_y) = Σ_{x in v_x} 1 / (1 + min_{y in v_y} delta(x, y)).
Similarity Metric 4 is based on question type similarity, using a question-type similarity matrix similar to the one introduced in [11].
Figure 4: The four similarity metrics.
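As a concrete illustration of Similarity Metric 1, the sketch below computes tf-idf weights over the two question sets and the cosine between the resulting vectors; the whitespace tokenization is a simplifying assumption.

```python
# Sketch of Similarity Metric 1: tf-idf term weights over the two question sets and
# the cosine between the resulting vectors. Tokenization is naive whitespace splitting.
import math
from collections import Counter

def tfidf_vectors(questions_a, questions_b):
    all_questions = questions_a + questions_b
    N = len(all_questions)
    df = Counter()
    for q in all_questions:
        df.update(set(q.lower().split()))
    def vector(questions):
        tf = Counter(t for q in questions for t in q.lower().split())
        return {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items()}
    return vector(questions_a), vector(questions_b)

def cosine(v_q, v_u):
    dot = sum(w * v_u.get(t, 0.0) for t, w in v_q.items())
    norm = math.sqrt(sum(w * w for w in v_q.values())) * math.sqrt(sum(w * w for w in v_u.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    qd_auto1 = ["What companies develop new drugs?"]
    qd_auto2 = ["What companies develop new medication?",
                "How long does it take to develop a new drug?"]
    v1, v2 = tfidf_vectors(qd_auto1, qd_auto2)
    print(round(cosine(v1, v2), 3))
```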
Evaluation of Answers. Answers were evaluated with the "responsiveness score" designed by the NIST assessors. The score provides a coarse ranking of the summaries for each topic, according to the amount of information in the summary that helps to satisfy the information need expressed in the topic statement. Four linguists assigned these scores for all three forms of answer summaries. Table 7 illustrates the responsiveness scores for Answer Summary 1, Answer Summary 2, and Answer Summary 3 from Figure 1, as well as for the human-generated summaries. The responsiveness score is measured on a scale from 1 to 5 and quantifies how well a summary answers the complex question: a score of 1 is the least responsive to the question, while a score of 5 means that the summary completely answers the question.

Topic Description   | Summary 1 | Summary 2 | Summary 3 | Human Sum
Falkland Islands    | 3.75      | 4.00      | 4.00      | 4.50
Tourist Attacks     | 2.75      | 3.00      | 3.25      | 4.75
Drug Development    | 2.00      | 2.75      | 3.00      | 4.50
Amazon Rainforest   | 3.00      | 3.25      | 4.00      | 4.75
Welsh Government    | 3.75      | 4.00      | 3.75      | 5.00
Robot Technology    | 3.00      | 3.50      | 4.00      | 4.50
U.K. Tourism        | 3.75      | 4.00      | 4.25      | 4.25
Czechoslovakia      | 2.25      | 3.00      | 4.00      | 4.50
AVERAGE             | 3.03      | 3.44      | 3.72      | 4.59
Table 7: Responsiveness Score for the Human Summaries and Answer Summaries 1, 2 and 3.

Evaluation of the Impact of the Decomposed Questions on Answer Summaries. We were also interested in evaluating the impact the question decompositions have when we select different values for the parameter α which stops the Markov chain. Figure 6 illustrates the average responsiveness score obtained when α = 0.85, α = 0.6, α = 0.5, α = 0.3 and α = 0.15. Since the question decompositions determine two different answers, as illustrated in Figure 1, we have measured the responsiveness for both of them and illustrate the results in Figure 6.

[Figure 6: Responsiveness for different α values (Answer Summary 2, Answer Summary 3, and the Human Summary).]

In a separate effort, we evaluated the impact of only the question decompositions that were considered relevant to the complex question by the random walk presented in Section 4.2. Since that random walk depends on the parameter d, we have tested the question coverage for d = 0.85, d = 0.6, d = 0.5, d = 0.3, and d = 0.15. Figure 7 illustrates the average responsiveness obtained in this case. Since only Answer Summary 3 is obtained by considering the relevance of question decompositions to the complex question, unlike Figure 6, in Figure 7 we illustrate results only for Answer Summary 3. The best results are obtained for d = 0.85. This result supports our intuition that the question decompositions need not be very different from one another, but they must be relevant to the complex question. The difference from the responsiveness of human-generated summaries indicates that relevance takes into account more sophisticated information than the information contained in questions.

[Figure 7: Responsiveness for different d values (Answer Summary 3 and the Human Summary).]

6. CONCLUSIONS
We have presented a new framework for question decomposition that allows several forms of answers to be returned for complex questions. Two forms of random walks were used. The first random walk was used for surfing the space of relations relevant to the complex question, in order to generate question decompositions. The second random walk was used for measuring the relevance of the question decompositions to the complex question. The evaluations have shown that the question decompositions lead to more relevant and complete answers. The results have also shown that comparing the coverage of automatically generated question decompositions with the questions generated from the answer summary is a better indicator of answer quality than the relevance score to the complex question. The evaluations have also indicated that the question coverage for automatic methods is 85% of the coverage of questions produced by humans. In this paper we have also described a Q/A architecture which allows feedback loops for improving the quality of answers through the coverage of question decompositions.

7. ACKNOWLEDGMENTS
This material is based upon work funded in whole or in part by the U.S. Government and any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

8. REFERENCES
[1] E. Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21(4), 1995.
[2] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.
[3] S. Harabagiu. Incremental Topic Representations. In Proceedings of the 20th COLING Conference, Geneva, Switzerland, 2004.
[4] S. Harabagiu, A. Hickl, J. Lehmann, and D. Moldovan. Experiments with Interactive Question-Answering. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 2005.
[5] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang. Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference, 2005.
[6] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams, and J. Bensley. Answer Mining by Combining Extraction Techniques with Abductive Reasoning. In Proceedings of the Twelfth Text REtrieval Conference, 2003.
[7] F. Lacatusu, A. Hickl, P. Aarseth, and L. Taylor. Lite-GISTexter at DUC 2005. In Proceedings of the Document Understanding Workshop (DUC 2005), 2005.
[8] F. Lacatusu, A. Hickl, and S. Harabagiu. Impact of Question Decomposition on the Quality of Answer Summaries. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), 2006.
[9] J. Lafferty and C. Zhai. Document Language Models, Query Models, and Risk Minimization for Information Retrieval. In Proceedings of the 2001 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2001.
[10] C.-Y. Lin and E. Hovy. The Automated Acquisition of Topic Signatures for Text Summarization. In Proceedings of the 18th COLING Conference, Saarbrücken, Germany, 2000.
[11] S. Lytinen and N. Tomuro. The Use of Question Types to Match Questions in FAQFinder. In Papers from the 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases, pages 46-53, 2002.
[12] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39-41, 1995.
[13] S. Narayanan and S. Harabagiu. Question Answering Based on Semantic Structures. In Proceedings of COLING-2004, 2004.
[14] A. Nenkova and R. Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. In HLT-NAACL 2004, Boston, MA, 2004.
[15] J. Otterbacher, G. Erkan, and D. Radev. Using Random Walks for Question-Focused Sentence Retrieval. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, Canada, 2005.
[16] M. Palmer, D. Gildea, and P. Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71-106, 2005.
[17] M. Pasca and S. Harabagiu. High Performance Question/Answering. In Proceedings of the 24th Annual International ACM SIGIR Conference, 2001.
[18] M. Thelen and E. Riloff. A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), 2002.