Automated Performance Assessment in Interactive QA

Joyce Y. Chai, Tyler Baldwin, Chen Zhang
Department of Computer Science and Engineering, Michigan State University
East Lansing, MI 48824
jchai@cse.msu.edu, baldwi96@cse.msu.edu, zhangch6@cse.msu.edu

ABSTRACT
In interactive question answering (QA), users and systems take turns asking questions and providing answers. In such an interactive setting, user questions largely depend on the answers provided by the system. One question is whether user follow-up questions can provide feedback for the system to automatically assess its own performance (e.g., whether a correct answer was delivered). This self-awareness can make QA systems more intelligent for information seeking, for example, by adapting better strategies to cope with problematic situations. This paper describes our initial investigation of this problem. Our results indicate that interaction context can provide useful cues for automated performance assessment in interactive QA.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search Process
General Terms: Experimentation, Performance
Keywords: Performance Assessment, User Behavior, Interactive Question Answering

1. INTRODUCTION
Interactive question answering has been identified as one of the important directions in QA research [1]. In interactive QA, users and systems take turns asking questions and providing answers. In such an environment, the questions formed by a user depend not only on his/her information goals but are also influenced by the answers from the system. Because of this dependency, our assumption is that user follow-up questions can provide feedback for the system to assess the status of preceding answers (e.g., whether a correct answer was delivered). Awareness of its own performance will enable the system to automatically adapt better strategies to cope with problematic situations. To our knowledge, there has not been much work that addresses this important aspect of interactive QA. This paper describes our initial investigation of this problem.

Given a question Qi and its corresponding answer Ai, the specific question examined here is whether the user's language behavior in the follow-up question Qi+1, together with the interaction context, can help the system assess its performance in answering the preceding question (i.e., the status of Ai). To address this question, we conducted a user study in which users interacted with a controlled QA system to find information of interest. Our studies indicate that when the system fails to deliver a desired answer, users do exhibit distinctive language behavior (e.g., rephrasing the question) in the follow-up question in response to the problematic situation. User behavior and interaction context can therefore provide important cues for a QA system to automatically identify problematic situations. Based on the data collected from our studies, we experimented with three classifiers (Support Vector Machine, Maximum Entropy Model, and Decision Tree). Our results indicate that the Decision Tree model can detect problematic situations with 73.8% accuracy, which is significantly better than the baseline.
2. USER STUDIES
To investigate the role of interaction context in automated performance assessment, we conducted a controlled user study in which a human wizard was involved in the interaction loop to control and simulate problematic situations. Users were not aware of the existence of this human wizard and were led to believe they were interacting with a real QA system. This controlled setting allowed us to focus on the interaction aspect rather than the information retrieval or answer extraction aspects of question answering.

More specifically, after each question was issued during interaction, a random number generator was used to decide whether a problematic situation should be introduced. If not, the wizard retrieved a passage from a database of correct question/answer pairs. Because our experiments used specific task scenarios (described below), it was possible to anticipate user information needs and create this database in advance. If the random number indicated that a problematic situation should be introduced, the Lemur retrieval engine (http://www-2.cs.cmu.edu/~lemur/) was used on the AQUAINT collection to retrieve the answer. Our assumption is that the AQUAINT data are unlikely to provide an exact answer for our specific scenarios, but they can provide a passage that is closely related to the question. The random number generator was used to control the ratio between the occurrence of problematic situations and error-free situations. In this initial investigation, since we are interested in observing user behavior in problematic situations, we set the ratio to 50/50. As a result, the simulation generated 56% error-free situations and 44% problematic situations. In future work, we will vary this ratio (e.g., 70/30) to reflect the performance of state-of-the-art factoid QA and investigate the implications of this ratio for automated performance assessment.

Eleven users participated in our study. Each user was asked to interact with our system to complete information seeking tasks related to four specific scenarios, each focused on a separate topic: the 2004 presidential debates, Tom Cruise, Hawaii, and Pompeii. As a result of this study, a total of 456 QA exchanges from 44 interactive sessions were collected, where each answer was annotated with a binary tag indicating whether or not the answer was problematic.
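For concreteness, the wizard's decision procedure can be sketched as follows. This is a minimal illustration in Python rather than the software used in the study; lookup_gold_passage and retrieve_with_lemur are hypothetical stand-ins for the scenario database and the Lemur/AQUAINT retrieval component.

```python
import random

PROBLEMATIC_RATIO = 0.5  # 50/50 in this study; e.g., 0.3 would give a 70/30 split

def lookup_gold_passage(question: str) -> str:
    """Stand-in for the pre-built database of correct question/answer pairs."""
    return "<correct passage for: %s>" % question

def retrieve_with_lemur(question: str) -> str:
    """Stand-in for Lemur retrieval over the AQUAINT collection."""
    return "<related but likely inexact passage for: %s>" % question

def wizard_answer(question: str) -> tuple[str, bool]:
    """Return the passage shown to the user and whether the exchange was
    deliberately made problematic."""
    if random.random() < PROBLEMATIC_RATIO:
        return retrieve_with_lemur(question), True   # simulated problematic situation
    return lookup_gold_passage(question), False      # error-free situation
```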
3. PERFORMANCE ASSESSMENT
We formulate automated performance assessment as a classification problem. Given a question Qi with a corresponding answer Ai, our goal is to decide whether Ai is problematic based on the follow-up question Qi+1 and the interaction context. More specifically, the following features are used:

(1) Target matching (TM): a binary feature indicating whether the target type of Qi+1 is the same as the target type of Qi. Our data show that repetition of the target type may indicate a question rephrase, which could signal that a problematic situation has just occurred.

(2) Named entity matching (NEM): a binary feature indicating whether all the named entities in Qi+1 also appear in Qi. If no new named entity is introduced in Qi+1, it is likely that Qi+1 is a rephrase of Qi.

(3) Similarity between questions (SQ): a numeric feature measuring the similarity between Qi+1 and Qi.

(4) Similarity between content words of questions (SQC): similar to SQ, except that the similarity is measured over content words excluding named entities. This prevents the similarity measurement from being dominated by the named entities.

(5) Similarity between Qi and Ai (SA).

(6) Similarity between Qi and Ai based only on content words, excluding named entities (SAC).

To measure the similarity between two chunks of text T1 and T2, we applied the following measure proposed by Lin [2]:

sim(T_1, T_2) = \frac{\sum_{w \in T_1 \cap T_2} -\log P(w)}{\sum_{w \in T_1 \cup T_2} -\log P(w)}

where P(w) was calculated based on 1806 pseudo-documents (i.e., question/answer pairs) from previous TREC evaluations.

We experimented with three classification approaches, namely the Maximum Entropy Model from MALLET (http://mallet.cs.umass.edu/index.php/), SVM from SVMlight (http://svmlight.joachims.org/), and Decision Trees from WEKA (http://www.cs.waikato.ac.nz/ml/weka/), based on ten-fold cross-validation (90% of the data was used for training and 10% for testing in each trial).
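For illustration, one plausible implementation of the six cues and the similarity measure is sketched below in Python. This is a simplified sketch rather than our actual implementation: target typing, named entity recognition, and content-word filtering are replaced by crude stand-ins (target_type, named_entities, content_words), and the add-one smoothing in make_neg_log_prob is an assumption made only for the sketch.

```python
import math
from collections import Counter

# --- Crude stand-ins for components not detailed here ----------------------
def target_type(question: str) -> str:
    """Expected answer type of a question (e.g., PERSON, DATE); a real system
    would use a question classifier."""
    return "OTHER"

def named_entities(text: str) -> set[str]:
    """Named entities in the text; a real system would use an NER tool."""
    return {w.lower() for w in text.split() if w[:1].isupper()}

def content_words(text: str) -> set[str]:
    """Content words (rough stand-in: any token longer than three characters)."""
    return {w.lower() for w in text.split() if len(w) > 3}
# ---------------------------------------------------------------------------

def make_neg_log_prob(pseudo_docs: list[list[str]]):
    """Estimate -log P(w) from tokenized pseudo-documents
    (here, question/answer pairs from earlier TREC evaluations)."""
    counts = Counter(w for doc in pseudo_docs for w in doc)
    total, vocab = sum(counts.values()), len(counts)
    return lambda w: -math.log((counts[w] + 1) / (total + vocab))  # add-one smoothing (assumed)

def lin_similarity(t1: set[str], t2: set[str], neg_log_prob) -> float:
    """Lin (1998)-style similarity: information content shared by T1 and T2
    divided by the information content of T1 and T2 together."""
    denom = sum(neg_log_prob(w) for w in t1 | t2)
    return sum(neg_log_prob(w) for w in t1 & t2) / denom if denom else 0.0

def extract_features(q_i: str, a_i: str, q_next: str, neg_log_prob) -> dict:
    """The six cues computed from (Q_i, A_i) and the follow-up question Q_{i+1}."""
    return {
        "TM":  target_type(q_next) == target_type(q_i),                        # (1)
        "NEM": named_entities(q_next) <= named_entities(q_i),                  # (2)
        "SQ":  lin_similarity(set(q_next.lower().split()),
                              set(q_i.lower().split()), neg_log_prob),         # (3)
        "SQC": lin_similarity(content_words(q_next) - named_entities(q_next),
                              content_words(q_i) - named_entities(q_i),
                              neg_log_prob),                                   # (4)
        "SA":  lin_similarity(set(q_i.lower().split()),
                              set(a_i.lower().split()), neg_log_prob),         # (5)
        "SAC": lin_similarity(content_words(q_i) - named_entities(q_i),
                              content_words(a_i) - named_entities(a_i),
                              neg_log_prob),                                   # (6)
    }
```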
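The evaluation setup can be sketched in the same spirit. Our experiments used MALLET, SVMlight, and WEKA; purely for illustration, the sketch below substitutes scikit-learn counterparts (a linear SVM, logistic regression as a maximum entropy model, and a decision tree), together with the most-frequent-class baseline and ten-fold cross-validation.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    """Ten-fold cross-validated accuracy for the baseline and three classifiers.
    X: feature vectors (e.g., the six cues above); y: 1 = problematic, 0 = error-free."""
    models = {
        "Baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
        "SVM": LinearSVC(),
        "MaxEnt": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f}")
```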
Table 1 shows the accuracy of the three approaches in identifying problematic/error-free situations using different combinations of features. The baseline was obtained by always assigning the most frequent class (i.e., the 56% error-free situations in our data).

    Features                        SVM    MaxEnt  DTree
(1) Baseline                        56.4   56.4    56.4
(2) NEM, SQC                        61.7   61.7    61.7
(3) TM, SQ                          63.7   64.1    61.7
(4) TM, SQ, SA                      68.3*  69.2*   72.1
(5) TM, NEM, SQ, SQC, SA            66.8   66.6    71.3
(6) TM, NEM, SQ, SQC, SA, SAC       67.3   67.7    73.8*

Table 1: Accuracy of automated performance assessment based on three approaches. The best performance for each model is marked with an asterisk.

The Decision Tree model achieves the best performance of 73.8% in identifying problematic situations, which is more than 17 percentage points better than the baseline. Different combinations of features result in different performance for all three models. In general, feature sets that consider different forms of question/answer similarity work better than those that do not.

Identification of problematic situations can be considered a form of implicit feedback. One might think that an alternative is to ask users for explicit feedback (for example, with a feedback button). However, soliciting feedback after each question would not only frustrate users and lengthen the interaction, but may also be impractical on certain devices (e.g., PDAs). Therefore, our focus here is on the more challenging problem of identifying problematic situations through implicit feedback. In real interaction, explicit and implicit feedback should be intelligently combined; for example, if the confidence in identifying a problematic or error-free situation is low, explicit feedback could be solicited.

4. CONCLUSION
This paper presents our initial investigation of automated performance assessment in interactive question answering. Our studies indicate that when a problematic situation occurs (i.e., the retrieved answer does not appear to be correct), users exhibit distinctive behavior such as rephrasing the question. Follow-up questions and interaction context can provide useful cues for the system to automatically evaluate its performance. Although our current evaluation is based on the data collected from our study, the same approach can be applied during online processing as a question answering session proceeds. Such performance assessment can provide direct feedback to a QA system about which questions it may have answered correctly and which it may have had trouble with. This will not only allow the system to automatically adapt better strategies during online processing, but will also provide a mechanism to automatically build databases of question/answer pairs for other applications (e.g., collaborative question answering).

5. REFERENCES
[1] J. Burger et al. Issues, tasks and program structures to roadmap research in question & answering. NIST Roadmap Document, 2001.
[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, Wisconsin, July 1998.