Quantitative Analysis of the Impact of Judging Inconsistency on the Performance of Relevance Feedback

Xiangyu Jin, University of Virginia, 151 Engineer Way, Charlottesville VA 22904, USA, xiangyu@virginia.edu
James French, University of Virginia, 151 Engineer Way, Charlottesville VA 22904, USA, French@cs.virginia.edu
Jonathan Michel, Science Applications International Corporation, Charlottesville VA 22911, USA, Jonathan.D.Michel@saic.com

ABSTRACT
Practical constraints of user interfaces make the user's judgments during the feedback loop deviate from her real opinion once the full document is read. This is often overlooked in the evaluation of relevance feedback. This paper quantitatively analyzes the impact of judging inconsistency on the performance of relevance feedback.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Relevance Feedback

General Terms
Performance, Experimentation, Measurement

Keywords
Relevance Feedback, Judging Inconsistency, Performance Evaluation

1. INTRODUCTION
Relevance feedback has historically been proven effective for information retrieval [4] [6]. Since user-in-the-loop evaluation is very costly, large-scale evaluation of relevance feedback usually employs machine-simulated users instead of real human users. In such experiments, the surrogates generated by the retrieval system (e.g., judging whether a provided document is relevant from its title) are answered by machine based on a predefined groundtruth. (In the following we focus our study on document-level feedback.) This assumes perfect consistency between the relevance judgments made during the feedback loop and the relevance judgments made after the full document is read when deciding the groundtruth.

However, such perfect consistency is very hard to achieve in practice. On the one hand, practical constraints, such as the time allowed for judging, the screen resolution, etc., restrict the information delivered to the user during the feedback loop. For example, in text retrieval we can only show the user a document's title, key terms, or abstract instead of the full document. On the other hand, real human users can be inconsistent with themselves. As the user learns more about her information need during the feedback process, she may change her criteria of relevance in subtle ways. What she considers relevant during the feedback loop may not be considered relevant after the retrieval process terminates. Neglecting such judging inconsistency may result in exaggerated performance gains for relevance feedback. To date, the impact of such inconsistent judgments on the performance of refined search has not been carefully studied. In this paper we explore the answers to the following research questions.

How often does such judging inconsistency happen? The answer to this question depends on the specific user interface and retrieval environment. Unfortunately, it is unrealistic to enumerate all possible user interfaces and perform large-scale tests over various environments, since human subjects are involved. In this paper we focus on case studies of recent years' TREC HARD tracks, where reasonable user interfaces (the clarification forms) are designed and reasonable relevance judgments are made (by NIST assessors). Such case studies can help us better understand the state of the art.

How does such judging inconsistency affect the refined retrieval performance?
If such judging inconsistency is inevitable, we want to know how it affects the refined search performance. Studying this question helps us estimate whether a relevance feedback technique would indeed help a retrieval application in practice. Moreover, relevance feedback algorithms that perform similarly with a "perfect user" may perform quite differently with real users; in this sense, some relevance feedback algorithms are more robust than others. In this paper, we make an initial attempt to study the above two questions.

2. CASE STUDIES
TREC (Text REtrieval Conference)'s HARD (High Accuracy Retrieval from Documents) track (http://trec.nist.gov/tracks.html) provides an opportunity to quantitatively analyze such judging inconsistency and its impact on the performance of relevance feedback. HARD provides large-scale, centrally-administered evaluations for retrieval systems that allow one round of interaction. Basically, HARD splits the retrieval process into three phases: baseline, clarifying, and final. Initially, each participant generates baseline runs by performing traditional ad-hoc retrieval over the HARD corpus. In the second phase, "clarification forms" (CFs) are generated for each topic. These CFs are submitted to NIST assessors. Later, the filled-in CFs are returned to the HARD participants so that they can generate their refined search results (final runs).

Although HARD is not designated for the relevance feedback task, it is an ideal environment in which to study the previous two questions. First, HARD is based on large-scale evaluation (about 1M documents) and upon metrics historically proven effective in past TRECs. Second, the assessor who answers the CFs for a specific topic is the one who decides its groundtruth; this eliminates inconsistency among different human subjects. Third, practical constraints are imposed on the CFs. For example, each CF must be filled in within 3 minutes. Fourth, the assessor answers the CFs independently of the participants. This is extremely important in keeping the bias of the interface developers from creeping into the evaluation. Finally, the CFs and their judging results are available for research purposes.

In the following, we choose UIUC's HARD 2003 submission (2 CFs) [5] and SAIC's HARD 2005 submission (1 CF) [1] for analysis, because these are the CFs for which we currently know the association between the surrogates (on the CFs) and the documents they stand for. The settings are listed in Table 1.

Table 1: CF Settings

  Name             UIUC-1,2              SAIC1
  # Surrogates     6                     8
  Time Limit       3 minutes             3 minutes
  Display Content  Abstract              Keywords, Title, and Abstract
  User Interface   Shown directly        Shown on mouse-over of the title
  Choices          Relevant/Irrelevant   Relevant/Perhaps/Irrelevant
  Default          Irrelevant            Perhaps

We find that the relevance judgments made during the feedback loop are not fully consistent with the relevance judgments made after the full document is read, even for the same user. 19.6% of the documents of UIUC-1, 21.7% of UIUC-2, and 22.9% of SAIC1 are judged inconsistently (excluding those left as "perhaps"). The detailed results are listed in Table 2.

Table 2: Judging Inconsistency (the * indicates the inconsistent part)

  CF      Groundtruth   Judged-Rel   Judged-Irrel   Perhaps
  UIUC-1  Rel           28.7%        *6.3%          --
  UIUC-1  Irrel         *13.3%       51.7%          --
  UIUC-2  Rel           36.3%        *4.7%          --
  UIUC-2  Irrel         *17.0%       42.0%          --
  SAIC1   Rel           25.3%        *9.8%          12.3%
  SAIC1   Irrel         *7.8%        34.0%          11.0%

Afterward, we analyze how the judging inconsistency affects the performance of relevance feedback. We compare the refined search performance when the CFs are judged by different users in the HARD 2005 environment: a perfect user (who judges by the groundtruth), a blind user (who judges everything as relevant), a real user (the judging results of SAIC1), and a simulated user (who randomly makes 30% of its judgments inconsistent with the groundtruth). We use a BM25-ranked retrieval system to generate the baseline search result. The top 20 terms from each judged-relevant document are extracted and combined with the initial query by the Rocchio method [3]. We only consider positive relevance feedback at this time. Mean average precision (MAP) is reported as the evaluation metric. In order to compare the results fairly, rank shifting [2] is applied to both the baseline and the refined search results: among the documents listed for judging, those judged relevant are moved to the head of the result list and those judged irrelevant to the end.
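To make this setup concrete, the following is a minimal sketch of the simulated user, the positive-only Rocchio-style expansion, and rank shifting as described above. It assumes a toy in-memory corpus; the document ids, query, tokenizer, and term-selection weighting are hypothetical placeholders, not the actual BM25 configuration used to produce the numbers reported below.

# Minimal sketch (Python) of the simulated user, positive-only Rocchio-style
# expansion, and rank shifting. The corpus, tokenizer, and term weighting are
# illustrative placeholders, not the BM25 system used in our experiments.
import random
from collections import Counter

def simulate_user(groundtruth, flip_rate=0.3, seed=0):
    """Simulate an inconsistent user: flip each groundtruth judgment
    (True = relevant) with probability flip_rate."""
    rng = random.Random(seed)
    return {doc_id: (not rel) if rng.random() < flip_rate else rel
            for doc_id, rel in groundtruth.items()}

def top_terms(doc_text, k=20):
    """Select the k most frequent terms of a document as expansion terms
    (a stand-in for the actual term-selection weighting)."""
    return [t for t, _ in Counter(doc_text.lower().split()).most_common(k)]

def rocchio_positive_expansion(query_terms, judged, docs, k=20):
    """Positive-only Rocchio-style expansion: combine the initial query with
    the top-k terms from each document judged relevant."""
    expanded = list(query_terms)
    for doc_id, is_rel in judged.items():
        if is_rel:
            expanded.extend(top_terms(docs[doc_id], k))
    return expanded

def rank_shift(ranked_list, judged):
    """Rank shifting: among the documents shown for judging, move those judged
    relevant to the head of the list and those judged irrelevant to the end."""
    head = [d for d in ranked_list if judged.get(d) is True]
    tail = [d for d in ranked_list if judged.get(d) is False]
    rest = [d for d in ranked_list if d not in judged]
    return head + rest + tail

# Usage with hypothetical data:
docs = {"d1": "wetland restoration policy report ...", "d2": "unrelated sports news ..."}
groundtruth = {"d1": True, "d2": False}   # judgments on the full documents
judged = simulate_user(groundtruth)       # judgments made in the feedback loop
expanded_query = rocchio_positive_expansion(["wetland", "policy"], judged, docs)
reranked = rank_shift(["d2", "d1", "d3"], judged)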
The results are shown in Table 3.

Table 3: Performance of Relevance Feedback

  Run Name        MAP      Inconsistency Rate
  Baseline        0.2580   --
  Perfect User    0.3629   0
  Blind User      0.2953   0.5275
  Real User       0.3069   0.2975
  Simulated User  0.3064   0.2937

Interestingly, we find that relevance feedback in practice performs much worse than it could. If the user judged the documents consistently with the groundtruth, the refined MAP would reach 0.3629, much higher than the baseline of 0.2580. However, the performance of relevance feedback with a real user is around 0.30, which is comparable to the performance of pseudo-relevance feedback. This indicates that the bottleneck for current relevance feedback applications lies in judging inconsistency rather than in the relevance feedback algorithm itself. With better user interfaces and more appropriate information delivered to the user, relevance feedback can be greatly improved. Moreover, our simulated user gives a more appropriate estimate of relevance feedback's performance in practice than the perfect user does.

3. REFERENCES
[1] X. Jin, J. French, and J. Michel. SAIC and University of Virginia at TREC 2005: HARD track. In TREC, 2005.
[2] X. Jin, J. C. French, and J. Michel. Toward consistent evaluation of relevance feedback approaches in multimedia retrieval. In Adaptive Multimedia Retrieval, pages 191-206, 2005.
[3] J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall, 1971.
[4] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. JASIS, 41(4):288-297, 1990.
[5] X. Shen and C. Zhai. Active feedback - UIUC TREC-2003 HARD experiments. In TREC, pages 662-666, 2003.
[6] H. Zhang and Z. Su. Relevance feedback in CBIR. In VDB, pages 21-35, 2002.