A Platform for Okapi-Based Contextual Information Retrieval

Xiangji Huang
School of Information Technology
York University
Toronto, Canada
jhuang@yorku.ca

Miao Wen, Aijun An and Yan-Rui Huang
Computer Science Department
York University
Toronto, Canada
{mwen, aan, yhuang}@cs.yorku.ca

ABSTRACT
We present an extensible Java-based platform for contextual retrieval based on the probabilistic information retrieval model. Modules for dual indexes, relevance feedback with blind or machine learning approaches, and query expansion with context are integrated into the Okapi system to handle contextual information. The platform can easily be extended to include other types of contextual information.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Retrieval

General Terms
Design

Keywords
Contextual Information Retrieval, Probabilistic Model

Copyright is held by the author/owner. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
Most information retrieval systems do not consider the context information related to a query. Given a query, such systems return a set of documents related to that specific query. The retrieval decision is based primarily on the current query and the document collection. However, different users may have different needs even when they submit the same query. For example, if a user searching for articles on "Information Retrieval in Context (IRIX)" enters "IRIX" in Google, the results will be a mixture of articles about the IRIX operating system and articles about "Information Retrieval in Context (IRIX)". However, if the user provides additional information, say "Information Retrieval", then the search system can distinguish documents about the IRIX operating system from those about IRIX information retrieval. It is therefore reasonable for an IR system to take contextual information into account. When additional contextual information is available, the system should use it to retrieve and rank relevant documents. Okapi, one of the best-known IR systems, does not contain any functionality for handling contextual information. To enrich the Okapi system, we developed an extensible platform for contextual retrieval.

2. SYSTEM OVERVIEW
Figure 1 depicts the overall structure of the platform. We use Okapi BSS as the underlying retrieval system. On top of it, we implement a prototype system for contextual retrieval and develop our own dual index module, machine learning feedback module, and query expansion with context module [1]. The context information we consider in this platform includes clarification forms (CF) and some metadata (see footnote 1) associated with each topic.

[Figure 1: System Architecture. Components shown: inputs (Metadata, Topic, CF, Implicit); Topic processing module; Term processing module; Machine learning feedback module; Result processing module; Dual Index module; Final Result; JNI (Java Native Interface); Okapi BSS, Okapi 2.31+ (BM25); Database.]

The system takes the clarification forms (CF) as explicit feedback from users. The topic processing module loads the topic files, extracts the relevant fields and converts them into the format accepted by the term processing module. The term processing module extracts terms from the full-text paragraphs of the processed topic. The machine learning feedback module takes the blind feedback passages from the result processing module, applies machine learning algorithms to select additional relevant passages that are not in the blind feedback, and passes the expanded set of feedback passages back to the term processing module. The term processing module then expands the query terms, updates the statistics for the terms, and issues a new query to Okapi BSS. This Java-based platform also allows us to add more modules to the existing system easily.
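
As an illustration of the query expansion step described above, the following is a minimal sketch rather than the platform's actual code. The class name FeedbackLoopSketch, the method expandQuery and the parameter maxNewTerms are hypothetical. The paper does not state how expansion terms are chosen, so the sketch simply ranks candidate terms by raw frequency in the feedback passages, which is an assumption made for illustration only.

```java
import java.util.*;

// Hypothetical sketch: expands a query with terms drawn from feedback passages.
// Ranking by raw frequency is an assumed stand-in, not the platform's method.
final class FeedbackLoopSketch {

    /**
     * Returns the original query terms followed by up to maxNewTerms new terms,
     * chosen as the most frequent non-query tokens in the feedback passages.
     * Query terms are assumed to be lowercase already.
     */
    static Set<String> expandQuery(Set<String> query,
                                   List<String> feedbackPassages,
                                   int maxNewTerms) {
        // Count every token that is not already a query term.
        Map<String, Integer> counts = new HashMap<>();
        for (String passage : feedbackPassages) {
            for (String token : passage.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty() || query.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Sort by descending count; break ties alphabetically for determinism.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep the original terms first, then append the selected new terms.
        Set<String> expanded = new LinkedHashSet<>(query);
        int limit = Math.min(maxNewTerms, ranked.size());
        for (int i = 0; i < limit; i++) {
            expanded.add(ranked.get(i).getKey());
        }
        return expanded;
    }

    public static void main(String[] args) {
        Set<String> query = new LinkedHashSet<>(Arrays.asList("irix"));
        List<String> passages = Arrays.asList(
                "IRIX information retrieval",
                "information retrieval in context",
                "retrieval models");
        // Prints [irix, retrieval, information]
        System.out.println(expandQuery(query, passages, 2));
    }
}
```

In a full pipeline, the expanded term set would be handed back to the retrieval engine as a new query. A real system would normally also filter stop words and weight terms by collection statistics instead of raw counts.
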
We are currently implementing the implicit user feedback module, which includes search history, past queries and clicked search results. This module will capture a user's implicit search context and can be integrated into the system's retrieval process.

3. REFERENCES
[1] Xiangji Huang, Yan-Rui Huang, Miao Wen. A Dual Index Model for Contextual Information Retrieval. In Proc. of the 28th ACM SIGIR 2005.

Footnote 1: such as Genre, Geography, Granularity, Familiarity, Subject and Related Text