Semantic Term Matching in Axiomatic Approaches to Information Retrieval

Hui Fang, Department of Computer Science, University of Illinois at Urbana-Champaign
ChengXiang Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign

ABSTRACT

A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model by defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use these constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such an extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo feedback method, our method also outperforms a state-of-the-art language modeling feedback method.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms: Experimentation
Keywords: Axiomatic model, retrieval heuristics, constraints, query expansion, feedback

1. INTRODUCTION

The axiomatic approach to information retrieval was proposed recently as a new retrieval framework in which relevance is modeled by term-based retrieval constraints [5, 6]. Several new retrieval functions have been derived using this approach and shown to be less sensitive to parameter settings than existing retrieval functions with comparable optimal performance; using a fixed parameter value can often achieve near-optimal performance on most test sets [6].

However, like most traditional retrieval models, these new axiomatic retrieval functions also have the limitation that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the query and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. Since it is unlikely that the authors of relevant documents always use exactly the same terms as a user would use in a query, such a limitation makes the retrieval performance of existing models non-optimal.
For example, given the query "car", intuitively, a single-term document with the term "vehicle" should have a higher score than a single-term document with the term "fish", because "car" is semantically more related to "vehicle" than to "fish". However, existing retrieval models do not directly support such semantic matching and would treat both documents equally as not matching the query. Although techniques such as query expansion and pseudo-relevance feedback can support semantic term matching to a certain extent, query expansion purely based on semantic relations between terms has so far not been very successful, presumably because of the difficulty in assigning appropriate weights to the new terms [26, 14]. Pseudo feedback is much more successful, but the semantic matching it exploits is restricted by the top-ranked documents, and it does not allow us to incorporate external resources.

In this paper, we show that there is a natural way to incorporate semantic term matching into the inductively defined axiomatic retrieval functions: all we need to do is generalize the primitive weighting function to incorporate the semantic similarity of terms. Specifically, we follow the spirit of axiomatic approaches [5, 6] and formally define several constraints on semantic term matching. We then use these constraints to provide guidance on how to compute term semantic similarity and how to regularize the weights of the original terms and the semantically related terms. We show that our method can be implemented as query expansion in the axiomatic framework, and when term similarity is computed using feedback documents, it can also be regarded as a method for pseudo feedback in the axiomatic approaches.

We conduct experiments over several representative data sets. The results show that the proposed semantic expansion works well for all six inductively defined axiomatic retrieval functions, significantly improving the retrieval accuracy over the original baseline axiomatic retrieval functions on all the data sets we experimented with. Moreover, the analysis of the semantic term matching constraints can predict parameter bounds that are consistent with the empirically discovered optimal ranges of the parameters. Furthermore, as a pseudo feedback method, our method outperforms a state-of-the-art language modeling approach for pseudo feedback [30] due to its capability of selecting better terms for query expansion.

The rest of the paper is organized as follows. We discuss related work in Section 2 and briefly review the existing axiomatic approach to IR in Section 3. We then present our work on incorporating semantic term matching into the axiomatic framework in Section 4, and discuss experiment results in Section 5. Finally, we conclude in Section 6.

2. RELATED WORK

Many studies have tried to bridge the vocabulary gap between documents and queries in traditional retrieval models, mostly based on either co-occurrence-based thesauri [11, 24, 18, 20, 10, 29, 23, 2] or hand-crafted thesauri [26, 12]; some researchers have used both [14, 4]. Although our general strategy is applicable to both types of thesauri, in this paper we focus on the use of co-occurrence-based thesauri and leave other possibilities as future work.

The earliest studies of co-occurrence-based thesauri can be traced back to the early sixties. Lesk [11] studied term expansion in the vector space model, where term similarity is computed based on the cosine coefficient [22]. Smeaton et al.
[24] studied query expansion based on the classical probabilistic model. These early studies suggested that query expansion based on term co-occurrences is unlikely to significantly improve performance [18]. Qiu et al. [20] showed that adding terms that have the greatest similarity to the entire query, rather than to individual query terms, can obtain more improvement. Xu et al. [29] showed that analyzing word occurrences and relationships on a local set of documents (i.e., the top-ranked documents retrieved by the original query) yields better performance than analyzing the whole corpus. In language modeling approaches, Berger et al. [3] proposed a translation model to incorporate term relationships into language models. Cao et al. [4] extended the translation model to integrate both co-occurrence and hand-crafted thesauri and achieved reasonable performance improvements. Bai et al. [2] showed that query expansion based on co-occurrences can improve the performance of language modeling approaches. Although the motivation is similar, our work differs from the previous work in that (1) we integrate term semantic relationships as a component of our retrieval model, and (2) we take an axiomatic approach and define constraints to guide the incorporation of semantic term matching.

Similar to previous work [11, 24, 18, 20, 23, 2], our method can also be implemented as query expansion. Thus, when we compute term similarity based on the documents in the collection, it bears some similarity to traditional feedback methods [21, 30, 19, 16], which also select terms from documents to expand the query. But our method selects terms that are semantically related to each individual query term and relies on the axiomatic approach to combine them, while feedback methods select terms that discriminate the feedback documents, which are not necessarily related to any individual query term. Because of this difference, our method is complementary to traditional feedback methods. Indeed, our experiment results show that they can be combined to further improve performance.

3. AXIOMATIC RETRIEVAL MODEL

The basic idea of the axiomatic approach to information retrieval is to search in a space of candidate retrieval functions for one that satisfies a set of reasonable retrieval constraints; the assumption is that if a retrieval function satisfies all the desirable retrieval constraints, it is likely to be effective empirically [5, 6]. Compared with other retrieval models, this approach has the advantage of connecting relevance more directly with terms through formalized retrieval constraints. In [6], several interesting new retrieval functions were derived using formalized retrieval constraints and an inductive decomposition of the function space. These new functions are shown to perform as well as traditional retrieval functions but with much more robust parameter settings.

The inductive definition decomposes a retrieval function into three component functions: a primitive weighting function, a document growth function, and a query growth function. The primitive weighting function gives the score of a one-term document {d} and a one-term query {q}. It is usually instantiated as

$$S(\{q\},\{d\}) = \begin{cases} \omega(q) & q = d \\ 0 & q \neq d \end{cases} \qquad (1)$$

where $\omega(q)$ is an IDF-like function of q [6]. The query growth function describes the score change when we add a term to a query, and is instantiated as $S(Q \cup \{q\}, D) = S(Q, D) + S(\{q\}, D)$. The document growth function describes the score change when we add a term to a document, and is instantiated based on some existing retrieval functions.
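To make the inductive definition concrete, here is a minimal Python sketch of the primitive weighting and query growth components under exact matching. The function names and the choice of $(N/df)^{0.35}$ for the IDF-like weight (borrowed from F2-EXP below) are our own illustrative assumptions, not part of the framework itself.

```python
# A minimal sketch of the inductive definition under exact term matching.
# Any reasonable IDF-like omega(q) could be plugged in; we borrow F2-EXP's.

def omega(term, N, df):
    """IDF-like weight of a term: (N / df(term))^0.35."""
    return (N / df[term]) ** 0.35

def S_primitive(q, d, N, df):
    """Equation (1): score of one-term query {q} against one-term doc {d}."""
    return omega(q, N, df) if q == d else 0.0

def S_query_growth(S, Q, q, D):
    """Query growth: S(Q ∪ {q}, D) = S(Q, D) + S({q}, D)."""
    return S(Q, D) + S([q], D)
```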
The instantiation corresponding to Okapi¹ is

$$S(Q, D \cup \{d\}) = \sum_{t \in D - \{d\}} S(Q, \{t\}) \cdot \delta(|D|+1, c(t,D)) + S(Q, \{d\}) \cdot \delta(|D|+1, c(d,D)+1)$$

where $\delta(x, y) = \frac{y}{\frac{b \cdot x}{avdl} + b + y}$, $0 \leq b \leq 1$, |D| is the document length, and c(t, D) is the count of term t in D.

¹ This instantiation of the document growth function is more general than the one given in [6], which is $S(Q, D \cup \{d\}) = \sum_{t \in D \cap Q - \{d\}} S(Q, \{t\}) \cdot \delta(|D|+1, c(t,D)) + S(Q, \{d\}) \cdot \delta(|D|+1, c(d,D)+1)$. Given that $q \neq d$ implies $S(\{q\},\{d\}) = 0$, the two instantiations are equivalent.

In general, a different instantiation of these component functions results in a different retrieval function. In [6], several such inductively defined axiomatic retrieval functions are derived, and they are all shown to be effective. The following function (F2-EXP) is one of the best performing ones; it will be used in this paper as an example axiomatic retrieval function to illustrate how we can incorporate semantic term matching. However, the proposed method can be easily applied to all the other derived functions.

$$S(Q, D) = \sum_{t \in Q \cap D} c(t, Q) \cdot \left(\frac{N}{df(t)}\right)^{0.35} \cdot \frac{c(t, D)}{c(t, D) + b + \frac{b \cdot |D|}{avdl}}$$
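As an illustration, F2-EXP can be implemented directly from the formula above. This is a minimal sketch assuming documents and queries are given as lists of already-stemmed terms and that the collection statistics N, df, and avdl are precomputed; the function names are ours.

```python
from collections import Counter

def f2_exp(query, doc, N, df, avdl, b=0.5):
    """Score a document with F2-EXP:
    sum over t in Q∩D of c(t,Q) * (N/df(t))^0.35 * c(t,D)/(c(t,D) + b + b*|D|/avdl).
    """
    c_q, c_d = Counter(query), Counter(doc)
    dl = len(doc)
    score = 0.0
    for t, ctq in c_q.items():
        ctd = c_d.get(t, 0)
        if ctd > 0:
            idf = (N / df[t]) ** 0.35
            tf = ctd / (ctd + b + b * dl / avdl)
            score += ctq * idf * tf
    return score
```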
4. INCORPORATING SEMANTIC TERM MATCHING

In this section, we show how we can naturally incorporate semantic term matching into the inductively defined axiomatic retrieval model proposed in [6]. Following the spirit of axiomatic approaches, we first define three constraints on semantic term matching.

4.1 Semantic Term Matching Constraints

Let $s(t, u) \in [0, +\infty]$ be any given semantic similarity function between two terms t and u. Without loss of generality, we assume that term t is semantically more similar to term u than to term v if and only if s(t, u) > s(t, v), i.e., a larger value of s indicates a higher similarity. Since intuitively a term has the highest similarity to itself, we assume $\forall u \neq t$, s(t, t) > s(t, u). We also assume that s is symmetric, i.e., $\forall t, u$, s(t, u) = s(u, t). Based on such a semantic similarity function, we now define three constraints that we would like any reasonable retrieval function to satisfy.

STMC1: Let Q = {q} be a query with only one term q. Let D1 = {d1} and D2 = {d2} be two single-term documents, where q ≠ d1 and q ≠ d2. If s(q, d1) > s(q, d2), then S(Q, D1) > S(Q, D2).

STMC1 requires a retrieval function to give a higher score to a document with a term that is semantically more related to the query term. Thus, even though D1 and D2 do not match the query Q syntactically, we would like D1 to have a higher score, because term d1 is more semantically related to query term q than term d2 is. Clearly, STMC1 directly constrains the primitive weighting function.

STMC2: Let Q = {q} be a single-term query and d be a non-query term such that s(q, d) > 0. If D1 and D2 are two documents such that |D1| = 1, c(q, D1) = 1, |D2| = k and c(d, D2) = k (k ≥ 1), where c(q, D1) and c(d, D2) are the counts of q and d in D1 and D2 respectively, then S(Q, D1) ≥ S(Q, D2).

STMC2 requires that matching an original query term q exactly should always contribute no less to the relevance score than matching a semantically related term d, no matter how many times term d occurs in the document.

STMC3: Let Q = {q1, q2} be a query with only two query terms and d be a non-query term such that s(q2, d) > 0. Let D1 and D2 be two documents. If |D1| = |D2| > 1, S({q1}, {q1}) = S({q2}, {q2}), c(q1, D1) = |D1|, c(q1, D2) = |D2| − 1 and c(d, D2) = 1, then S(Q, D1) ≤ S(Q, D2).

STMC3 intends to capture the following intuition. Suppose we have a query with two equally important terms q1 and q2, and a document D1 matches q1 n (> 1) times but does not match q2 or any of its semantically related terms. If we change one of the occurrences of q1 in D1 to a term semantically related to q2 to form a document D2, then D2 should not have a lower score than D1, because D2 covers more distinct query terms than D1.

4.2 Extension Based on STMCs

The constraints defined above provide guidance on how to extend the inductively defined axiomatic retrieval functions to incorporate semantic term matching. First, it is clear that the existing axiomatic functions violate all three constraints, simply because the semantic similarity function s is not part of the retrieval function. For example, based on the primitive weighting function shown in Equation (1), any single-term document is assigned a zero score if the term in the document does not exactly match the query term, which clearly violates STMC1. To make the primitive weighting function satisfy STMC1, a natural solution is to define the following generalized primitive weighting function based on a given similarity function s:

$$S(\{q\},\{d\}) = \omega(q) \times f(s(q, d)),$$

where f is a monotonically increasing function. Note that it is reasonable to require $\forall q \in Q$, f(s(q, q)) = 1 for any query Q, because the score of the generalized primitive weighting function should be comparable with the score of the original one when the two terms match exactly. One way to ensure this property is to define f in terms of the normalized similarity:

$$f(s(q, d)) = \frac{s(q, d)}{s(q, q)} \times \beta(q, d), \quad \text{where } \beta(q, d) = \begin{cases} 1 & q = d \\ \beta & q \neq d \end{cases} \qquad (2)$$

β is used to regulate the weighting of the original query terms and the semantically similar terms. The value of β should satisfy $0 < \beta < \frac{s(q,q)}{s(q,d)}$, because f(s(q, d)) < f(s(q, q)) = 1 when d ≠ q.

The generalized primitive weighting function clearly satisfies STMC1, and if we combine it with any existing instantiations of the document growth function and query growth function, the derived retrieval functions also satisfy STMC1 unconditionally. We further analyze STMC2 and STMC3 on such derived functions and find that these constraints are satisfied when β is within a certain range. Specifically, the analysis of STMC2 provides a tighter upper bound for β, while the analysis of STMC3 provides a tighter lower bound. The actual values of these bounds depend on the instantiation of the document growth function. As an example, the lower and upper bounds for F2-EXP are:

$$\frac{b}{2+b} \times \frac{s(q,q)}{s(q,d)} \;\leq\; \beta \;\leq\; \frac{1}{b+1} \times \frac{s(q,q)}{s(q,d)} \qquad (3)$$

We see that the bounds of β depend on both the query and the semantic similarity function s. In our experiments, on each data set, the lower bound of β is determined by the lowest value of $\frac{s(q,q)}{s(q,d)}$ over all the query terms, while the upper bound of β is determined by the highest value of $\frac{s(q,q)}{s(q,d)}$; these are the minimal requirements on β.
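For instance, Equation (3) can be turned into a small helper that reports the admissible range of β for F2-EXP given b and a pair of similarity values. This is a minimal sketch with names of our own choosing:

```python
def beta_range_f2exp(s_qq, s_qd, b=0.5):
    """Admissible range of beta for F2-EXP from Equation (3):
    b/(2+b) * s(q,q)/s(q,d)  <=  beta  <=  1/(b+1) * s(q,q)/s(q,d).
    """
    ratio = s_qq / s_qd
    return (b / (2 + b)) * ratio, (1 / (b + 1)) * ratio

# Example: with b = 0.5 and s(q,q)/s(q,d) = 2,
# beta must lie in [0.4, 1.33].
lo, hi = beta_range_f2exp(s_qq=2.0, s_qd=1.0)
```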
Since a term can potentially have a huge number of semantically related terms, computing the generalized retrieval functions can be expensive. To reduce the computational cost, we can reasonably restrict our attention to the most similar terms for each query term. Such a simplification is not expected to affect the retrieval score significantly, because the dropped terms would contribute little to the score anyway. Thus we redefine the generalized primitive weighting function as follows:

$$S_{gen}(\{q\},\{d\}) = \begin{cases} \omega(q) \cdot \frac{s(q,d)}{s(q,q)} \cdot \beta(q,d) & d \in \Gamma(q) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\Gamma(q)$ is the set of the K terms most semantically similar to q according to the similarity function s, $\omega(q)$ is as in Equation (1), and $\beta(q,d)$ is defined in Equation (2). Even with this simplification, the computation can still potentially involve enumerating all the combinations of query terms and document terms. Fortunately, there is an efficient way to compute such a retrieval function based on query expansion, as shown in the next section.

4.3 As Query Expansion

Let us first introduce some notation. S(Q, D) is the scoring function of the original inductively defined axiomatic retrieval function, where only syntactic term matching is considered. $S_{gen}(Q, D)$ is the generalized inductively defined axiomatic retrieval function obtained by combining the generalized primitive weighting function with the original document growth and query growth functions. The generalized primitive weighting function (i.e., Equation (4)) can be rewritten as follows:

$$S_{gen}(\{q\},\{d\}) = \begin{cases} \omega(q) & d = q \\ \omega(q:d) & d \in \Gamma(q)/\{q\} \\ 0 & \text{otherwise} \end{cases}$$

where $\omega(q:d) = \omega(q) \times \beta \times \frac{s(q,d)}{s(q,q)}$.

Let $\Gamma'(q)$ be the set of the K terms most semantically similar to q excluding q itself, i.e., $\Gamma'(q) = \Gamma(q)/\{q\}$. Let P be the set of the most similar terms of all the query terms, i.e., $P = \bigcup_{q \in Q} \Gamma'(q)$. $\forall t \in P$, let $\theta(t)$ be the set of query terms that are semantically similar to t. Define S' such that $\forall t \in P$, $S'(\{t\},\{t\}) = \omega(\theta(t):t) = \frac{\sum_{u \in \theta(t)} \omega(u:t)}{|Q|}$; otherwise, S'(Q, D) = S(Q, D).

Theorem: $\forall Q, D$, $S_{gen}(Q, D) = S'(Q \cup P, D)$.

Proof:

$$\begin{aligned}
S_{gen}(Q, D) &= \sum_{q \in Q} S_{gen}(q, D) \\
&= \sum_{q \in Q} \Big( S_{gen}(q, D_q) + \sum_{t \in \Gamma'(q) \cap D} S_{gen}(t, D_t) \Big) \\
&= \sum_{q \in Q} \Big( S(q, D_q) + \sum_{t \in \Gamma'(q) \cap D} S'(t, D_t) \Big) \\
&= S(Q, D) + \sum_{q \in Q} \sum_{t \in \Gamma'(q) \cap D} S'(t, D_t) \\
&= S(Q, D) + \sum_{t \in P \cap D} S'(t, D_t) \\
&= S(Q, D) + \sum_{t \in P} S'(t, D) \\
&= S'(Q \cup P, D)
\end{aligned}$$

where $D_t$ is the part of the document D that only contains t (i.e., $|D_t| = c(t, D) = c(t, D_t)$). The first step is based on the query growth function. The second step assumes that the relevance score of a document can be computed as the sum over disjoint subsets of the document, which holds for all the inductively defined axiomatic retrieval functions. The third step is based on the fact that $S_{gen}$ and S' use the same document growth function and the fact that $S'(\{t\},\{t\}) = \omega(\theta(t):t)$ is consistent with the generalized primitive weighting function when $t \in P$.

The theorem shows that scoring a document using $S_{gen}$ can be reduced to scoring with S' over an expanded query formed by adding, for each query term, its K most similar terms to the query. Note that the weight of a similar term t is computed from $\omega(\theta(t):t)$ instead of from $\omega(t)$, as used in traditional query expansion methods.
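To illustrate the reweighting the theorem implies, the following sketch computes $\omega(\theta(t):t)$ for a candidate expansion term. The dictionary-based representations of ω, s, and θ are our own assumptions for the example:

```python
def expansion_weight(t, theta_t, omega, s, beta, n_query_terms):
    """omega(theta(t):t) = (1/|Q|) * sum_{u in theta(t)} omega(u:t),
    with omega(u:t) = omega(u) * beta * s(u,t) / s(u,u)."""
    total = sum(omega[u] * beta * s[(u, t)] / s[(u, u)] for u in theta_t)
    return total / n_query_terms

# Example: a candidate term related to one of two query terms.
omega = {"car": 3.2}
s = {("car", "car"): 1.8, ("car", "vehicle"): 0.9}
w = expansion_weight("vehicle", ["car"], omega, s, beta=0.5, n_query_terms=2)
# w = 3.2 * 0.5 * (0.9 / 1.8) / 2 = 0.4
```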
4.4 Term Semantic Similarity Function

The remaining challenge is to define the similarity function s(t1, t2) used in the STMCs. In general, we may exploit any knowledge and resources available to us to compute term similarity, and there are many ways to compute it. For example, co-occurrences of terms obtained from the analysis of a document collection usually reflect underlying semantic relationships between terms [23, 3, 4, 2], and we may use measures such as the Dice similarity [1] or mutual information [25, 15, 9, 8, 13, 7] to compute term similarity. In this paper, we adopt mutual information as the basic semantic similarity metric, leaving other choices for future work. The mutual information (MI) of two terms t and u in a set of documents can be computed as follows [25]:

$$I(X_t, X_u) = \sum_{X_t, X_u \in \{0, 1\}} p(X_t, X_u) \log \frac{p(X_t, X_u)}{p(X_t)\, p(X_u)}$$

where $X_t$ and $X_u$ are two binary random variables corresponding to the presence or absence of term t and term u in each document or segment. Mutual information is a principled way to measure term correlations, and it satisfies our requirements on the similarity function s.

The next choice we have to make is which corpus to use when computing the mutual information. A natural choice would be the document collection from which we retrieve documents. However, such a choice may not be ideal, because an ambiguous term can have multiple senses in a large corpus. As a result, the semantically related terms found by mutual information could be a mix of terms corresponding to different senses of the original term, introducing noise into query expansion. Thus, it is crucial to compute mutual information over a "clean" corpus, where ideally only one (correct) sense of the query term occurs.

How can we find such a "clean" corpus? One possibility is to use the top-M documents returned by the retrieval system for the query. The rationale is that we can reasonably assume there is only one sense of a query term in the set of relevant documents, and the top-M documents are a reasonable approximation of the set of relevant documents. This is in line with what previous work in query expansion has found: local document analysis tends to be more effective than global document analysis [29]. However, the top-M documents would clearly be a biased corpus, and in this sense, it is not a good corpus for computing mutual information. For example, it is likely that a query term occurs in all the top-M documents. The abundance of a query term would then cause popular terms in the top-M documents to generally have a high mutual information. In particular, a common term (e.g., "can") would have a high mutual information, even if it also occurs in many other documents where the query term does not occur.

To solve this problem, we need to supplement the top-M documents with some additional documents that do not necessarily contain any query term. Thus we randomly choose r × M documents from the collection and combine them with the top-M documents to form a mixed corpus for computing mutual information. Clearly, the choice of r may also affect the mutual information results. How do we choose a good value for r? Once again, constraint analysis can provide some guidance. The following notation will be used in defining the constraints:

- N is the total number of documents in the collection.
- df(t) is the number of documents in the collection that contain t.
- W is the working set containing r × M random documents plus the top M documents returned by the system; since the r × M documents are chosen from the documents ranked below the top M, we clearly have M + M × r ≤ N.
- df(t1, t2|W) is the number of documents in the working set W that contain both t1 and t2.
- df(t|W) is the number of documents in the working set W that contain t.
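Concretely, with documents represented as sets of terms, the MI above reduces to counting document-level co-occurrences. A minimal sketch follows; the set-of-terms representation is our assumption:

```python
import math

def mutual_information(t, u, docs):
    """I(X_t, X_u) over binary presence/absence of t and u in a set of docs."""
    n = len(docs)
    n_t = sum(1 for d in docs if t in d)
    n_u = sum(1 for d in docs if u in d)
    n_tu = sum(1 for d in docs if t in d and u in d)
    # Joint counts for the four presence/absence combinations.
    joint = {(1, 1): n_tu, (1, 0): n_t - n_tu,
             (0, 1): n_u - n_tu, (0, 0): n - n_t - n_u + n_tu}
    p_t = {1: n_t / n, 0: 1 - n_t / n}
    p_u = {1: n_u / n, 0: 1 - n_u / n}
    mi = 0.0
    for (xt, xu), c in joint.items():
        if c > 0:
            p = c / n
            mi += p * math.log(p / (p_t[xt] * p_u[xu]))
    return mi
```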
Intuitively, the value of r should not be very small, because we need a large enough number of random documents to penalize common terms. Consider the scenario in Figure 1(a), where t1 is a "truly" semantically related term while t2 is a common term: t1 is semantically more similar to q than t2, although t2 co-occurs with q in more documents than t1 does. This intuition can be captured by the following Term Semantic Similarity Constraint (TSSC).

TSSC1: Let q be a query term and t1 and t2 be two non-query terms. If df(q, t1|W) = M/2, df(t1|W) = M/2, df(q|W) = M, df(q, t2|W) = M, and df(t2|W) = M + (r × M)/2, then s(q, t1) > s(q, t2).

[Figure 1: TSSC. Panel (a) illustrates TSSC1 and panel (b) illustrates TSSC2; each panel shows how q, t1, and t2 are distributed over the top-M documents and the r × M random documents, with s(q, t1) > s(q, t2) in both cases.]

On the other hand, the value of r should not be very large, because we want to ensure that the dominant sense of a query term is the one determined by the whole query. Consider the scenario in Figure 1(b). Suppose a query term q has two senses. The first sense is the one determined by the whole query (i.e., in the top-M documents), and a term t1 is semantically related to this sense of q (i.e., they co-occur in the top-M documents). Now suppose another term t2 is semantically related to another sense of q (i.e., they co-occur in the random documents). Intuitively, t1 should have a higher similarity score than t2. The following constraint captures this intuition.

TSSC2: Let q be a query term and t1 and t2 be two non-query terms. If 0 < α < 1, df(q, t1|W) = M, df(t1|W) = M, df(q|W) = M + α × r × M, and df(t2|W) = df(q, t2|W) = α × r × M, then s(q, t1) > s(q, t2). Here α is the fraction of documents that contain q in a random sample of the whole collection with the top-M documents excluded, i.e., $\alpha = \frac{df(q) - M}{N - M}$.

The above two constraints are satisfied only when the value of r is within a certain range. Indeed, the TSSCs provide a lower and an upper bound for r.

In summary, our method can be implemented as the following query expansion procedure (a sketch of the whole pipeline is given at the end of this subsection):

1. Construct a working set W of documents for the query.
2. Compute the semantic similarity between each query term and the terms in the working set using mutual information.
3. Gather the top L similar terms for all the query terms, then select the top K ranked terms based on ω(θ(t):t).
4. Expand the original query with the K terms. Note that the weight of an expansion term is computed based on ω(θ(t):t) instead of ω(t).

In the first step, the working set can be constructed over any reasonable resource in the following way: given any collection of documents and a query, we first use the original inductively defined axiomatic retrieval function to rank the documents. We then merge the top-M returned documents with r × M random documents selected from the same collection to form a working set for computing term similarity. The collection to be used can be either the target collection for retrieval (called internal expansion) or any other collection (called external expansion). To form a large pool of terms, L is usually fixed to 1000. Four parameters need to be tuned: the number of expansion terms (K), the number of top documents (M), the number of random documents (controlled by r), and the scaling parameter β. The optimal values of β and r are expected to fall within a certain range based on Equation (3) and Equation (5), which is also supported by our experiment results.
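The following sketch glues the pieces into the four-step procedure. It reuses `mutual_information` from the sketch in Section 4.4; the document representation, random sampling, and tie-breaking details are our own assumptions rather than the paper's exact implementation.

```python
import random

def semantic_expansion(query, ranked_docs, omega, M=20, r=29, L=1000, K=20,
                       beta=0.5, seed=0):
    """Steps 1-4: build the working set, score similarities by MI, pool the
    top-L similar terms per query term, and return the top-K weighted terms."""
    rng = random.Random(seed)
    # Step 1: top-M docs plus r*M docs sampled from those ranked below them.
    below = ranked_docs[M:]
    working = ranked_docs[:M] + rng.sample(below, min(r * M, len(below)))
    # Step 2: mutual information between query terms and working-set terms.
    vocab = set().union(*working) - set(query)
    sim = {(q, t): mutual_information(q, t, working)
           for q in set(query) for t in vocab}
    # Guard against degenerate self-MI (term in all or no working-set docs).
    s_self = {q: max(mutual_information(q, q, working), 1e-9)
              for q in set(query)}
    # Step 3: keep the top-L similar terms per query term, pooled over the query.
    theta = {}
    for q in set(query):
        top = sorted(vocab, key=lambda t: sim[(q, t)], reverse=True)[:L]
        for t in top:
            theta.setdefault(t, set()).add(q)
    # Step 4: weight candidates by omega(theta(t):t) and keep the top K.
    weights = {t: sum(omega[q] * beta * sim[(q, t)] / s_self[q] for q in qs)
                  / len(set(query))
               for t, qs in theta.items()}
    return sorted(weights.items(), key=lambda x: x[1], reverse=True)[:K]
```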
5. EXPERIMENTS

5.1 Experiment Design

We conduct three sets of experiments. First, we evaluate the effectiveness of semantic term matching. Second, we examine the parameter sensitivity of the method. Finally, we compare it with a model-based feedback method in language modeling approaches [30].

All experiments are conducted over two collections used in the recent Robust tracks [28, 27]: (1) TREC Disks 4 & 5 (minus the Congressional Record) with the 249 official topics of the 2004 Robust track; the document set has 1908 MB of text and 528,000 documents, and is labeled "ROBUST04". (2) The AQUAINT data with the 50 official topics of the 2005 Robust track; the document set has 3 GB of text and 1,033,461 documents, and is labeled "ROBUST05". Some experiments are also conducted over six other data sets used in previous studies [5, 6, 31]: news articles (AP88-89), technical reports (DOE), government documents (FR88-89), Web data (WEB2g), and the ad hoc data in TREC (TREC7 and TREC8). In all the experiments, we use the title-only queries, because short keyword queries are the query type most frequently used by Web users, and semantic term matching is particularly necessary for such short queries. Performance is measured using the official measures of the Robust track: MAP (mean average precision) and gMAP (geometric mean average precision). gMAP [27, 28] is a variant of the traditional MAP measure that uses a geometric mean rather than an arithmetic mean; it emphasizes the performance on poorly performing topics. The preprocessing only involves stemming with Porter's stemmer.

As pointed out in previous work [6], using a fixed parameter value (b = 0.5), F2-EXP can often achieve near-optimal performance on many test sets. Thus, we fix b to 0.5 in our experiments. We use the optimal value of b for the other five inductively defined axiomatic retrieval functions. In the first and third sets of experiments, M and K are both fixed to 20 and r is fixed to 29, so that we get a total of 600 documents in the working set.