Revisiting the Dependence Language Model for Information Retrieval

Loïc Maisonnasse (LIG-UJF, 38041 Grenoble, France, loic.maisonnasse@imag.fr), Eric Gaussier (LIG-UJF, 38041 Grenoble, France, eric.gaussier@imag.fr), Jean-Pierre Chevallet (IPAL-I2R, Singapore 119613, viscjp@i2r.a-star.edu.sg)

ABSTRACT

In this paper, we revisit the dependence language model for information retrieval proposed in [1], and show that this model is deficient from a theoretical point of view. We then propose a new model, well founded theoretically, for integrating dependencies between terms in the language model. This new model is simpler, yet more general, than the one proposed in [1], and yields similar results in our experiments, on both syntactic and semantic dependencies.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Theory

Keywords: Information retrieval, language model, syntactic/semantic dependencies

1. DEPENDENCE LANGUAGE MODELS

[1] introduces a dependence language model (which we will refer to as DLM) for IR which, based on the standard language model ([5]), integrates syntactic dependencies in the computation of document relevance scores. This model relies on a variable L, loosely defined as a "linkage" over query terms, which is generated from a document according to P(L|M_d), where M_d represents a document model. The query is then generated given L and M_d, according to P(Q|L, M_d). In principle, the probability of the query, P(Q|M_d), is to be calculated by summing over all possible linkages L, but, for efficiency reasons, the authors make the standard assumption that these linkages are dominated by a single one, the most probable one: L = argmax_L P(L|Q). P(Q|M_d) is then formulated as:

P(Q \mid M_d) = P(L \mid M_d)\, P(Q \mid L, M_d)    (1)

In the case of a dependency parser, such as the one used in [1], each term has exactly one governor in each linkage L, so that the above quantity can be further decomposed, leading to:

\log P(Q \mid M_d) = \log P(L \mid M_d) + \sum_{i=1}^{n} \log P(q_i \mid M_d) + \sum_{(i,j) \in L} MI(q_i, q_j \mid L, M_d)    (2)

where MI denotes the mutual information, and:

P(L \mid M_d) \propto \prod_{(i,j) \in L} \hat{P}(R \mid q_i, q_j)    (3)

\hat{P}(R \mid q_i, q_j) in the above equation represents the empirical estimate of the probability that concepts q_i and q_j are related through a parse in document d.

As the reader may have noticed, there is a certain ambiguity in the way the linkage L is used in the DLM model, an ambiguity which is due, we believe, to the lack of a clear definition of what a linkage represents. In particular, according to equation 3, the probability P(L|M_d) assumes knowledge of the query terms, so that a linkage represents a set of dependencies over a set of known terms. However, in equation 1, such an interpretation cannot hold, as it would lead one to disregard the term P(Q|L, M_d) (all the query terms being already known in L), a quantity which is nevertheless necessary to derive the final form of the model given in equation 2. This ambiguity in the definition of L might not be important in practice, as it ultimately amounts to relying twice on the contribution of term pairs, which can be counterbalanced with appropriate smoothing. However, it is not completely satisfactory from a theoretical point of view.

Without loss of generality, we assume that a syntactic and/or semantic analysis of a query q can be represented as a graph G_q = (C, E), where C is the set of terms (or concepts) in q, and E is a binary relation from C × C to {0, 1} (E(c_i, c_j) = 1 if c_i and c_j are related, and 0 otherwise).
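To make this representation concrete, the following minimal sketch (in Python) stores such a query graph; the query text, the concepts and the relations below are invented for illustration, and any syntactic or semantic analyzer could play the role assumed here:

    from itertools import combinations

    # Toy graph Gq = (C, E) for a hypothetical query "chest x-ray showing pneumonia":
    # C is the set of concepts, and 'related' lists the pairs an analyzer
    # (dependency parser or UMLS mapping) is assumed to have linked.
    C = {"chest", "x-ray", "pneumonia"}
    related = {("x-ray", "chest"), ("x-ray", "pneumonia")}  # hypothetical analysis output

    def E(ci, cj):
        """Binary relation on C x C: 1 if ci and cj are related, 0 otherwise."""
        return int((ci, cj) in related or (cj, ci) in related)

    for ci, cj in combinations(sorted(C), 2):
        print(ci, cj, E(ci, cj))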
The probability that the graph of query q is generated by the model of document d can be decomposed as¹:

P(G_q \mid M_d) = P(C \mid M_d)\, P(E \mid C, M_d)    (4)

¹ In the DLM model, a query is also implicitly represented as a graph (in fact a dependency parse), as the only linkage used is the most probable one, obtained by a parser trained on the collection.

Assuming that, conditioned on M_d, query terms/concepts are independent of one another (a standard assumption in the language model), and that, conditioned on M_d and C, edges are independent of one another (again a standard assumption, also made in DLM), we can write:

P(C \mid M_d) = \prod_{c_i \in C} P(c_i \mid M_d)    (5)

P(E \mid C, M_d) = \prod_{(i,j)} P(E(q_i, q_j) \mid C, M_d)    (6)

Equation 5 corresponds to the standard language model (potentially applied to concepts), and is similar to the second term on the right-hand side of equation 2. The quantities P(E(q_i, q_j) | C, M_d) of equation 6 can be directly estimated through maximum likelihood. Following standard practice in language modeling, one can furthermore "smooth" this estimate by adding a contribution from the collection. This results in:

P(E(c_i, c_j) = x \mid C, M_d) = (1 - \lambda)\, \frac{x\, D(c_i, c_j, R) + (1 - x)\, D(c_i, c_j, \neg R)}{D(c_i, c_j, R) + D(c_i, c_j, \neg R)} + \lambda\, \frac{x\, C(c_i, c_j, R) + (1 - x)\, C(c_i, c_j, \neg R)}{C(c_i, c_j, R) + C(c_i, c_j, \neg R)}

where D(c_i, c_j, R) (respectively C(c_i, c_j, R)) is the number of times c_i and c_j are linked in the document (respectively in the collection), and D(c_i, c_j, ¬R) (respectively C(c_i, c_j, ¬R)) is the number of times c_i and c_j are observed together in the document (respectively in the collection) without being linked.

The model we have just defined (which we will refer to as GLM, for Graph Language Model) is well motivated from a theoretical point of view, and can be applied to any graphical representation of queries and documents. Furthermore, it relies on only two terms, which are easy to estimate, whereas the DLM model uses three terms, with a somewhat complex estimation of the term P(L|M_d). Lastly, it is easy to see that the GLM model generalizes the bigram model presented in [6].
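The estimation just described can be summarized in a short sketch. This is a minimal illustration, not the authors' implementation: the count dictionaries (doc_tf, D, C_counts, ...), the uninformative 0.5 estimate for unseen pairs and the 1e-12 floor inside the logarithms are assumptions made here for the sake of the example.

    import math

    def p_unigram(c, doc_tf, doc_len, coll_tf, coll_len, lam_u):
        """Smoothed P(c | Md): document estimate mixed with the collection (weight lam_u)."""
        p_doc = doc_tf.get(c, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(c, 0) / coll_len if coll_len else 0.0
        return (1 - lam_u) * p_doc + lam_u * p_coll

    def p_edge(ci, cj, x, D, C_counts, lam_r):
        """Smoothed P(E(ci, cj) = x | C, Md); D and C_counts map a pair to
        (times linked, times co-occurring without a link) in the document / collection."""
        def ratio(counts):
            linked, unlinked = counts.get((ci, cj), (0, 0))
            total = linked + unlinked
            if total == 0:
                return 0.5  # assumed uninformative estimate for unseen pairs
            return (x * linked + (1 - x) * unlinked) / total
        return (1 - lam_r) * ratio(D) + lam_r * ratio(C_counts)

    def log_p_graph(concepts, edges, doc_tf, doc_len, coll_tf, coll_len,
                    D, C_counts, lam_u, lam_r):
        """log P(Gq | Md) as in equations (4)-(6): unigram part plus relation part.
        'edges' maps each considered concept pair (ci, cj) to x = E(ci, cj) in the query graph."""
        score = sum(math.log(p_unigram(c, doc_tf, doc_len, coll_tf, coll_len, lam_u) or 1e-12)
                    for c in concepts)
        for (ci, cj), x in edges.items():
            score += math.log(p_edge(ci, cj, x, D, C_counts, lam_r) or 1e-12)
        return score

Documents would then be ranked by this log-probability, as in the standard language modeling approach.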
We now show how this model behaves experimentally.

2. EXPERIMENTAL VALIDATION

In order to assess the GLM model and compare it with the DLM model, we conducted two series of experiments on the ImageCLEF collection². In the first series, we used MiniPar ([3]) to produce a dependency parse for both queries and documents. In the second series, we derived a semantic graph for queries and documents from UMLS. In the latter case, we replaced all possible instances of concepts by their corresponding concept(s), and retained all the relations between concepts provided in the semantic network associated with UMLS. As the collection in ImageCLEF focuses on pathologies and anatomic diseases, we did not take into account the NCI and PDQ thesauri of UMLS, which focus on cancer. This filtering step is similar to the one proposed in [4]. In all cases, we retained only full words (i.e. words corresponding to nouns, adjectives, verbs and adverbs).

² http://ir.shef.ac.uk/imageclef/. This collection consists of written diagnoses with associated images. We used only the text part of the documents, as it enables the use of the semantic relations present in UMLS and thus allows testing the integration of both syntactic and semantic graphs.

In order to estimate the parameters of our models (namely the smoothing coefficients), we divided the 55 queries available from ImageCLEF 2005 and 2006 into two sets: 25 queries were randomly selected for training, and 30 for testing. Lastly, we retained two measures for evaluation: the mean average precision (MAP) and the precision at 5 documents (P@5). Also note that we use the DLM model as is on semantic relations, even though this use is not theoretically justified.

                  syntactic dependencies            semantic relations
            λ_u   λ_r   training   test       λ_u   λ_r   training   test
MAP   DLM   0.8   0.9   0.194      0.210      0.1   0.9   0.256      0.333
      GLM   0.4   0.9   0.218      0.209      0.1   0.1   0.243      0.337
      LM    0.4    -    0.220      0.209      0.1    -    0.255      0.339
P@5   DLM   0.2   0.9   0.288      0.287      0.1   0.9   0.450      0.440
      GLM   0.2   0.9   0.344      0.300      0.1   0.1   0.433      0.480
      LM    0.2    -    0.336      0.287      0.1    -    0.408      0.453

Table 1: Mean average precision and precision at 5 documents for the DLM, GLM and LM models. Models are trained on 25 queries and evaluated on 30 queries from ImageCLEFmed 2005 and 2006. λ_u and λ_r correspond to the smoothing parameters for the unigrams and the relations (contribution from the collection).

Table 1 shows that, on both the syntactic and the semantic dependencies, the DLM and GLM models perform in a similar way (no significant difference was detected using a Wilcoxon signed rank test at the 0.05 level). On this collection, there is furthermore no significant difference between these two models and the LM model, which does not make use of the relations between terms. This observation agrees with some of the results reported in e.g. [2]. Lastly, it is interesting to note that the semantic indexing we have retained significantly improves the results over the syntactic one, and should be preferred here.

3. CONCLUSION

In this paper, we have shown that the DLM proposed in [1] is theoretically flawed. We have then proposed a new model for integrating dependencies between terms that is (a) well founded theoretically, (b) simpler and (c) more general. Our experimental results suggest that this new, simpler model behaves similarly to the DLM model, and may thus be preferred over it.

4. REFERENCES

[1] J. Gao, J.-Y. Nie, G. Wu, and G. Cao. Dependence language model for information retrieval. In SIGIR, 2004.
[2] C. Lee, G. G. Lee, and M. G. Jang. Dependency structure language model for information retrieval. ETRI Journal, 2006.
[3] D. Lin. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.
[4] Y. Huang, H. J. Lowe, and W. Hersh. A pilot study of contextual UMLS indexing to improve the precision of concept-based representation in XML-structured clinical radiology reports. In AMIA, 2003.
[5] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, 1998.
[6] M. Srikanth and R. Srihari. Biterm language models for document retrieval. In SIGIR, 2002.