SIGIR 2007 Proceedings Poster

MRF based Approach for Sentence Retrieval

Keke Cai, Chun Chen*, Kangmiao Liu, Jiajun Bu, Peng Huang
College of Computer Science, Zhejiang University
Hangzhou, 310027, China
*Corresponding Author, +86 571 8795 1431
{caikeke, chenc, lkm, bjj, huangp}@zju.edu.cn

ABSTRACT
This poster studies term context dependence in the application of sentence retrieval. Based on Markov Random Field (MRF), three forms of dependence among query terms are considered. Under different assumptions about the term dependence relationship, three feature functions are defined to exploit association features between query terms in a sentence when evaluating sentence relevance. Experimental results demonstrate the effectiveness of the proposed retrieval models in improving sentence retrieval performance.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models.

General Terms
Algorithms, Experimentation

Keywords
Sentence retrieval, term dependency, Markov Random Field.

1. INTRODUCTION
Traditional term-based matching approaches are not effective enough for the task of sentence retrieval. One of the most likely reasons is the limited information contained in each sentence. Term context dependences, which disclose the implicit associations among terms that appear together, can express the special regularity behind query terms. This kind of information is considered quite useful for sentence relevance detection.

In this poster, a novel sentence retrieval model based on term context dependence is proposed. The main idea is to use a Markov Random Field (MRF) [1] to model the dependences among the variables involved in the problem of sentence retrieval. Based on this dependence model, association features between query terms in sentences are used as relevance evidence for sentences. Three forms of context dependence among query terms are simulated in this poster. These assumptions consider term context dependence from different points of view. Accordingly, three feature functions are proposed to evaluate each assumed dependency model by considering terms' co-occurrence and syntactic relationships in sentences. Given the effectiveness of term context dependence in exploring the connections between queries and relevant sentences, better retrieval performance is then expected.

Copyright is held by the author/owner(s).
SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands.
ACM 978-1-59593-597-7/07/0007.

2. MRF BASED SENTENCE RETRIEVAL MODEL
An MRF is constructed over an undirected graph G, in which the nodes represent random variables and the edges represent dependences between those variables. A random object X is a Markov random field if it satisfies:

    P(x_i | x_j, ∀j, j ≠ i) = P(x_i | x_j, ∀j, j ∈ N(i))    (1)

where x_i is the value of the i-th random variable and N(i) denotes the neighbors of the i-th random variable. Neighborhood expresses direct dependence between variables. In our task, the graph G consists of a set of query term variables and sentence variables. Following [3], the relevance score of sentence S with respect to query Q is defined as:

    P(S | Q) =_rank Σ_{c ∈ C(G)} λ_c · f(c)    (2)

where C(G) is the set of cliques in G, f(c) is a real-valued feature function defined over certain cliques, and parameter λ_c is the weight of each particular feature function. Based on the concept of MRF, three configurations of G are considered, following the configurations employed in [3]: the full independence configuration, the sequential dependence configuration and the full dependence configuration (for more detail, please refer to [3]). Accordingly, three feature functions are proposed based on dependency parse trees of sentences [2].
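The linear ranking scheme of formula (2) can be sketched in a few lines of code. The following is a minimal illustration, not the paper's implementation: the toy sentence, the parent-pointer parse representation, the decay parameter ALPHA and the single-term feature used here (a simplified preview of a height-discounted term frequency) are all invented for the example.

```python
from collections import Counter

# Toy dependency parse: token -> parent token (None marks the root).
def height(token, parent):
    """Number of edges from a token up to the root of the parse tree."""
    h = 0
    while parent[token] is not None:
        token = parent[token]
        h += 1
    return h

ALPHA = 0.5  # assumed decay parameter in (0, 1); not specified by the poster

def f_I(term, tokens, parent):
    """Hypothetical single-term feature: (freq / |S|) * ALPHA ** height."""
    freq = Counter(tokens)[term]
    if freq == 0:
        return 0.0
    return (freq / len(tokens)) * ALPHA ** height(term, parent)

def score(query_terms, tokens, parent, lam=1.0):
    """Formula (2) restricted to single-term cliques: sum of lam * f(c)."""
    return sum(lam * f_I(q, tokens, parent) for q in query_terms)

# Toy sentence "cats chase mice", with "chase" as the parse root.
tokens = ["cats", "chase", "mice"]
parent = {"cats": "chase", "chase": None, "mice": "chase"}
s = score(["cats", "mice"], tokens, parent)  # each term: (1/3) * 0.5
```

Each clique contributes its weighted feature value independently, so richer clique types (the multi-term features defined below) extend the same sum without changing the skeleton.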
1) The first feature function involves cliques that contain one query term and the sentence. Such cliques assume independence between query terms, and the function is defined as:

    f_I(c) = P(q_i | S) = (freq(q_i, S) / |S|) · α^height(q_i)    (3)

where freq(q_i, S) is the number of times query term q_i occurs in sentence S (in practice almost always 1), height(q_i) is the height of term q_i in the dependency parse tree of S, |S| is the total number of terms in S, and parameter α, ranged between 0 and 1, controls the influence of the second component on the sentence relevance estimate.

2) The second feature function involves cliques containing two or more query terms that are syntactically connected in the sentence, and is defined as:

    f_SR(c) = P(q_u, ..., q_v | S) = BoolSynR(q_u, ..., q_v) · α^d · β^h'    (4)

In formula (4), parameters α and β are ranged between 0 and 1 and control the influence of each component on sentence relevance identification. BoolSynR(q_u, ..., q_v) is set to 1 if q_u, ..., q_v are syntactically connected in the dependency parse tree of S, and to 0 otherwise. d and h' denote the distance and the height of terms q_u, ..., q_v in the dependency parse tree, defined respectively as:

    d = (1/N) · Σ_{i,j ∈ [u,v]} dist(q_i, q_j)

where dist(q_i, q_j) is the number of linkages between q_i and q_j in the parse tree, N is the number of term pairs among q_u, ..., q_v, and h' = Min(height(q_i)), i ∈ [u, v].

3) The third feature function similarly involves cliques containing two or more query terms. The difference is that these query terms are in syntactic proximity to each other in the dependency parse tree. It is defined as:

    f_PR(c) = P(q_u, ..., q_v | S) = BoolProxR(q_u, ..., q_v) · α^d · β^h''    (5)

where BoolProxR(q_u, ..., q_v) is set to 1 if q_u, ..., q_v are proximate terms in the dependency parse tree of S, and to 0 otherwise. d and the parameters α and β are defined as in formula (4). h'' also represents the height of q_u, ..., q_v in the dependency parse tree, but is defined as:

    h'' = (1/(v − u + 1)) · Σ_{i=u}^{v} height(q_i)

Based on the feature functions defined above, the sentence ranking function is formulated as:

    P(S | Q) =_rank Σ_{c ∈ I} λ_I · f_I(c) + Σ_{c ∈ CD ∪ UD} (λ_SR · f_SR(c) + λ_PR · f_PR(c))    (6)

where I is the set of cliques containing one query term and the sentence, CD is the set of cliques involving the sentence and several query terms that appear contiguously within the query, and UD is the set of cliques whose query terms appear non-contiguously in the query. Parameters λ_I, λ_SR and λ_PR are valued between 0 and 1 and control the influence of each feature function on the relevance estimate.

3. EXPERIMENTS
Our experiments are conducted on the AQUAINT collection using the 100 topics N1-N100. The relevance of retrieved sentences is assessed with the relevance judgments provided by TREC for the Novelty tasks of 2003 and 2004. Only the title portions of these TREC topics are used as experimental queries. Table 1 and Table 2 compare the performance of our proposed retrieval models, i.e., FIDM (full independence among query terms), SDM (dependences among sequential query terms) and FDM (dependences among any query terms), with two traditional retrieval approaches, the TFIDF model (TFIDF) and the KL-divergence model (KLD). The metric P@k is the precision over the top k ranked sentences.

Table 1. Performance P@k on topics N1-N50
Model   P@5    P@10   P@20   P@30   P@50   P@100
TFIDF   0.696  0.668  0.651  0.616  0.569  0.514
KLD     0.612  0.608  0.584  0.534  0.495  0.453
FIDM    0.609  0.586  0.551  0.537  0.509  0.494
SDM     0.742  0.721  0.718  0.694  0.651  0.635
FDM     0.734  0.706  0.681  0.668  0.636  0.597

Table 2. Performance P@k on topics N51-N100
Model   P@5    P@10   P@20   P@30   P@50   P@100
TFIDF   0.424  0.400  0.386  0.399  0.367  0.323
KLD     0.408  0.390  0.387  0.370  0.341  0.302
FIDM    0.361  0.350  0.372  0.366  0.313  0.298
SDM     0.462  0.449  0.427  0.430  0.413  0.397
FDM     0.446  0.420  0.412  0.426  0.407  0.361

As shown in these tables, SDM and FDM produce clear improvements not only over the traditional retrieval models but also over the model with the full independence assumption. This supports the assumption of underlying dependences among query terms. It is also noted that in our experiments SDM performs better than FDM. After statistical analysis, we attribute this to the nature of the queries: the title portions of the TREC topics used as queries are, unlike typical Web queries, well formed. The term dependences assumed by SDM are therefore more precise than those formed by FDM, so under SDM less noisy data is involved in the subsequent processing, which helps the identification of sentence relevance.

4. CONCLUSIONS AND FUTURE WORK
MRFs are often used in statistical machine learning to model joint distributions. In this poster, the joint distribution over query terms and a sentence is modeled by an MRF and used as evidence of the sentence's relevance to the given query. Experimental results show the promising effectiveness of the proposed models in improving sentence retrieval performance. In this poster, only dependency grammar is used to describe the syntactic relationships between terms; more sophisticated syntactic analyses are interesting to explore.

5. REFERENCES
[1] Dobrushin, R.L. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory of Probability and its Applications, 13 (1968), 197-224.
[2] Hays, D.G. Dependency theory: A formalism and some observations. Language, 40 (1964), 511-525.
[3] Metzler, D. and Croft, W.B. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05) (Salvador, Brazil, August 15-19, 2005), ACM Press, New York, NY, 2005, 472-479.
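Appendix: the tree quantities used by the multi-term features of Section 2 — the average pairwise distance d and the minimum height h' of formula (4) — can be illustrated with a short sketch. This is a simplified toy example, not the paper's implementation: the sentence, the parent-pointer parse and the values of α and β are all invented, and BoolSynR is taken to be 1.

```python
from itertools import combinations

# Toy parse of "the man bit dog" style sentence; None marks the root.
parent = {"dog": "bit", "bit": None, "man": "bit", "the": "man"}

def path_to_root(t):
    """List of nodes from t up to and including the root."""
    path = [t]
    while parent[t] is not None:
        t = parent[t]
        path.append(t)
    return path

def height(t):
    return len(path_to_root(t)) - 1

def dist(a, b):
    """Number of linkages (edges) between a and b in the parse tree."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # lowest common ancestor = first shared node on a's path to the root
    lca = next(n for n in pa if n in common)
    return pa.index(lca) + pb.index(lca)

def avg_dist(terms):
    """d in formula (4): mean tree distance over the N term pairs (needs >= 2 terms)."""
    pairs = list(combinations(terms, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

ALPHA, BETA = 0.5, 0.5  # assumed decay parameters in (0, 1)

def f_SR(terms):
    """Formula (4) with BoolSynR = 1: ALPHA ** d * BETA ** h'."""
    d = avg_dist(terms)
    h_min = min(height(t) for t in terms)
    return ALPHA ** d * BETA ** h_min
```

For "dog" and "man", both children of the root "bit", the tree distance is 2 and the minimum height is 1; query terms that sit close together and high in the parse therefore receive a larger feature value, which is the intuition behind formulas (4) and (5).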