Fast Exact Maximum Likelihood Estimation for Mixture of Language Models

Yi Zhang
School of Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
yiz@soe.ucsc.edu

Wei Xu
NEC Lab America, 10080 North Wolfe Road, Suite SW3-350, Cupertino, CA, United States
xw@sv.nec-labs.com

ABSTRACT

A common language modeling approach assumes the data D is generated from a mixture of several language models. The EM algorithm is usually used to find the maximum likelihood estimate of one unknown mixture component, given the mixture weights and the other language models. In this paper, we provide an efficient algorithm of O(k) complexity that finds the exact solution, where k is the number of words that occur at least once in D. Another merit is that the probabilities of many words are exactly zero, which means that the mixture language model also serves as a feature selection technique.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Languages
Keywords: language models

1. INTRODUCTION

Mixtures of language models are commonly used in IR applications. Assume a sequence of data D is generated from a language model r, which is a linear combination of two multinomial language models p and q:

r = \alpha p + \beta q    (1)

where \alpha and \beta are interpolation weights that sum to 1. The problem is to find the q_i's that maximize the likelihood of the observed data D for given f_i, p_i, \alpha, and \beta, subject to the constraints \sum_i q_i = 1 and q_i \geq 0, where f_i is the observed frequency of word i in D. The log-likelihood of the data is:

LL = \sum_i f_i \log(r_i) = \sum_i f_i \log(\alpha p_i + \beta q_i)    (2)

This modeling approach has been applied to various information retrieval tasks such as relevance feedback [4] [2] [1], context-sensitive language model smoothing [6], and novelty detection [5]. For example, p is the corpus language model, r is the language model used to generate documents related to a query, and q is a query language model to be learned from relevance feedback.

We can treat the maximum likelihood estimator (MLE) of p as already known; usually it is calculated directly as

\hat{p}_i = P(w_i \mid p) = \frac{df_i}{\sum_j df_j}

where df_i is the number of times word i occurs in the collection. The major task is to find the MLE of q, which is usually obtained using the EM algorithm. Despite its popularity and simplicity, EM has two weaknesses. First, it is a computationally expensive greedy search algorithm. This expense discourages the use of the language modeling approach for the ad-hoc retrieval task in situations where computational efficiency is important, for example, in a large-scale web search engine. Second, EM provides only an approximately optimal result, at increased computational cost, as we discuss later.
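For concreteness, the following C sketch spells out the two quantities just defined: the closed-form background estimate \hat{p}_i = df_i / \sum_j df_j and the log-likelihood (2) evaluated for a candidate q. It is only an illustration, not the released implementation; the function names, array names, and the weights alpha and beta are assumptions made for this example.

```c
#include <math.h>

/* Illustrative sketch only: the closed-form background estimate and the
 * objective in equation (2).  All names (df, f, p, q, alpha, beta) are
 * assumptions made for this example. */

/* p_hat[i] = df_i / sum_j df_j : MLE of the known component p */
static void background_mle(const double *df, int k, double *p_hat)
{
    double total = 0.0;
    for (int i = 0; i < k; i++)
        total += df[i];
    for (int i = 0; i < k; i++)
        p_hat[i] = df[i] / total;
}

/* LL = sum_i f_i log(alpha * p_i + beta * q_i) : the quantity maximized over q */
static double mixture_log_likelihood(const double *f, const double *p,
                                     const double *q, int k,
                                     double alpha, double beta)
{
    double ll = 0.0;
    for (int i = 0; i < k; i++)
        ll += f[i] * log(alpha * p[i] + beta * q[i]);
    return ll;
}
```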
2. EXACT SOLUTION

An exact solution for q and an O(k log(k)) algorithm were presented in [3]. We briefly revisit the exact solution in this section, since that work is not widely known in the IR community and it is the basis of our new algorithm.

For all the q_i such that q_i > 0, apply the Lagrange multiplier method to the Lagrangian L = LL - \mu (\sum_i q_i - 1) and set the derivative with respect to q_i to zero:

\frac{\partial L}{\partial q_i} = \frac{\beta f_i}{\alpha p_i + \beta q_i} - \mu = 0    (3)

q_i = \frac{f_i}{\mu} - \frac{\alpha}{\beta} p_i    (4)

Applying the constraint \sum_i q_i = 1 to all the q_i > 0 gives

\mu = \frac{\sum_{i: q_i > 0} f_i}{1 + \frac{\alpha}{\beta} \sum_{i: q_i > 0} p_i}    (5)

We can prove by contradiction that if the q_i, i = 1, ..., k, maximize the objective function (2), q_2 > 0, and f_1/p_1 \geq f_2/p_2, then q_1 > 0. We refer the reader to [3] for more details. A direct consequence is that all the q_i's greater than 0 correspond to the smallest p_i/f_i, i.e., the largest ratios f_i/p_i. So we can use the following algorithm to find the exact q that maximizes the likelihood of the observed data.

Algorithm 0: Sort the words by f_i/p_i so that f_1/p_1 \geq f_2/p_2 \geq \cdots \geq f_k/p_k, and find t such that

\frac{1 + \frac{\alpha}{\beta} \sum_{j=1}^{t} p_j}{\sum_{j=1}^{t} f_j} \cdot \frac{f_t}{p_t} - \frac{\alpha}{\beta} > 0    (6)

and

\frac{1 + \frac{\alpha}{\beta} \sum_{j=1}^{t+1} p_j}{\sum_{j=1}^{t+1} f_j} \cdot \frac{f_{t+1}}{p_{t+1}} - \frac{\alpha}{\beta} \leq 0    (7)

Then the q_i's are given by

q_i = \begin{cases} \frac{f_i}{\mu} - \frac{\alpha}{\beta} p_i & \text{if } f_i/p_i \geq f_t/p_t \\ 0 & \text{otherwise} \end{cases}    (8)

where \mu is given by

\mu = \frac{\sum_{i=1}^{t} f_i}{1 + \frac{\alpha}{\beta} \sum_{i=1}^{t} p_i}    (9)

3. NEW LINEAR ALGORITHM

The key part of Algorithm 0 is finding the thresholding pair (f_t, p_t). Borrowing ideas from the O(k) algorithm for finding the median, we obtain the following average-case O(k) algorithm for finding (f_t, p_t).

Algorithm 1 Find t
1:  S_p ← 0, S_f ← 0
2:  A ← {1, 2, ..., k}
3:  repeat
4:    select a pivot s from A
5:    L ← ∅, G ← ∅, S'_f ← S_f, S'_p ← S_p
6:    for all i ∈ A do
7:      if f_i/p_i > f_s/p_s then
8:        G ← G ∪ {i}, S'_f ← S'_f + f_i, S'_p ← S'_p + p_i
9:      else if f_i/p_i < f_s/p_s then
10:       L ← L ∪ {i}
11:     else
12:       S'_f ← S'_f + f_i, S'_p ← S'_p + p_i
13:     end if
14:   end for
15:   if (1 + (α/β) S'_p) / S'_f · f_s/p_s − α/β > 0 then
16:     A ← L, S_p ← S'_p, S_f ← S'_f, t ← s
17:   else
18:     A ← G
19:   end if
20: until A = ∅

The general idea of Algorithm 1 is to select a random term/element/word s (Line 4) as a pivot and use it to partition the words into two sets L and G (Line 7 and Line 9). We then decide whether t is in set L or G (Line 15) and recurse on the appropriate subset until we find t (Line 20).
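The following C sketch mirrors this procedure and then recovers q through equations (8) and (9). It is a minimal illustration under the notation above (f and p restricted to the k words that occur in D, with mixture weights alpha and beta), not the released implementation; the function name estimate_q and the memory handling are our own assumptions. The comparison corresponding to Line 7 is written in the multiplication form f_i p_s > f_s p_i, which avoids divisions, as discussed further below.

```c
#include <stdlib.h>

/* Illustrative sketch of Algorithm 1 plus equations (8)-(9); not the
 * released implementation.  f[i] = count of word i in D, p[i] = background
 * probability of word i (for the k words that occur in D), alpha + beta = 1.
 * On return, q[i] holds the exact MLE of the unknown component. */
static void estimate_q(const double *f, const double *p, int k,
                       double alpha, double beta, double *q)
{
    double ab = alpha / beta;            /* the ratio alpha/beta used throughout */
    int *A = malloc(k * sizeof(int));    /* active index set */
    int *L = malloc(k * sizeof(int));    /* indices with smaller ratio than the pivot */
    int *G = malloc(k * sizeof(int));    /* indices with larger ratio than the pivot */
    int n = k;
    double Sf = 0.0, Sp = 0.0;           /* committed prefix sums of f and p */
    double ft = 0.0, pt = 1.0;           /* current threshold pair (f_t, p_t) */

    for (int i = 0; i < k; i++) A[i] = i;

    while (n > 0) {
        int s = A[rand() % n];           /* random pivot (Line 4) */
        double Sf2 = Sf, Sp2 = Sp;       /* tentative sums including the pivot's block */
        int nl = 0, ng = 0;

        for (int j = 0; j < n; j++) {    /* partition the active set (Lines 6-14) */
            int i = A[j];
            if (f[i] * p[s] > f[s] * p[i]) {        /* f_i/p_i > f_s/p_s */
                G[ng++] = i; Sf2 += f[i]; Sp2 += p[i];
            } else if (f[i] * p[s] < f[s] * p[i]) { /* f_i/p_i < f_s/p_s */
                L[nl++] = i;
            } else {                                /* ties, including the pivot itself */
                Sf2 += f[i]; Sp2 += p[i];
            }
        }

        /* condition (6) evaluated at the pivot (Line 15) */
        if ((1.0 + ab * Sp2) / Sf2 * f[s] / p[s] - ab > 0.0) {
            Sf = Sf2; Sp = Sp2;          /* commit and keep searching among smaller ratios */
            ft = f[s]; pt = p[s];
            for (int j = 0; j < nl; j++) A[j] = L[j];
            n = nl;
        } else {
            for (int j = 0; j < ng; j++) A[j] = G[j];
            n = ng;                      /* threshold lies among larger ratios */
        }
    }

    /* equations (9) and (8): the multiplier mu and the final estimates */
    double mu = Sf / (1.0 + ab * Sp);
    for (int i = 0; i < k; i++)
        q[i] = (f[i] * pt >= ft * p[i]) ? f[i] / mu - ab * p[i] : 0.0;

    free(A); free(L); free(G);
}
```

Replacing the random choice of s with a worst-case linear median selection over the ratios f_i/p_i would give the worst-case O(k) variant discussed below.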
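A toy driver, again purely illustrative, shows how the sketch above would be called; the counts, background probabilities, and alpha = 0.5 are made up. For these numbers, the two words with the largest ratios f_i/p_i receive all of the probability mass of q, and the remaining entries are exactly zero.

```c
#include <stdio.h>

/* Uses estimate_q from the sketch above; all numbers are made up. */
int main(void)
{
    double f[4] = { 5.0, 3.0, 1.0, 1.0 };   /* word counts in a 4-word document D */
    double p[4] = { 0.1, 0.2, 0.3, 0.4 };   /* background probabilities of those words */
    double q[4];
    double alpha = 0.5;                     /* assumed weight of the background model */

    estimate_q(f, p, 4, alpha, 1.0 - alpha, q);

    for (int i = 0; i < 4; i++)
        printf("q[%d] = %.4f\n", i, q[i]);  /* zero entries are the words pruned by (8) */
    return 0;
}
```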
On average, Algorithm 1 takes O(log(k)) iterations, and the cost of each iteration is roughly half that of the previous one, so the total work forms a geometric series and the whole process runs in linear time on average. Now we prove that the average computation cost is O(k) using a recursion. Let C_k be the average computation cost for k words; then we have the following recursion:

C_k = k + \frac{1}{k}(C_1 + \cdots + C_{k-1})    (10)

From (10), it can be derived that

C_k = 2k - \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{k}\right)    (11)

Hence the average computation cost is O(k). Here k is the number of words with non-zero frequency in D. Because words not in the data D do not influence the estimation of \mu and are ranked lower than the pivot, we can ignore them in Algorithm 1. This property is nice, because k can be much smaller than the vocabulary size. For example, if D is a document generated by the mixture of a document language model q and a background English language model p, we only need to process the words in the document when estimating the document language model q with Algorithm 1. If we use EM, all words in the whole vocabulary need to be processed.

The algorithm can also be modified into a worst-case O(k) algorithm. This can be achieved by using the s corresponding to the median of the ratios f_i/p_i at Line 4 instead of randomly selecting s.

If implemented appropriately, the division operations can be mostly avoided in this algorithm. For example, Line 7 can be implemented as "if f_i p_s > f_s p_i". As we know, division is a much more time-consuming floating-point operation than multiplication.

Our simulations show that the EM algorithm converges to the results of our exact algorithm, while our algorithm takes about the same time as 1 to 3 EM iterations to finish. Needless to say, the EM algorithm needs many iterations to converge [3]. The source code of our algorithm, written in C, is provided online at www.soe.ucsc.edu/~yiz/papers/sigir07poster.

Another merit of our algorithm is that we know the probabilities of some words in the language model q are exactly zero (Equation (8)), which means that the mixture language model also serves as a feature selection technique.

4. REFERENCES

[1] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval. In SIGIR '04, pages 178-185, New York, NY, USA, 2004. ACM Press.
[2] W. Kraaij, R. Pohlmann, and D. Hiemstra. Twenty-One at TREC-8: using language technology for information retrieval. In TREC 8: Proceedings of the 1999 Text REtrieval Conference. National Institute of Standards and Technology, Special Publication.
[3] Y. Zhang, W. Xu, and J. Callan. Exact maximum likelihood estimation for word mixtures. In Text Learning Workshop in International Conference on Machine Learning, Sydney, Australia, 2002.
[4] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001.
[5] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[6] X. Zhou, X. Hu, X. Zhang, X. Lin, and I.-Y. Song. Context-sensitive semantic smoothing for the language modeling approach to genomic IR. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.