Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources

Rong Yan
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
yanrong@cs.cmu.edu

Alexander G. Hauptmann
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
alex+@cs.cmu.edu

ABSTRACT
Combining the output from multiple retrieval sources over the same document collection is of great importance to a number of retrieval tasks such as multimedia retrieval, web retrieval and meta-search. To merge retrieval sources adaptively according to query topics, we propose a series of new approaches called probabilistic latent query analysis (pLQA), which can associate non-identical combination weights with latent classes underlying the query space. Compared with previous query-independent and query-class based combination methods, the proposed approaches have the advantage of being able to discover latent query classes automatically without using prior human knowledge, to assign one query to a mixture of query classes, and to determine the number of query classes under a model selection principle. Experimental results on two retrieval tasks, i.e., multimedia retrieval and meta-search, demonstrate that the proposed methods can uncover sensible latent classes from training data, and can achieve considerable performance gains.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms: Algorithms, Performance, Theory
Keywords: Retrieval, Query Class, Fusion, Learning

1. INTRODUCTION
A number of information retrieval tasks utilize retrieval outputs provided by multiple knowledge sources for the same document collection. For example, web retrieval [7] considers a single web page as a document retrievable based on several types of information, such as titles, anchor texts, main body texts, as well as its linking relations to other pages. Multimedia retrieval [20, 1, 8] attempts to retrieve documents containing information extracted from multiple modalities, such as the speech transcription text, the low-level visual features of the images, and a set of semantic concepts automatically detected in the video clip. Meta-search [15, 10, 11] starts with the ranked lists generated from different search engines on a single document collection. In all scenarios, retrieval systems usually benefit from combining various retrieval sources by exploiting complementary information from them.

Retrieval source combination has been examined by a significant body of previous work. Most current approaches fall into the category of query-independent methods, which adopt the same combination strategy for every query. Well-known examples are the CombSUM/CombMNZ methods [15] in meta-search. However, these methods might have limited effectiveness because optimal combination strategies often vary considerably for different query topics. To address this, recent work has led to query-class dependent combination approaches, which map the query space into one of a set of predefined query classes and learn the best combination strategy for each class from training data. For example, in order to improve web document retrieval, Kang et al. [7] classified queries into three categories, where each category had its specific combination strategy. In multimedia retrieval, query-class combination methods using manually defined query classes [20, 1, 8] have been superior to query-independent combination in a number of standard video search evaluations [16]. Voorhees et al. [17] proposed a query clustering method for distributed IR, which groups the queries based on the number of common documents retrieved and creates centroids by averaging the query vectors inside the clusters. Recent advances in query difficulty classification [21] can also be viewed as quite similar: in these studies, each query is first classified into one of two groups based on retrieval "difficulty" and then coupled with a corresponding merging scheme based on the classification result. Despite recent successes, query-class combination methods still have plenty of room for improvement. One major issue is that query classes usually need to be defined using expert domain knowledge.
This manual design can work well when only a few query classes are needed, but it becomes difficult for tens or hundreds of query classes, where each query in a class has to share similar characteristics and thus a similar combination strategy. If the query classes are not carefully defined, a large learning error from both the query-class assignment and the combination parameter estimation might result. Furthermore, current query-class methods do not allow mixtures of query classes, but at times such a mixture treatment could be helpful. For instance, the query "finding Bill Clinton in front of US flags" should be associated with both a "Person" class and a "Named Object" class rather than only one of these. Finally, determining the number of query classes remains an unanswered problem in these methods due to the nature of manual design. Some previous approaches [17, 8] can discover query classes by using clustering techniques. However, these approaches typically separate the processes of query-class categorization and combination optimization into two sequential steps without being jointly optimized. Moreover, it is not straightforward for general clustering methods to handle mixtures of query classes in a principled manner.

Based on these considerations, it is desirable to develop a data-driven probabilistic combination approach that allows query classes and their corresponding combination parameters to be automatically discovered from the training data itself, rather than handcrafted using human knowledge. Therefore, we propose a new combination approach called probabilistic latent query analysis (pLQA) to merge multiple retrieval sources based on statistical latent-class models. The proposed approaches have advantages over query-independent and query-class combination methods in several ways: (1) they unify combination weight optimization and query-class categorization into a single discriminative learning framework; (2) they are able to automatically discover the latent query classes directly from training data; (3) they can handle mixtures of query classes in one query; and (4) they can determine the number of query classes with a statistical model selection principle.
Experiments are conducted on two retrieval applications, i.e., multimedia retrieval on the TRECVID'02-'05 collections [16] and meta-search on the TREC-8 collection [18]. The results show that the proposed approaches can uncover sensible latent classes from training data, and also demonstrate higher effectiveness in combining multiple retrieval sources.

2. A DISCRIMINATIVE FRAMEWORK FOR RETRIEVAL SOURCE COMBINATION
In this section, we present the discriminative combination framework that all of our proposed approaches are built on. Let us begin by introducing the basic notations and terminology used in this work. The term document refers to the basic unit of retrieval throughout this paper. A query collection Q contains a set of queries {q_1, ..., q_t, ..., q_{M_Q}}, where q_t can consist of a set of keywords, a detailed text description, and possibly image, audio, and video query examples. A search collection D contains a set of documents {d_1, ..., d_j, ..., d_{M_D}}. Let y ∈ {-1, 1} indicate whether document D is relevant or irrelevant to query Q. For each query q and document d, we can generate a bag of ranking features from N retrieval sources, denoted as f_i(d, q). Our goal is to generate an improved ranked list by combining the f_i(d, q).

Conventional relevance-based probabilistic models [4] rank documents by sorting the conditional probability that each document would be judged relevant to the given query, i.e., P(y = 1|D, Q), also denoted as P(y_+|D, Q). Most well-known text retrieval models, such as the BIR model [13], proceed by inverting the position of y and D using Bayes' rule and estimating the generative probabilities of document D in the relevant and irrelevant document sets. However, the underlying model assumptions of these approaches, such as term independence, can be invalid in practice. In contrast, discriminative models can directly model the classification boundary and typically make fewer model assumptions. They have been applied in many domains of text processing such as text classification and information extraction. Moreover, different retrieval sources usually provide heterogeneous types of outputs for combination, including both query-dependent features and query-independent features. Generative models might face difficulties in the manual design of different model distributions for these outputs. Nallapati [12] has shown that in the presence of heterogeneous features, discriminative models are superior to generative models in a home-page finding task. Moreover, since the number of retrieval sources is usually much smaller than the number of text keywords, a discriminative model can estimate the parameters more robustly given the same amount of training data. Taking these factors into account, we decided to utilize discriminative models to combine multiple retrieval sources.

Formally, we model the posterior probability of relevance as a logistic function of a linear combination of ranking features, i.e.,

    P(y_+ | D, Q) = \sigma\Big( \sum_{i=0}^{N} \lambda_i f_i(D, Q) \Big),    (1)

where σ(x) = 1/(1 + e^{-x}) is the standard logistic function and λ_i is the combination parameter for the output of the i-th ranking feature f_i(D, Q). The logistic regression model presented in Eqn (1), a.k.a. the maximum entropy model, summarizes our basic retrieval source combination framework. It naturally provides a probabilistic interpretation for the retrieval outputs. Once the parameters are estimated, documents can be presented to users in descending order of P(y_+|D, Q), or equivalently by the weighted sum of retrieval outputs \sum_{i=0}^{N} \lambda_i f_i(D, Q). The optimization problem can be solved using Newton's method described in [2].
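To make the framework concrete, the sketch below fits query-independent combination weights for Eqn (1) by simple gradient ascent on the log-likelihood and ranks documents by the combined score. It is only a minimal illustration: the paper solves this optimization with Newton's method [2], and the feature matrix, labels, learning rate and regularization used here are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_combination_weights(F, y, lr=0.1, n_iters=500, l2=1e-3):
    """Fit lambda in P(y=+1|D,Q) = sigmoid(sum_i lambda_i * f_i(D,Q)).

    F : (n_pairs, n_sources) matrix of ranking features f_i(d, q)
    y : (n_pairs,) relevance labels in {-1, +1}
    """
    n, d = F.shape
    lam = np.zeros(d)
    for _ in range(n_iters):
        # gradient of sum_j log sigmoid(y_j * lambda^T f_j), minus an L2 penalty
        margins = y * (F @ lam)
        grad = F.T @ (y * sigmoid(-margins)) - l2 * lam
        lam += lr * grad / n
    return lam

def rank_documents(F, lam):
    """Return document indices sorted by the combined score lambda^T f."""
    scores = F @ lam
    return np.argsort(-scores), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.normal(size=(200, 5))          # 5 hypothetical retrieval sources
    y = np.where(F[:, 0] + 0.5 * F[:, 1] + rng.normal(scale=0.5, size=200) > 0, 1, -1)
    lam = fit_combination_weights(F, y)
    order, _ = rank_documents(F, lam)
    print("learned weights:", np.round(lam, 2))
```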
3. PROBABILISTIC LATENT QUERY ANALYSIS
The aforementioned discriminative retrieval framework provides a principled platform to support the task of retrieval source combination. However, in its current form, the model still has a problem in that the combination parameters λ_i are completely independent of the queries. In other words, the model will associate every possible query with the same set of combination parameters no matter what types of information needs users are expressing. To improve upon this, we study a more flexible combination approach called probabilistic latent query analysis (pLQA). It aims to determine the combination weights based on the latent mixing structure of the query space, which can be automatically discovered under a statistical latent class model. In the rest of this section, we first discuss the basic form of the pLQA method and its parameter estimation strategies. To deal with unseen queries outside the training collection, we then extend pLQA to its adaptive version and kernel version.

3.1 Basic pLQA
It would be ideal if we could learn specific combination parameters for every possible query. However, given the virtually infinite number of query topics, it is impractical to learn the combination weights on a per-query basis because we cannot collect enough training data for each individual query. We need a trade-off that balances the difficulty of collecting training data against the ability to capture the idiosyncrasy of the query space.

Figure 1: Graphical model representations for (a) the query-class combination, where the query classes are manually defined, and (b) probabilistic latent query analysis (pLQA), where the query classes are defined as latent variables. The nodes with known values are shaded, while other nodes are unshaded. The plates stand for replicas of the subgraphs, where the number of replicas is given at the corner.

To achieve this, we make the following assumptions in our models: (1) the entire query space can be described by a finite number of query classes, where queries from each class share the same combination function; (2) the query description can be used to indicate which class a query belongs to. Under the first assumption, the basic probabilistic retrieval model expressed in Eqn (1) can be naturally extended to a finite mixture of conditional probabilistic models. Formally, we introduce a multinomial latent query class variable z to indicate which mixture component the combination function is drawn from. Based on the second assumption, the choice of z depends solely on the query Q. Putting all these together, we have the joint probability of relevance y and latent variable z as

    P(y_+, z | Q, D; \mu, \Lambda) = P(z | Q; \mu) \, P(y_+ | Q, D, z; \Lambda),    (2)

where µ is the parameter of the multinomial distributions and Λ collects the combination parameters for the query classes. The mixture component P(y_+|Q, D, z; Λ) corresponds to a single logistic regression model, and the mixing proportion P(z|Q; µ) controls the switches among different classes based on the query-dependent parameters µ_Q. By marginalizing out the hidden variable z, the corresponding mixture model can be written as (M_z is the number of query classes)

    P(y_+ | Q, D; \mu, \Lambda) = \sum_{z=1}^{M_z} P(z | Q; \mu) \, \sigma\Big( \sum_{i=1}^{N} \lambda_{zi} f_i(D, Q) \Big).    (3)

In the following discussion, we refer to the model presented in Eqn (3) as the basic pLQA (BpLQA) model, where each latent query class represents a group of similar queries sharing the same combination weights. Note that when the number of latent variables is reduced to 1, BpLQA degrades to the case where retrieval source combination does not depend on the query, i.e., a query-independent combination approach.
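As a concrete reading of Eqn (3), the following sketch evaluates the BpLQA relevance score of each document for a single query, given mixing proportions P(z|Q) and per-class combination weights. The parameter values are illustrative assumptions, not learned ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bplqa_score(F, mix, Lam):
    """P(y=+1|Q,D) = sum_z P(z|Q) * sigmoid(sum_i lambda_{zi} f_i(D,Q))  (Eqn 3).

    F   : (n_docs, n_sources) ranking features for one query
    mix : (n_classes,) mixing proportions P(z|Q), summing to one
    Lam : (n_classes, n_sources) per-class combination weights lambda_{zi}
    """
    per_class = sigmoid(F @ Lam.T)          # (n_docs, n_classes)
    return per_class @ mix                  # mixture over latent classes

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    F = rng.normal(size=(10, 5))            # 10 documents, 5 retrieval sources
    mix = np.array([0.7, 0.2, 0.1])         # hypothetical P(z|Q) for this query
    Lam = rng.normal(size=(3, 5))           # hypothetical per-class weights
    scores = bplqa_score(F, mix, Lam)
    print("ranking:", np.argsort(-scores))
```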
Figures 1(a) and 1(b) compare the probabilistic graphical model representations of the query-class combination methods and BpLQA, where one of their major differences is the semantics of the mixing proportions P(z|Q; µ). In the query-class method, query classes have to be derived from manually defined generation rules before the learning process. However, query classes in the BpLQA model are expressed as latent variables and thus can be estimated directly from training data, together with the combination weight estimation. Moreover, the two models also differ in how they associate queries with query classes. By analogy, the query-class method can be viewed as a hard version of "query categorization", which associates each query with one single query class, whereas BpLQA can be viewed as a soft version of "query categorization", which leads to a probabilistic membership assignment of queries to latent query classes. Because of these representation differences, BpLQA enjoys the following advantages over the query-class combination method: (1) it can automatically discover the query classes from training data rather than by manual definition; (2) it offers probabilistic semantics for the latent query classes and thus allows mixing multiple query types for a single query; (3) it can discover the number of query types in a principled way; (4) it can address the insufficient-data problem caused by ill-defined query classes that have few positive examples in the training set; and (5) it unifies the combination weight optimization and query class categorization into a single learning framework.

The parameters in BpLQA can be estimated by maximizing its incomplete-data log-likelihood. A typical approach to achieve this is the Expectation-Maximization (EM) algorithm [3], which obtains a local optimum of the log-likelihood by iterating an E-step and an M-step until convergence. The E-step of the BpLQA model computes the expectation of z,

    h_{tj}(z) = P(z | Q_t, D_j) = \frac{ \mu_{zt} \, \sigma\big( y_{tj} \sum_i \lambda_{zi} f_i(Q_t, D_j) \big) }{ \sum_{z'=1}^{M_z} \mu_{z't} \, \sigma\big( y_{tj} \sum_i \lambda_{z'i} f_i(Q_t, D_j) \big) },

where µ_{zt} = P(z|Q_t; µ) is the probability of choosing hidden query class z given query Q_t, and λ_{zi} is the weight for f_i(·) under class z. By optimizing the auxiliary Q-function, we can derive the following M-step update rules,

    \lambda_{z\cdot} = \arg\max_{\lambda_{z\cdot}} \sum_{t} \sum_{j} h_{tj}(z) \, \log \sigma\Big( y_{tj} \sum_{i=1}^{N} \lambda_{zi} f_i(Q_t, D_j) \Big),

    \mu_{zt} = \frac{1}{M_D} \sum_{j=1}^{M_D} P(z | Q_t, D_j).

When the log-likelihood converges to a local optimum, the estimated parameters can be plugged back into the BpLQA model to describe the underlying query space structure and show how the training queries are organized.
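The sketch below spells out one EM iteration for BpLQA following the update rules above: the E-step computes the responsibilities h_tj(z), and the M-step re-fits each class's logistic weights on responsibility-weighted pairs (approximated here by a few gradient steps rather than a full inner optimization) and re-estimates µ_zt as the average responsibility per query. The data shapes, learning rates and random demo data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def em_step_bplqa(F, y, query_ids, mu, Lam, lr=0.05, inner_iters=20):
    """One EM iteration for basic pLQA.

    F         : (n_pairs, n_sources) features f_i(Q_t, D_j)
    y         : (n_pairs,) labels in {-1, +1}
    query_ids : (n_pairs,) index t of the training query for each pair
    mu        : (n_queries, n_classes) mixing proportions P(z|Q_t)
    Lam       : (n_classes, n_sources) per-class weights lambda_{zi}
    """
    n_queries, n_classes = mu.shape
    # E-step: h[j, z] = P(z|Q_t, D_j), proportional to mu_{zt} * sigmoid(y_j * Lam_z . f_j)
    lik = sigmoid(y[:, None] * (F @ Lam.T))            # (n_pairs, n_classes)
    h = mu[query_ids] * lik
    h /= h.sum(axis=1, keepdims=True)

    # M-step (weights): responsibility-weighted logistic regression per class
    for z in range(n_classes):
        for _ in range(inner_iters):
            margins = y * (F @ Lam[z])
            grad = F.T @ (h[:, z] * y * sigmoid(-margins))
            Lam[z] += lr * grad / len(y)

    # M-step (mixing proportions): mu_{zt} = average responsibility over the query's documents
    for t in range(n_queries):
        mask = query_ids == t
        if mask.any():
            mu[t] = h[mask].mean(axis=0)
    return mu, Lam

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    F = rng.normal(size=(300, 5))
    y = rng.choice([-1, 1], size=300)
    qid = rng.integers(0, 10, size=300)                # 10 training queries
    mu = np.full((10, 3), 1.0 / 3)                     # 3 latent classes
    Lam = rng.normal(scale=0.1, size=(3, 5))
    mu, Lam = em_step_bplqa(F, y, qid, mu, Lam)
    print("updated mixing proportions for query 0:", np.round(mu[0], 2))
```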
The choice of the number of mixtures might depend on the amount and the distributional sparseness of the training data. As the number of latent variables grows, the family of decision functions represented by BpLQA becomes less restricted and thus leads to lower bias in the estimated models, but it also suffers from higher variance due to the increased model complexity. Based on this bias-variance trade-off, we can determine the number of query classes that is sufficient to capture the variations in the query space with the given amount of training data. To be more concrete, the number of query classes can be obtained by maximizing the sum of the log-likelihood and a model selection criterion, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and so forth. In our work, we choose BIC [14] as the selection criterion. It quantifies the relative goodness of fit of statistical models by maximizing BIC = 2 l(µ, Λ) - k log(n), where l(µ, Λ) is the log-likelihood, k is the number of parameters to optimize and n is the number of training data.

3.2 Adaptive pLQA
Discovering the underlying structure of the query space is by itself not sufficient to handle retrieval source combination, because a practical combination model should be able to predict combination parameters for unseen queries outside the training collection. Unfortunately, BpLQA cannot easily generalize the multinomial parameters µ to any of these unseen queries, because each parameter µ_{·t} in BpLQA specifically corresponds to the t-th training query. Since it is impossible to enumerate all possible queries in the training set, we need a way to predict the mixing proportions P(z|Q_t; µ) of any unseen query that does not belong to the training collection. To generalize the parameters to new documents, Hofmann [6] suggested a "fold-in" process for the latent class model, which re-learns all training documents together with the new document to generate an updated parameter estimate. However, such a "fold-in" process of plugging in new queries and re-estimating the entire model is not reasonable in our task, because it takes a long time to process and, more importantly, we do not have any relevance judgments for new queries to learn from.

To address this problem, we propose an adaptive approach that parameterizes the mixing proportion P(z|Q_t; µ) using a specific set of features directly extracted from query topics, called query features, which are able to capture important characteristics of users' information needs. Formally, we represent each query as a bag of query features {q_1, ..., q_L}. The mixing proportions P(z|Q; µ) can then be modeled using a soft-max function \frac{1}{Z} \exp\big( \sum_l \mu_{zl} q_l \big), where Z = \sum_{z'} \exp\big( \sum_l \mu_{z'l} q_l \big) is the normalization factor that scales the exponential function to be a probability distribution. By substituting this mixing proportion back into Eqn (3), the previous BpLQA model can be rewritten as

    P(y_+ | Q, D) = \frac{1}{Z} \sum_{z=1}^{M_z} \exp\Big( \sum_{l} \mu_{zl} q_l \Big) \, \sigma\Big( \sum_{i=1}^{N} \lambda_{zi} f_i(Q, D) \Big).    (4)

Note that, since µ_{zl} is associated with each query feature instead of each training query, this modification allows the estimated µ_{zl} to be applied to any unseen query as long as it can be formulated as a vector of query features. In the following discussion, we refer to the model expressed in Eqn (4) as the adaptive pLQA (ApLQA) model.
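For a query represented by a vector of query features, the ApLQA mixing proportions are simply a soft-max over linear scores, as sketched below; the feature values and the parameter matrix are hypothetical.

```python
import numpy as np

def aplqa_mixing_proportions(q, Mu):
    """P(z|Q) = exp(sum_l mu_{zl} q_l) / Z, the soft-max used in Eqn (4).

    q  : (n_query_features,) query-feature vector
    Mu : (n_classes, n_query_features) parameters mu_{zl}
    """
    scores = Mu @ q
    scores -= scores.max()                 # for numerical stability
    w = np.exp(scores)
    return w / w.sum()

if __name__ == "__main__":
    # hypothetical binary query features, e.g. [has person name, has sports term, ...]
    q = np.array([1.0, 0.0, 1.0, 0.0])
    Mu = np.array([[ 1.5, -0.5, 0.2, 0.0],
                   [-1.0,  2.0, 0.1, 0.3],
                   [ 0.0,  0.0, 0.0, 0.0]])
    print("P(z|Q) =", np.round(aplqa_mixing_proportions(q, Mu), 3))
```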
For the ApLQA model, the EM algorithm can be derived similarly, except that the mixing proportions are substituted with the soft-max form above. Therefore, the E-step computes the posterior probability of the latent variable z given Q_t and D_j as follows,

    h_{tj}(z) = \frac{ \exp\big( \sum_l \mu_{zl} q_{tl} \big) \, \sigma\big( y_{tj} \sum_i \lambda_{zi} f_i(Q_t, D_j) \big) }{ \sum_{z'} \exp\big( \sum_l \mu_{z'l} q_{tl} \big) \, \sigma\big( y_{tj} \sum_i \lambda_{z'i} f_i(Q_t, D_j) \big) }.

In the M-step, we have the same update rule for λ_{zi}. For updating µ, we have

    \mu = \arg\max_{\mu} \sum_{z} \sum_{t} \sum_{j} h_{tj}(z) \, \log \frac{1}{Z_t} \exp\Big( \sum_{l=1}^{L} \mu_{zl} q_{tl} \Big),

where Z_t is the normalization factor for query Q_t, i.e., Z_t = \sum_{z'} \exp\big( \sum_l \mu_{z'l} q_{tl} \big). This M-step can be optimized by any gradient descent method. Note that this step is actually fitting a multi-class logistic regression model with query features as inputs. In more detail, it looks for the logistic regression model that best matches the posterior probabilities of z for query Q_t.

The only remaining issue in defining the ApLQA model is to design a set of predictive query features. There are two useful principles to guide this design: 1) the features should be automatically generated from query descriptions or from the statistics of the ranking features, and 2) they should be predictive of which latent classes the query belongs to. For example, we can consider the presence/absence of specific person names in query topics, and the mean retrieval score of each retrieval source, as query features. Note that, in contrast to creating query classes, which must be mutually exclusive, defining query features is much more flexible, eliminating the need to partition the query space into non-overlapping regions. Moreover, the number of query features can be much larger than the number of query classes for the same amount of training data.

3.3 Kernel pLQA
By introducing explicit query features into the combination function, ApLQA can handle unseen queries that do not appear in the training data. However, the assumption of a linear query feature combination in ApLQA might not be the best choice in general, because the number of query features is often limited. Also, there is useful query information that cannot be described by an explicit query feature representation. For example, the edit distance between two queries is a helpful hint for combination, but it cannot be easily represented as an explicit query feature. Therefore, we develop an extension of the ApLQA model called the kernel pLQA (KpLQA) model, which lends itself to the use of an implicit feature space via Mercer kernels based on the representer theorem [9]. This extension is also motivated by a significant body of recent work demonstrating that kernel methods are effective in a variety of applications. In more detail, the kernel representation allows simple learning algorithms to construct a complex decision boundary by projecting the original input space into a high-dimensional feature space, even an infinite-dimensional one in some cases. This seemingly computationally intensive task can be easily achieved through a positive definite reproducing kernel K and the well-known "kernel trick".

To begin, let us rewrite the sum \sum_l \mu_{zl} q_l in Eqn (4) as an arbitrary function f_z(Q) of the query Q,

    P(y_+ | Q, D) \propto \sum_{z} \exp\big( f_z(Q) \big) \, \sigma\Big( \sum_{i=1}^{N} \lambda_{zi} f_i(Q, D) \Big).    (5)

Since the loss function above only depends on the value of f_z at the data points, according to the representer theorem the minimizer admits a representation of the form f_z(Q) = \sum_{k=1}^{M_Q} \alpha_{zk} K(Q, Q_k), where {Q_k} is the set of training queries, α_{zk} is the kernel fusion parameter for class z and query Q_k, and K(·,·) is a Mercer kernel on the query space, which has to be positive definite.
With this kernel representation, we can derive the corresponding log-likelihood function by substituting the original mixing proportion term P(z|Q_t) with

    P(z | Q_t) = \frac{1}{Z} \exp\Big( \sum_{k=1}^{M_Q} \alpha_{zk} K(Q_t, Q_k) \Big).

ApLQA is a special case of KpLQA in which each element of K is chosen to be the inner product between the query features of two queries. However, the flexibility of kernel selection gives the KpLQA model more power. For example, the kernel function can take different forms, such as the polynomial kernel K(u, v) = (u · v + 1)^p and the Radial Basis Function (RBF) kernel K(u, v) = exp(-γ ||u - v||^2). The latter is able to project the query features into an infinite-dimensional feature space. Moreover, we can bring a distance metric between queries (e.g., the edit distance) into the implicit feature space in the form of a Mercer kernel, instead of designing explicit features for each query. The optimization of KpLQA is similar to that of ApLQA, except that the mixing proportion term in the E-step and M-step is replaced by the kernelized one.
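The following sketch computes the kernelized mixing proportions of KpLQA for an unseen query: an RBF kernel is evaluated against the query features of the training queries and pushed through the same soft-max form. The kernel width, the fusion parameters and the query features are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(u, v, gamma=0.01):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    diff = u - v
    return np.exp(-gamma * np.dot(diff, diff))

def kplqa_mixing_proportions(q_new, train_queries, Alpha, gamma=0.01):
    """P(z|Q) proportional to exp(sum_k alpha_{zk} K(Q, Q_k)) over training queries Q_k.

    q_new         : (n_query_features,) features of the unseen query
    train_queries : (n_train, n_query_features) query features of the training queries
    Alpha         : (n_classes, n_train) kernel fusion parameters alpha_{zk}
    """
    k_vec = np.array([rbf_kernel(q_new, qk, gamma) for qk in train_queries])
    scores = Alpha @ k_vec
    scores -= scores.max()                 # for numerical stability
    w = np.exp(scores)
    return w / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    train_queries = rng.integers(0, 2, size=(8, 6)).astype(float)  # 8 training queries
    Alpha = rng.normal(scale=0.5, size=(3, 8))                     # 3 latent classes
    q_new = rng.integers(0, 2, size=6).astype(float)
    print("P(z|Q_new) =", np.round(kplqa_mixing_proportions(q_new, train_queries, Alpha), 3))
```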
4. EXPERIMENTS
In this section, we present the experimental results on two retrieval applications based on retrieval source combination, i.e., multimedia retrieval and meta-search.

4.1 Application I: Multimedia Retrieval
Our experiments are designed based on the guidelines of the manual retrieval task in the TREC video retrieval evaluation (TRECVID), which requires an automatic video retrieval system to search for relevant documents without any human feedback. In this task, the retrieval units are video shots defined by a common shot boundary reference. The proposed combination algorithms are evaluated using the queries and the video collections officially provided by TRECVID'02-'05 [16]. For TRECVID'03-'05, each of these video collections is split into a development set and a search set chronologically by source. The development sets are used as the training pool to develop automatic multimedia retrieval algorithms, and the search sets mainly serve as the testbeds for evaluating the performance of retrieval systems. For each query topic, the relevance judgment on the search sets was provided officially by NIST. The relevance judgment on the development sets was collaboratively collected by several human annotators using the Informedia client [5]. Moreover, we manually designed 40 additional queries and collected the ground truth on the TREC'04 development set in order to demonstrate that ApLQA is able to learn from arbitrary queries that do not belong to the testing set (these queries bring some, but insignificant, performance improvement in our retrieval experiments, as can be verified by comparing the performance of BpLQA and ApLQA). We couple them with the TREC'04-'05 query topics to construct an external training corpus with 88 queries. Table 1 lists the labels of the video collections and their query/document statistics.

Table 1: Labels of the video collections and statistics. t*d indicates a development set and t*s a search set, where the embedded number is the year; t04dx is the external development set.

    Data Set   Query Num   Doc Num
    t02s       25          24263
    t03d       25          47531
    t03s       25          75850
    t04d       24          124097
    t04s       24          48818
    t05d       24          74532
    t05s       24          77979
    t04dx      88          124097

As building blocks for multimedia retrieval, we generated a number of ranking features for each video document, including 14 high-level semantic features learned from development data (face, anchor, commercial, studio, graphics, weather, sports, outdoor, person, crowd, road, car, building, motion) and 5 uni-modal retrieval experts (text retrieval, face recognition, and image-based retrieval based on color, texture and edge histograms). Detailed descriptions of the feature generation can be found in [5]. In an attempt to incorporate the ranking information into the learning process, we weighted the positive data more strongly to balance the positive/negative data distribution and shifted the median value of each feature to zero [19]. To initialize the EM algorithm in pLQA, we set the mixing parameters µ_z to random numbers and estimated the combination parameters Λ_z from M_z individual queries, which are automatically selected using a simple heuristic similar to maximal marginal relevance. In order to implement the ApLQA and KpLQA models, we also designed the following binary query features, indicating whether the query topic contains 1) specific person names, 2) specific object names, 3) more than two noun phrases, 4) words related to people/crowds, 5) words related to sports, 6) words related to vehicles, 7) words related to motion, 8) similar image examples w.r.t. color or texture, 9) image examples with faces, and finally 10) whether the text retrieval module finds more than 100 documents. All these query features can be automatically detected from the query description through manually defined rules plus natural language processing and image processing techniques. For example, the first three query features can be obtained by named entity extraction, NP tagging and shallow parsing, the details of which can be found in our previous work [20]; that work also provides evidence that these query features can be predicted with an accuracy above 93%, which is sufficient to support the subsequent retrieval process. The features of sports, vehicle and people can be detected using a manually defined vocabulary with a limited set of words. The eighth and ninth query features can be automatically generated by existing image processing and face detection techniques. Finally, the last query feature can be obtained directly from text retrieval statistics. Note that our current implementation requires some query features to be generated from manually defined rules. It is an interesting direction to explore fully automatic approaches to extract query features, but we leave this as future work.
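As an illustration of how such binary query features might be produced, the sketch below derives a few of them from a query description with simple keyword rules. The vocabularies, the crude capitalization-based person-name test and the feature names are hypothetical stand-ins for the manually defined rules, named-entity extraction and image analysis actually used; they are not the paper's implementation.

```python
import re

# hypothetical vocabularies standing in for the manually defined word lists
SPORTS_WORDS = {"basketball", "soccer", "tennis", "goal", "court"}
VEHICLE_WORDS = {"car", "cars", "tank", "tanks", "vehicle", "vehicles", "ship", "boat"}
PEOPLE_WORDS = {"people", "crowd", "person", "players"}

def extract_query_features(query_text, num_text_hits=0):
    """Return a small dict of binary query features from the query description."""
    tokens = set(re.findall(r"[a-z]+", query_text.lower()))
    capitalized = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", query_text)
    return {
        "has_person_name": int(len(capitalized) > 0),   # crude named-entity proxy
        "sports_related": int(bool(tokens & SPORTS_WORDS)),
        "vehicle_related": int(bool(tokens & VEHICLE_WORDS)),
        "people_related": int(bool(tokens & PEOPLE_WORDS)),
        "many_text_hits": int(num_text_hits > 100),
    }

if __name__ == "__main__":
    print(extract_query_features("Find shots of Tony Blair", num_text_hits=250))
    print(extract_query_features("a goal being made in a soccer game", num_text_hits=40))
```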
4.1.1 Illustration of latent query classes
To illustrate the ability of pLQA to discover meaningful latent query classes, Figure 2 shows five latent query classes that are automatically learned by the BpLQA model on the TREC'05 development data. In this figure, all of the TREC'05 queries are listed except those without any training data in the development set. The blocks on the right show the scale of the mixing proportions p(z|Q) of each query Q w.r.t. each query class z. These queries are organized into five groups based on the class IDs of their maximal mixing proportion over all query classes, i.e., arg max_z p(z|Q).

Figure 2: We organize the TREC'05 queries into five groups (QP) based on which query class (QC) has the highest mixing proportion. The left two columns show group IDs and queries. The other columns indicate the query's mixing proportions w.r.t. the five latent classes discovered by BpLQA, where darker blocks mean higher values.

It is interesting to examine whether the query groupings suggested by BpLQA are sensible. Among these five groups, the first one mainly contains all the "named person" queries in the training data and one query related to person appearance, "meeting with a large table". This group of queries usually achieves high retrieval performance from the text features and prefers the existence of person faces, while content-based image retrieval is not effective for it. The second group consists of three queries related to sports events, including "basketball", "soccer" and "tennis". These queries often rely on both text retrieval and image retrieval results, because such sports scenes in news broadcasts usually share common transcript words and image backgrounds. The last three groups all contain queries related to object finding; however, they differ considerably in their combination strategies. In the third group, the queries tend to search for objects that have similar visual appearances without any apparent motion. The case of the fourth group is quite different: these queries mainly look for objects in outdoor scenes, such as "road" and "military vehicle". Finally, the fifth group appears to be a general group that contains all remaining queries. The queries in this group search for objects without visual similarities and thus place a high weight on text retrieval, since text retrieval is usually the most reliable retrieval component in general.

As can be seen, most of the discovered latent query classes are consistent with the observations made in many previous studies [20, 1], such as the classes of named persons, objects and sports. Meanwhile, BpLQA is also able to suggest query classes that have not been proposed before, such as the class of "outdoor objects". This analysis shows that BpLQA can achieve a reasonable query classification through a completely data-driven statistical learning approach instead of a manual handcrafting procedure. Finally, it is also instructive to analyze the query mixing patterns learned by BpLQA. For example, the query "Omar Karami" can be described by a mixture of the first and the second query classes. It is not obvious at first sight why the second query class is related, but after further analysis we found that Karami always showed up against a similar background when he was meeting foreigners, and hence visual appearance turned out to be a helpful clue for finding him. Such a mixture treatment provides more flexibility in retrieval modeling and offers a deeper understanding of the queries.
4.1.2 Retrieval Results
Next we present the multimedia retrieval results on all the search sets, using the proposed models and parameters learned from the development sets. Since BpLQA cannot generalize its parameters to unseen queries, we use the development set from the same year as the search set to estimate its parameters. For example, the parameters estimated on the set t03d are reused on the search set t03s. For ApLQA and KpLQA there is no such constraint, and therefore their parameters are estimated on the larger training set t04dx.

As discussed before, we can obtain the number of query classes by optimizing the regularized log-likelihood with an additional BIC term. Figure 3 plots the curves of the negative log-likelihood and their regularized counterparts on three development sets (i.e., t03d, t04d and t05d) against the number of query classes. It can be observed that as the number of query classes grows, the learning curves consistently decrease until they asymptotically reach saturated levels. Occasionally the curves even rise slightly at the end, indicating that the training data are overfitted when the number of query classes is large. Based on the statistical model selection principle, the number of query classes for each training set can be determined by seeking the lowest point of the regularized log-likelihood curve. Given the learning curves, the optimal numbers of latent classes turn out to be 4, 4 and 6 for the collections t03d, t04d and t05d, respectively.

Figure 3: The negative training log-likelihood of BpLQA against the number of query classes. The dashed lines indicate the original log-likelihood and the solid lines indicate the regularized log-likelihood with BIC.

Table 2(a) lists a detailed comparison between BpLQA with the estimated number of query classes and several baseline methods, including text retrieval (Text), query-independent combination (QInd) and query-class combination (QClass) with the five classes defined in [20]. QClass has been shown to be one of the best overall retrieval systems in past TRECVID evaluations. The parameters of all baseline methods were learned using the same training sets as BpLQA. All results are reported in terms of the mean average precision (MAP) up to 1000 documents, precision at the top 30 and top 100 documents, and recall at the top 1000 documents.
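For reference, the sketch below computes the evaluation measures used in this section, i.e., average precision truncated at 1000 documents (MAP is its mean over queries), precision at a cutoff, and recall at 1000 documents, from a ranked list of document ids and a set of relevant ids. It is a generic implementation rather than the official TREC/TRECVID scorer, and the example data are made up.

```python
def average_precision(ranked_ids, relevant, cutoff=1000):
    """Average precision of a ranked list, truncated at `cutoff` documents."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def precision_at(ranked_ids, relevant, k):
    return sum(d in relevant for d in ranked_ids[:k]) / k

def recall_at(ranked_ids, relevant, k=1000):
    return sum(d in relevant for d in ranked_ids[:k]) / max(len(relevant), 1)

if __name__ == "__main__":
    ranked = ["d3", "d7", "d1", "d9", "d2"]
    relevant = {"d1", "d2", "d5"}
    print(round(average_precision(ranked, relevant), 3),
          round(precision_at(ranked, relevant, 3), 3),
          round(recall_at(ranked, relevant), 3))
```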
To analyze the results in more detail, we also grouped the queries in each collection and report their MAPs in five different categories, i.e., named person, special object, general object, sports and general queries. As can be observed from the results, QClass is usually superior to both QInd and Text. On average, it brings a roughly 3% absolute improvement (or 30% relative improvement) over text retrieval. Compared with QClass, BpLQA achieves another 2% boost in terms of MAP without any manual tuning of the query class definitions. As a result, the differences between BpLQA and Text/QInd become statistically significant in two out of three collections. The major performance growth factor for BpLQA can be traced to a higher precision on the top documents, although the recalls of all these methods do not vary much. By comparing MAPs with respect to each query type, we find that BpLQA benefits most on the Person and Special Object type queries, as well as the General Object type in t05s. This is because these query types have clearly defined information needs and thus can be improved by making better use of the training data.

Table 2: Comparison of retrieval performance across combination methods. The numbers in brackets show the MAP improvement over Text. Bold text indicates the highest performance for each criterion. * means statistical significance over Text with p-value < 0.01 (sign tests) and ** means significance over both Text and QInd. See text for details.

(a)
    Data  Method    MAP           P30    P100   R1k    Person  SObj   GObj   Sport  Other
    t03s  Text      0.146(+0%)    0.171  0.118  0.477  0.371   0.230  0.068  0.031  0.007
    t03s  QInd      0.150(+2%)    0.212  0.136  0.520  0.251   0.293  0.095  0.101  0.012
    t03s  QClass*   0.173(+19%)   0.207  0.129  0.531  0.371   0.307  0.091  0.088  0.009
    t03s  BpLQA**   0.205(+40%)   0.241  0.145  0.565  0.450   0.361  0.102  0.100  0.011
    t04s  Text      0.078(+0%)    0.178  0.107  0.361  0.188   0.012  0.033  0.046  0.044
    t04s  QInd      0.079(+0%)    0.177  0.116  0.375  0.144   0.063  0.034  0.108  0.051
    t04s  QClass    0.087(+11%)   0.186  0.115  0.380  0.194   0.080  0.030  0.090  0.044
    t04s  BpLQA     0.104(+33%)   0.201  0.124  0.379  0.248   0.084  0.035  0.118  0.047
    t05s  Text      0.073(+0%)    0.207  0.175  0.339  0.141   0.015  0.097  0.075  0.016
    t05s  QInd*     0.105(+44%)   0.268  0.205  0.387  0.164   0.029  0.090  0.271  0.017
    t05s  QClass*   0.108(+47%)   0.261  0.200  0.396  0.171   0.032  0.061  0.309  0.018
    t05s  BpLQA**   0.128(+75%)   0.307  0.212  0.388  0.212   0.039  0.100  0.316  0.018

(b)
    Data  Method      MAP           P30    P100   R1k    Person  SObj   GObj   Sport  Other
    t02s  Text        0.098(+0%)    0.113  0.076  0.421  0.124   0.183  0.079  0.023  0.011
    t02s  QClass-x    0.132(+34%)   0.131  0.086  0.425  0.316   0.207  0.075  0.004  0.013
    t02s  ApLQA       0.139(+41%)   0.136  0.084  0.454  0.364   0.194  0.078  0.014  0.017
    t02s  KpLQA-R     0.140(+42%)   0.139  0.084  0.450  0.365   0.201  0.075  0.014  0.017
    t02s  KpLQA-P     0.144(+46%)   0.139  0.083  0.462  0.388   0.206  0.070  0.023  0.019
    t03s  Text        0.146(+0%)    0.171  0.118  0.477  0.371   0.230  0.068  0.031  0.007
    t03s  QClass-x*   0.200(+37%)   0.236  0.137  0.542  0.407   0.336  0.088  0.106  0.015
    t03s  ApLQA*      0.210(+44%)   0.249  0.144  0.562  0.463   0.358  0.106  0.103  0.017
    t03s  KpLQA-R*    0.207(+42%)   0.248  0.143  0.565  0.469   0.340  0.107  0.100  0.015
    t03s  KpLQA-P*    0.206(+41%)   0.249  0.144  0.544  0.467   0.347  0.097  0.107  0.018
    t04s  Text        0.078(+0%)    0.178  0.107  0.361  0.188   0.012  0.033  0.046  0.044
    t04s  QClass-x    0.094(+20%)   0.199  0.125  0.381  0.194   0.080  0.046  0.108  0.045
    t04s  ApLQA*      0.110(+41%)   0.220  0.119  0.381  0.255   0.053  0.044  0.110  0.051
    t04s  KpLQA-R*    0.111(+42%)   0.212  0.120  0.379  0.262   0.067  0.040  0.108  0.050
    t04s  KpLQA-P     0.109(+40%)   0.213  0.128  0.380  0.255   0.030  0.037  0.116  0.055
    t05s  Text        0.073(+0%)    0.207  0.175  0.339  0.141   0.015  0.097  0.075  0.016
    t05s  QClass-x*   0.116(+58%)   0.292  0.211  0.402  0.173   0.031  0.100  0.322  0.017
    t05s  ApLQA*      0.129(+77%)   0.294  0.212  0.397  0.209   0.042  0.104  0.328  0.017
    t05s  KpLQA-R*    0.129(+77%)   0.307  0.221  0.396  0.207   0.040  0.108  0.327  0.018
    t05s  KpLQA-P*    0.130(+78%)   0.290  0.214  0.390  0.211   0.042  0.100  0.331  0.018
Table 2(b) compares text retrieval and query-class combination (QClass-x) with ApLQA, KpLQA using the RBF kernel with γ = 0.01 (KpLQA-R), and KpLQA using the polynomial kernel with p = 3 (KpLQA-P). All parameters are estimated from the external training set t04dx. Each pLQA model is learned with six query classes, based on the regularized likelihood estimation. The experimental settings are similar to those presented above, except that results on t02s are evaluated in addition. Note that QClass-x in this table is learned on a larger training set and thus outperforms its counterpart in Table 2(a), which shows the importance of sufficient training data. Despite this change, both ApLQA and KpLQA can still outperform QClass-x by a margin of 1-2% absolute (10-20% relative) w.r.t. MAP. Similarly, the increase in precision is the main advantage brought by the pLQA models. Among these models, KpLQA shows some advantages over the ApLQA model in 3 out of 4 collections, although the difference between them is not significant. However, we believe KpLQA has more room for improvement because it can flexibly incorporate distance-metric-type features, and we will explore this choice in the future.

4.2 Application II: Meta-Search
Our next series of experiments is designed around the meta-search application, which combines multiple search engines on a single text collection. The TREC-8 collection [18] is used as our testbed; it contains 50 query topics and around 2 GB worth of documents. Each topic consists of both a short topic title and a long topic description. From the submitted outputs provided by all the participants, we extracted the top five manual retrieval systems and the top five automatic retrieval systems as inputs to the meta-search system. Their system codes are READWARE2, orcl99man, 8manex, CL99XTopt, iit99ma1, pir9Attd, att99atde, ibms99a, ok8amxc, and fub99td, respectively. Each system returns at most 1000 documents for each query. The relevance judgments were officially provided by NIST using a pooling method. We adopt the sum normalization scheme [11], which shifts the minimum score of each submitted run to zero and normalizes the scores to sum to one.

We also extracted the following query features for learning the ApLQA model: the length of the query title, the appearance of named entities in the query, and the score ratio between the first-ranked document and the 50th-ranked document for each of the ten systems. Together with a constant term, this yields a total of 13 query features for each query. These features are designed in an attempt to capture information about query difficulty and generality.
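A minimal sketch of the sum normalization scheme [11] described above: for each submitted run, the scores are shifted so the minimum is zero and scaled so they sum to one before being used as ranking features for combination. The example scores are made up.

```python
import numpy as np

def sum_normalize(scores):
    """Shift the minimum score to zero and scale the run so its scores sum to one."""
    s = np.asarray(scores, dtype=float)
    s = s - s.min()
    total = s.sum()
    return s / total if total > 0 else np.full_like(s, 1.0 / len(s))

if __name__ == "__main__":
    run = [12.4, 9.1, 7.7, 3.2]             # hypothetical retrieval scores from one run
    print(np.round(sum_normalize(run), 3))  # normalized scores used as ranking features
```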
In the following, we examine the performance of various meta-search algorithms that combine all of the retrieval systems. The retrieval performance is averaged over the last 25 queries in terms of MAP, precision at the top 30 and top 100 documents, and recall at the top 1000 documents. We evaluated the best underlying retrieval system (BOU) in addition to four meta-search strategies: CombSUM, CombMNZ, query-independent combination (QInd) and ApLQA. For the algorithms that require parameter estimation (QInd and ApLQA), we use the first 25 queries as training data. As shown in Table 3, CombSUM and CombMNZ improve upon BOU by a margin of around 10% MAP. Between them, the performance of CombMNZ is slightly worse than that of CombSUM. With the aid of the training set, QInd, which uses flexible weights, is superior to CombSUM, which fixes equal weights for every retrieval system. Finally, by introducing the query features and allowing the combination weights to vary across queries, ApLQA offers an additional 2% improvement over QInd w.r.t. MAP. This advantage mainly comes from the extra combination flexibility provided by ApLQA.

Table 3: Comparison of meta-search performance across different approaches on TREC-8.

    Method     MAP           P30    P100   R1k
    BOU        0.465(+0%)    0.587  0.414  0.645
    CombSUM    0.565(+21%)   0.603  0.420  0.932
    CombMNZ    0.536(+15%)   0.603  0.396  0.914
    QInd       0.587(+26%)   0.609  0.441  0.914
    ApLQA      0.608(+30%)   0.605  0.450  0.942

5. CONCLUSION
In this paper, we propose a series of new combination approaches called probabilistic latent query analysis (pLQA) to merge multiple retrieval sources, which unify combination weight optimization and query class categorization into a discriminative learning framework. Three pLQA models have been discussed, which evolve from a basic version (BpLQA) to an adaptive version (ApLQA) that operates on the query feature space and a kernel version (KpLQA) that builds on a Mercer kernel representation. In contrast to the typical query-independent and query-class combination methods, pLQA can automatically discover latent query classes from the training data rather than relying on manually defined query classes. Also, it can associate one query with a mixture of query classes and thus non-identical combination weights. Finally, based on statistical model selection principles, we can obtain the optimal number of query classes by maximizing the regularized likelihood. Our experiments in two large-scale retrieval applications, i.e., multimedia retrieval and meta-search on the TREC collections, demonstrate the superiority of the proposed methods, which achieve significant gains in average precision over the query-independent/query-class combination methods. We expect that future investigation into designing better query features for ApLQA and introducing distance-metric-type kernels into KpLQA could result in further improvements in retrieval source combination.

Acknowledgement
This work was supported in part by the Advanced Research and Development Activity (ARDA) under contract numbers H98230-04-C-0406 and NBCHC040037, and by the National Science Foundation under Grant No. IIS-0535056.

6. REFERENCES
[1] T. S. Chua, S. Y. Neo, K. Li, G. H. Wang, R. Shi, M. Zhao, H. Xu, S. Gao, and T. L. Nwe. TRECVID 2004 search and feature extraction task by NUS PRIS. In NIST TRECVID, 2004.
[2] T. Coleman and Y. Li. An interior, trust region approach for nonlinear minimization subject to bounds. SIAM Journal on Optimization, 6:418-445, 1996.
[3] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[4] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243-255, 1992.
[5] A. Hauptmann, M.-Y. Chen, M. Christel, C. Huang, W.-H. Lin, T. Ng, N. Papernick, A. Velivelli, J. Yang, R. Yan, H. Yang, and H. D. Wactlar. Confounded expectations: Informedia at TRECVID 2004. In Proc. of TRECVID, 2004.
[6] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of the 22nd Intl. ACM SIGIR Conference, pages 50-57, 1999.
[7] I.-H. Kang and G. Kim. Query type classification for web document retrieval. In Proc. of the 26th ACM SIGIR Conference, pages 64-71. ACM Press, 2003.
[8] L. Kennedy, P. Natsev, and S.-F. Chang. Automatic discovery of query class dependent models for multimodal search. In ACM Multimedia, Singapore, November 2005.
[9] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.
[10] R. Manmatha, F. Feng, and T. Rath. Using models of score distributions in information retrieval. In Proc. of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[11] M. Montague and J. A. Aslam. Relevance score normalization for metasearch. In Proceedings of the 10th International ACM CIKM Conference, pages 427-433, New York, NY, USA, 2001.
[12] R. Nallapati. Discriminative models for information retrieval. In Proc. of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64-71, 2004.
[13] S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27, 1977.
[14] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.
[15] J. A. Shaw and E. A. Fox. Combination of multiple searches. In Text REtrieval Conference, 1994.
[16] A. Smeaton and P. Over. TRECVID: Benchmarking the effectiveness of information retrieval tasks on digital video. In Proc. of the Intl. Conf. on Image and Video Retrieval, 2003.
[17] E. M. Voorhees, N. K. Gupta, and B. Johnson-Laird. Learning collection fusion strategies. In Proc. of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172-179, 1995.
[18] E. M. Voorhees and D. Harman. Overview of the eighth Text REtrieval Conference (TREC-8). In TREC, 1999.
[19] R. Yan and A. G. Hauptmann. Efficient margin-based rank learning algorithms for information retrieval. In International Conference on Image and Video Retrieval (CIVR), 2006.
[20] R. Yan, J. Yang, and A. G. Hauptmann. Learning query-class dependent weights in automatic video retrieval. In Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 548-555, 2004.
[21] E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow. Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. In Proceedings of the 28th International ACM SIGIR Conference, pages 512-519, New York, NY, USA, 2005. ACM Press.