Incorporating Query Difference for Learning Retrieval Functions in Information Retrieval
[Extended Abstract]

Hongyuan Zha
Department of CSE, Pennsylvania State University, University Park, PA 16802
zha@cse.psu.edu

Zhaohui Zheng
Yahoo Inc., 701 First Avenue, Sunnyvale, CA 94089
zhaohui@yahoo-inc.com

Haoying Fu, Gordon Sun
Yahoo Inc., 701 First Avenue, Sunnyvale, CA 94089
gzsun@yahoo-inc.com

ABSTRACT
We discuss information retrieval methods that aim at serving a diverse stream of user queries. We propose methods that emphasize the importance of taking query difference into account when learning effective retrieval functions. We formulate the problem as a multi-task learning problem using a risk minimization framework. In particular, we show how to calibrate the empirical risk to incorporate query difference by introducing nuisance parameters into the statistical models, and we propose an alternating optimization method to simultaneously learn the retrieval function and the nuisance parameters. We illustrate the effectiveness of the proposed methods using modeling data extracted from a commercial search engine.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval--Retrieval functions; H.4.m [Information Systems]: Miscellaneous--Machine learning

General Terms
Algorithms, Experimentation, Theory

Keywords
relevance, retrieval function, machine learning, query dependence, least-squares regression, regularization

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. PROBLEM FORMULATION

We consider D, the set of all documents under consideration; L, the set of labels, which can be either finite or infinite; and Q, the set of all potential user queries. We model each query q in Q as a probability distribution P_q over D x L: P_q(d, \ell), for d in D and \ell in L, specifies the probability of document d being labeled as \ell under query q. We then define a loss function L over L x L, L : L x L -> R^1_+, the set of nonnegative real numbers, and we specify a class of functions H from which the retrieval function will be drawn, where for h in H, h : Q x D -> L.

For each query q, we can then specify a learning problem (a classification or regression problem): find h in H such that

    h_q^* = \arg\min_{h \in H} E_{P_q} L(\ell, h(q, d)).

The goal of information retrieval is to learn a single retrieval function h that is good for all queries q in Q. We therefore need to deal with a potentially infinite number of related learning problems, one for each query q in Q. To this end, we specify a distribution P_Q over Q: P_Q(q) can indicate, for example, the probability that a specific query q is issued to the information retrieval system, which can be approximated. The optimization problem we need to solve is then stated for the combined risk,

    \min_{h \in H} E_{P_Q} E_{P_q} L(\ell, h(q, d)).                          (1)

In practice, we sample a set of queries {q_i}_{i=1}^{Q} from the distribution P_Q, and for each query q we also sample a set of documents from D for labeling, to obtain

    {d_{qj}, \ell_{qj}},  q = 1, ..., Q,  j = 1, ..., n_q,

where \ell_{qj} in L are labels obtained from human judges, for example through relevance assessment. Ideally, the sampling of the queries should be according to P_Q and the sampling of the documents for each query q according to P_q; the former is relatively easy to do, while the latter is a much more difficult issue. The optimization problem for the empirical counterpart of (1) is

    \min_{h \in H} \sum_{q=1}^{Q} \sum_{j=1}^{n_q} L(\ell_{qj}, h(q, d_{qj})) + \lambda \Omega(h),
where we also add a regularization term to control the complexity of h, with \Omega(h) measuring the complexity of h and \lambda the regularization parameter that balances the fit of the model, in terms of the empirical risk, against the complexity of the model.

2. INCORPORATING QUERY DIFFERENCE IN THE EMPIRICAL RISK

To be concrete, we consider the risk minimization problem in the context of regression; i.e., we assume the labels are real numbers and, to be consistent with convention, we use y_{qj} to denote the label \ell_{qj}. We assume the labeled set is represented as

    {x_{qj}, y_{qj}},  q = 1, ..., Q,  j = 1, ..., n_q,

where x_{qj} denotes the feature vector for the query-document pair {q, d_{qj}}. We then seek a function h in H that minimizes the following empirical risk:

    L(h) = \sum_{q=1}^{Q} \sum_{j=1}^{n_q} (y_{qj} - h(x_{qj}))^2.

To incorporate query-dependent effects, we consider the following modified empirical risk and seek h and g_q, q = 1, ..., Q, that minimize

    L(h, g) = \sum_{q=1}^{Q} \sum_{j=1}^{n_q} [y_{qj} - g_q(h(x_{qj}))]^2,          (2)

where g_q(.) is a general monotonically increasing function and g = [g_1, ..., g_Q]; i.e., we seek

    {h^*, g^*} = \arg\min_{h \in H, g_q monotonically increasing} L(h, g).

The intention is that the function g_q absorbs the difference across queries when h(x) is used to predict the label y. From another viewpoint, for suitably chosen g_q, we seek h(x) to match g_q^{-1}(y). In this work we focus on the simple case where g_q(.) is a linear function, i.e., g_q(x) = \alpha_q + \beta_q x, q = 1, ..., Q, with \beta_q >= 0.

Table 1: Number of queries and query-url pairs on the US and CN data sets

                                Chinese    English
    # queries                   2649       910
    # total query-document pairs 78188     70742
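For the linear case g_q(x) = \alpha_q + \beta_q x, calibrating one query for a fixed h reduces to a small least-squares fit of (\alpha_q, \beta_q) against the scores h(x_{qj}). The following is a minimal sketch of that per-query step, with the constraint \beta_q >= 0 handled by simply projecting the unconstrained solution to zero; the projection and the single shared ridge penalty are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def fit_linear_gq(scores, labels, ridge=0.0):
    """Fit g_q(s) = alpha_q + beta_q * s for one query by (optionally
    ridge-penalized) least squares, then clip beta_q at zero so that
    g_q stays monotonically non-decreasing.
    `scores` are h(x_qj); `labels` are y_qj."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Design matrix [1, s]; the normal equations give (alpha, beta).
    A = np.column_stack([np.ones_like(s), s])
    M = A.T @ A + ridge * np.eye(2)
    alpha, beta = np.linalg.solve(M, A.T @ y)
    return alpha, max(beta, 0.0)

# Toy usage: labels that are exactly an affine transform of the scores.
alpha, beta = fit_linear_gq([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(alpha, beta)  # recovers alpha_q = 1, beta_q = 2
```

With decreasing labels the unconstrained slope would be negative, and the clipping returns beta_q = 0, i.e., a constant (still non-decreasing) g_q.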
As mentioned before, to control the size of the parameters \alpha_q, \beta_q and the complexity of h, we also add regularization terms to the modified empirical risk (2), obtaining the regularized empirical risk

    L(h, \alpha, \beta) = \sum_{q=1}^{Q} \sum_{j=1}^{n_q} [y_{qj} - \alpha_q - \beta_q h(x_{qj})]^2
                          + \lambda_\alpha \|\alpha\|_p^p + \lambda_\beta \|\beta\|_p^p + \lambda_h \Omega(h),

where \alpha = [\alpha_1, ..., \alpha_Q] and \beta = [\beta_1, ..., \beta_Q]; \lambda_\alpha, \lambda_\beta and \lambda_h are regularization parameters; and \|.\|_p is the \ell_p norm of a vector. We only consider the cases p = 1 and p = 2. In summary, we seek

    {h^*, \alpha^*, \beta^*} = \arg\min_{h \in H, \alpha, \beta >= 0} L(h, \alpha, \beta).          (3)

In general, we do not impose a parametric form on the function h, and we employ coordinate descent (alternating optimization) to solve the optimization problem (3). Specifically, we alternate between optimizing with respect to h and optimizing with respect to \alpha and \beta. The regularization parameter \lambda_h is determined during the nonlinear regression process for finding h discussed below, while the regularization parameters \lambda_\alpha and \lambda_\beta are determined by cross-validation. In what follows, we refer to the above algorithm as adaptive Target Value Transformation (aTVT). We compare aTVT against a particular system that learns a retrieval function by nonlinear regression based on gradient boosting.

3. EXPERIMENTAL RESULTS

In this section we report experimental results on data generated from a commercial search engine. In our experiments, a set of queries is sampled from query logs, and a certain number of query-document pairs are labeled according to their perceived relevance as judged by human editors. The labels are mapped to numerical values, and the goal is to learn a retrieval function that best mimics the human judgment process. In Table 1, we list some basic statistics for the two data sets.

In this work, we use the recently popularized Discounted Cumulative Gain (DCG) methodology, which seems more appropriate for assessing relevance in the context of search engines. For a ranked list of N documents, we use the following variation of DCG:

    DCG_N = \sum_{i=1}^{N} G_i / \log_2(i + 1),

where G_i represents the weight assigned to the label of the document at position i. We use the symbol dcg to denote the average of this value over a set of queries in our experiments.

Table 2 lists the dcg for the retrieval function with aTVT as compared to the retrieval function without aTVT in the L2 case for the English data set (similar results hold for the Chinese data set). The percentage dcg gains and the p-values from Wilcoxon signed rank tests are also presented. From the table, we can see that aTVT gives statistically significant dcg gains. The optimal regularization parameter combinations give about a 2% dcg gain for the English data.

Table 2: The dcgs and percentage dcg increases of the retrieval function with aTVT over that without aTVT on the English data set, with p-values, for different regularization parameters \lambda_\alpha and \lambda_\beta in the L2 case. Notice that the dcg for the retrieval function without aTVT is 11.30.

                     \lambda_\alpha = 1       10                      50                      100
    \lambda_\beta = 1    11.48 (+1.55%, 0.01)   11.53 (+1.98%, 0.002)   11.50 (+1.75%, 0.002)   11.51 (+1.83%, 0.007)
    \lambda_\beta = 10   11.49 (+1.68%, 0.006)  11.52 (+1.88%, 0.002)   11.51 (+1.79%, 0.008)   11.49 (+1.65%, 0.003)

4. CONCLUDING REMARKS

There are many ways to incorporate query difference in learning retrieval functions; constructing appropriate query features is one of them, even though it is usually not viewed from this perspective. In this paper, we presented an approach that modifies the empirical risk with nuisance parameters to accommodate the effects of query difference. The approach can be used even when query features are included in the feature vectors of the query-document pairs.
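As a small self-contained illustration, the DCG variant used in Section 3 can be computed as follows; the gain weights G_i are taken as inputs, since the paper does not specify the label-to-weight mapping here.

```python
import math

def dcg(gains):
    """DCG_N = sum over positions i = 1..N of G_i / log2(i + 1),
    where G_i is the weight assigned to the label at position i."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

# Toy usage: gains for a ranked list of three documents.
# Position 1 is discounted by log2(2) = 1, so its gain counts in full.
print(dcg([3.0, 1.0, 2.0]))
```

Averaging this value over a set of queries yields the dcg figure reported in Table 2.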