Using Gradient Descent to Optimize Language Modeling Smoothing Parameters

Donald Metzler
metzler@cs.umass.edu
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst, MA 01003

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Theory, Experimentation
Keywords: Language modeling, parameter estimation, gradient descent

1. INTRODUCTION

Most information retrieval models have some set of tunable parameters. It is often the case that the parameters of such models are either set to default values or tuned in a supervised or unsupervised manner. Historically, many retrieval models have been tuned to maximize some underlying retrieval metric, such as mean average precision. However, as the number of parameters increases, direct maximization techniques become computationally expensive. For this reason, there has been growing interest in both the information retrieval and machine learning communities in parameter estimation techniques that scale well. Since most information retrieval metrics are non-smooth with respect to model parameters, the machine learning techniques have focused on maximizing or minimizing some surrogate function that attempts to mimic, or be highly correlated with, retrieval metrics. Standard optimization techniques can then easily be applied to the surrogate function in order to estimate approximate parameters. There have been, however, few studies from an information retrieval perspective into how well such surrogate functions compare to the direct search approach.

Recent work has investigated how the effectiveness of BM25F parameters estimated by minimizing the RankNet cost function compares with the effectiveness of an approach that directly maximizes NDCG [2]. The results showed that RankNet acts as a good surrogate and produces reasonable effectiveness when compared to a more computationally expensive direct search technique. Following up on this work, we wish to explore how to use the RankNet cost function to optimize language modeling smoothing parameters. In addition, we explore how well the parameters learned using RankNet compare to those found by a direct search technique for various metrics that rely on binary relevance judgments, such as mean average precision, binary preference, and precision at 10.

2. PARAMETER ESTIMATION

In this section we describe, in general terms, how to estimate parameters using direct search and the RankNet cost function.

2.1 Direct Optimization

Given a model with one or more parameters, a set of relevance judgments, and a retrieval metric, it is straightforward to implement various algorithms that directly attempt to find the parameter setting that maximizes or minimizes the metric. One of the most naïve approaches is to search for the global optimum via a brute force search over the entire parameter space. Depending on whether the parameter space is bounded and on the number of parameters, this may be computationally infeasible. Techniques such as coordinate ascent (and its variants) may be more efficient, but are not guaranteed to find a global optimum. These techniques are forced to estimate derivatives via finite differences or to perform line searches, since analytical derivatives cannot be computed in general.
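As a concrete illustration, the following is a minimal sketch of such a direct search: a coordinate ascent that line-searches each of the two smoothing parameters of Section 3 (λ and μ) in turn. It is not the procedure used in the paper; `evaluate_metric` is a hypothetical callback that ranks the training queries with the given parameters and returns the target metric (e.g., mean average precision).

```python
def coordinate_ascent(evaluate_metric, n_sweeps=5):
    """Direct search over (lambda, mu) by alternating grid line searches."""
    lam_grid = [i / 20.0 for i in range(1, 20)]   # 0.05 .. 0.95
    mu_grid = [250.0 * i for i in range(1, 41)]   # 250 .. 10000
    lam, mu = 0.5, 1000.0                         # arbitrary starting point
    best = evaluate_metric(lam, mu)
    for _ in range(n_sweeps):
        for cand in lam_grid:                     # line search over lambda, mu fixed
            score = evaluate_metric(cand, mu)
            if score > best:
                best, lam = score, cand
        for cand in mu_grid:                      # line search over mu, lambda fixed
            score = evaluate_metric(lam, cand)
            if score > best:
                best, mu = score, cand
    return lam, mu, best
```

A brute force search would evaluate every (λ, μ) pair on the grid; the sweeps above trade that exhaustiveness for far fewer metric evaluations, at the cost of possibly settling on a local optimum, as noted above.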
2.2 RankNet Optimization

Rather than searching directly within the original, non-smooth retrieval metric space, we may instead optimize a surrogate function. In our work, we choose the RankNet cost function as our surrogate [1, 2]. The RankNet cost function is defined over pairwise preferences. That is, for a given query Q, we define the set R_Q such that (d_1, d_2) ∈ R_Q implies that document d_1 should be ranked higher than document d_2. Given a set of binary relevance judgments, it is easy to construct R_Q by taking the cross product of the relevant and non-relevant documents. There are other ways to construct R_Q as well, although we found that we achieved the best results by using all of the pairwise preferences available to us. Once we have defined our pairwise preferences, the RankNet cost is computed according to:

    C(\mathcal{Q}, R) = \sum_{Q \in \mathcal{Q}} \sum_{(d_1, d_2) \in R_Q} \log(1 + \exp(Y))

where \mathcal{Q} is the set of queries, Y = g(Q; d_2) - g(Q; d_1), and g(Q; d) is the score of document d with respect to query Q under the current parameter setting. In order to minimize C, we perform gradient descent, which requires us to compute gradients with respect to our model parameters. Given some model parameter θ, we apply the chain rule to compute the gradient of C with respect to θ as:

    \frac{\partial C}{\partial \theta} = \sum_{Q \in \mathcal{Q}} \sum_{(d_1, d_2) \in R_Q} \frac{\partial C}{\partial Y} \frac{\partial Y}{\partial \theta}

Using the RankNet cost function, it is easy to see that \partial C / \partial Y is computed as:

    \frac{\partial C}{\partial Y} = \frac{\exp\left[ g(Q; d_2) - g(Q; d_1) \right]}{1 + \exp\left[ g(Q; d_2) - g(Q; d_1) \right]}

The final piece, \partial Y / \partial \theta, depends on the underlying scoring function. In order to compute this partial analytically, we must be able to differentiate our scoring function with respect to each parameter. It is then straightforward to compute \partial Y / \partial \theta according to:

    \frac{\partial Y}{\partial \theta} = \frac{\partial g(Q; d_2)}{\partial \theta} - \frac{\partial g(Q; d_1)}{\partial \theta}

Note that the RankNet cost function does not depend on any specific retrieval metric. Therefore, RankNet will always learn the same parameters for a given set of pairwise preferences, regardless of the metric we wish to optimize for.

3. LANGUAGE MODELING RANKING

We wish to estimate the parameters of the language modeling query likelihood ranking function using two-stage smoothing [3]. The two-stage smoothing estimate, which has two parameters, generalizes both Dirichlet and Jelinek-Mercer smoothing, and therefore makes for a good general purpose language modeling ranking function. A document D is scored against query Q under this model as follows:

    g(Q; D) = \sum_{w \in Q} \log\left( (1 - \lambda) \frac{tf_{w,D} + \mu P(w|C)}{\mu + |D|} + \lambda P(w|C) \right)

where λ and μ are the smoothing parameters that we wish to estimate, tf_{w,D} is the frequency of term w in D, |D| is the document length, and P(w|C) is the collection language model probability of w. Note that we rank according to the log query likelihood in order to simplify the mathematical derivations. The partial derivatives of the scoring function, with respect to λ and μ, are computed as follows:

    \frac{\partial g(Q; D)}{\partial \lambda} = \sum_{w \in Q} \frac{P(w|C) - \frac{tf_{w,D} + \mu P(w|C)}{\mu + |D|}}{(1 - \lambda) \frac{tf_{w,D} + \mu P(w|C)}{\mu + |D|} + \lambda P(w|C)}

    \frac{\partial g(Q; D)}{\partial \mu} = \sum_{w \in Q} \frac{(1 - \lambda) \frac{|D| P(w|C) - tf_{w,D}}{(\mu + |D|)^2}}{(1 - \lambda) \frac{tf_{w,D} + \mu P(w|C)}{\mu + |D|} + \lambda P(w|C)}
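To make these derivations concrete, the sketch below (a Python illustration, not code from the paper) computes the two-stage score and both partial derivatives for a single document in one pass over the query terms. The data structures are assumptions: `doc_tf` is a term-frequency dict for D, `doc_len` is |D|, and `collection_prob[w]` gives P(w|C).

```python
import math

def two_stage_score_and_grads(query_terms, doc_tf, doc_len, collection_prob, lam, mu):
    """Log query likelihood g(Q;D) under two-stage smoothing, together with
    its partial derivatives with respect to lambda and mu (see formulas above)."""
    score = dscore_dlam = dscore_dmu = 0.0
    for w in query_terms:
        p_c = collection_prob[w]                       # P(w|C)
        tf = doc_tf.get(w, 0)                          # tf_{w,D}
        dirichlet = (tf + mu * p_c) / (mu + doc_len)   # Dirichlet-smoothed estimate
        p_w = (1.0 - lam) * dirichlet + lam * p_c      # two-stage estimate
        score += math.log(p_w)
        dscore_dlam += (p_c - dirichlet) / p_w
        dscore_dmu += (1.0 - lam) * (doc_len * p_c - tf) / ((mu + doc_len) ** 2) / p_w
    return score, dscore_dlam, dscore_dmu
```

Since both partials share the per-term two-stage estimate in their denominators, all three quantities can be accumulated together in a single loop, as above.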
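Building on that routine, here is a hedged sketch of a single RankNet gradient descent step over a set of pairwise preferences, following the chain rule of Section 2.2. The layout of `preferences` (query terms plus (term-frequency dict, length) pairs for the preferred and non-preferred documents), the step sizes, and the parameter clamping are illustrative assumptions rather than details from the paper.

```python
import math  # two_stage_score_and_grads is defined in the previous sketch

def ranknet_gradient_step(preferences, collection_prob, lam, mu,
                          step_lam=1e-3, step_mu=10.0):
    """One gradient descent step on C = sum log(1 + exp(Y)), Y = g(Q;d2) - g(Q;d1),
    where d1 is the document that should be ranked higher."""
    grad_lam = grad_mu = cost = 0.0
    for query_terms, (tf1, len1), (tf2, len2) in preferences:
        g1, dl1, dm1 = two_stage_score_and_grads(query_terms, tf1, len1,
                                                 collection_prob, lam, mu)
        g2, dl2, dm2 = two_stage_score_and_grads(query_terms, tf2, len2,
                                                 collection_prob, lam, mu)
        y = g2 - g1
        cost += max(y, 0.0) + math.log1p(math.exp(-abs(y)))  # stable log(1 + exp(Y))
        dc_dy = 0.5 * (1.0 + math.tanh(0.5 * y))             # dC/dY = exp(Y)/(1+exp(Y))
        grad_lam += dc_dy * (dl2 - dl1)                      # dC/dY * dY/d(lambda)
        grad_mu += dc_dy * (dm2 - dm1)                       # dC/dY * dY/d(mu)
    # Move downhill and keep the parameters in sensible ranges.
    lam = min(max(lam - step_lam * grad_lam, 1e-6), 1.0 - 1e-6)
    mu = max(mu - step_mu * grad_mu, 1.0)
    return lam, mu, cost
```

In practice such steps would be repeated until the cost, or a retrieval metric measured on a validation set as Taylor et al. suggest [2], stops improving.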
4. EVALUATION

In this section we evaluate how the effectiveness of language modeling parameters learned by minimizing the RankNet cost function compares to the effectiveness of parameters learned using direct search. Experiments are carried out on four standard TREC ad hoc retrieval test collections, including three newswire data sets (AP '88-'90, WSJ '87-'92, and the 2004 Robust Track data set) and a large web data set (WT10g). For training purposes, each set of topics is split into a training and a test set. The results are given in Table 1.

Table 1: Test set effectiveness for parameters estimated using direct search and RankNet. Optimal effectiveness values are also provided as an upper bound. Effectiveness is measured in terms of mean average precision (MAP), binary preference (BPREF), and precision at 10 (P@10). Italicized values indicate statistically significant improvements over direct search; bold values indicate significant improvements over RankNet.

Metric   Estimate    AP      WSJ     Robust   WT10g
MAP      Direct     .2072   .3255   .2920    .1930
         RankNet    .2081   .2987   .2756    .1922
         Optimal    .2088   .3290   .2931    .2003
BPREF    Direct     .3490   .3392   .2821    .1834
         RankNet    .3437   .3294   .2661    .1805
         Optimal    .3504   .3557   .2840    .1907
P@10     Direct     .3360   .4820   .4323    .3204
         RankNet    .3400   .4240   .4384    .3143
         Optimal    .3520   .4960   .4434    .3388

As the table shows, the outcomes for MAP and BPREF are the same, with direct search being significantly better than RankNet on the WSJ and Robust data sets. In addition, RankNet is significantly worse than optimal for every data set except AP, whereas direct search only differs significantly from optimal for WSJ and Robust. The results for P@10 show that RankNet is only significantly worse than direct search for WSJ. Indeed, RankNet appears to be more stable for P@10 than for MAP and BPREF.

These results indicate that RankNet is never significantly better than direct search for estimating the two-stage language modeling parameters. While RankNet often produces reasonable effectiveness, it is considerably less consistent than direct search. This is likely because RankNet minimizes a surrogate function that only tends to be correlated with the actual retrieval metrics. As pointed out by Taylor et al., it may be possible to improve the consistency of RankNet by measuring the retrieval metric on a validation set and using it as a stopping criterion [2]. Therefore, when computationally feasible, direct search still seems to be the most appropriate method for estimating parameters. However, RankNet acts as a good surrogate that scales better and often produces reasonable results.

Acknowledgments

This work was supported in part by the CIIR, in part by NSF grant #CNS-0454018, in part by ARDA and NSF grant #CCF-0205575, and in part by Microsoft Live Labs. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

5. REFERENCES

[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. 22nd ICML, pages 89-96, 2005.
[2] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. In Proc. 15th CIKM, pages 585-593, 2006.
[3] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proc. 25th SIGIR, pages 49-56, 2002.