A General Boosting Method and its Application to Learning Ranking Functions for Web Search

Zhaohui Zheng, Hongyuan Zha*, Tong Zhang, Olivier Chapelle, Keke Chen, Gordon Sun
Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089
{zhaohui,tzhang,chap,kchen,gzsun}@yahoo-inc.com
*College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, zha@cc.gatech.edu

Abstract

We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions, which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner, such as a single regression tree, for fitting any loss function. We illustrate an application of the proposed method to learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search, using data from a commercial search engine, that show significant improvements of our proposed methods over some existing methods.

1 Introduction

There has been much interest in developing machine learning methods involving complex loss functions beyond those used in regression and classification problems [13]. Many methods have been proposed dealing with a wide range of problems including ranking problems, learning conditional random fields, and other structured learning problems [1, 3, 4, 5, 6, 7, 11, 13]. In this paper we propose a boosting framework that can handle a wide variety of complex loss functions. The proposed method uses a regression black box to optimize a general loss function based on quadratic upper bounds, and it also allows us to present a rigorous convergence analysis of the method. Our approach extends the gradient boosting approach proposed in [8] but can handle substantially more complex loss functions arising from a variety of machine learning problems.

As an interesting and important application of the general boosting framework, we apply it to the problem of learning ranking functions for Web search. Specifically, we want to rank a set of documents according to their relevance to a given query. We adopt the following framework: we extract a set of features $x$ for each query-document pair and learn a function $h(x)$ so that we can rank the documents using the values $h(x)$, say, with documents having larger values ranked higher. We call such a function $h$ a ranking function. In Web search, we can identify two types of training data for learning a ranking function: 1) preference data indicating that a document is more relevant than another with respect to a query [11, 12]; and 2) labeled data, where documents are assigned ordinal labels representing their degree of relevance. In general, we will have both preference data and labeled data for training a ranking function for Web search, leading to a complex loss function that can be handled by our proposed general boosting method, which we now describe.

2 A General Boosting Method

We consider the following general optimization problem:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \mathcal{R}(h), \qquad (1)$$

where $h$ denotes a prediction function which we are interested in learning from the data, $\mathcal{H}$ is a prechosen function class, and $\mathcal{R}$ is a risk functional with respect to $h$.
We consider the following form of the risk functional $\mathcal{R}$:

$$\mathcal{R}(h) = \sum_{i=1}^{m} \phi_i\big(h(x_{i,1}), \ldots, h(x_{i,m_i})\big), \qquad (2)$$

where each $\phi_i$ is a loss function with respect to its arguments. For example, each function $\phi_i$ can be a single-variable function ($m_i = 1$), such as in regression: $\phi_i(h(x_i)) = (h(x_i) - y_i)^2$; or a two-variable function ($m_i = 2$), such as those in ranking based on pairwise comparisons: $\phi_i(h(x_{i,1}), h(x_{i,2})) = \phi\big(y_i\,(h(x_{i,1}) - h(x_{i,2}))\big)$, where $y_i$ indicates whether $x_{i,1}$ is preferred to $x_{i,2}$ or not; or it can be a multi-variable function as used in some structured prediction problems: $\phi_i(h(x_{i,1}), \ldots, h(x_{i,m_i})) = L(h(x_{i,1}), \ldots, h(x_{i,m_i}), y_i)$, where $L$ is a loss function [13].

Assume we do not have a general solver for the optimization problem (1), but we have a learning algorithm $\mathcal{A}$ which we refer to as the regression weak learner. Given any set of data points $\tilde{x}_1, \ldots, \tilde{x}_n$, with corresponding target values $r_1, \ldots, r_n$, weights $w_1, \ldots, w_n$, and tolerance $\epsilon$, the regression weak learner produces a function $g = \mathcal{A}(\{(\tilde{x}_j, r_j, w_j)\}, \epsilon) \in \mathcal{H}$ such that

$$\sum_{j=1}^{n} w_j \big(g(\tilde{x}_j) - r_j\big)^2 \le \epsilon + \inf_{g' \in \mathcal{H}} \sum_{j=1}^{n} w_j \big(g'(\tilde{x}_j) - r_j\big)^2. \qquad (3)$$

Our goal is to use this weak learner to solve the original optimization problem (1). Friedman [8] proposed a solution, which he named gradient boosting, for the case when the loss function in (2) can be expressed as

$$\mathcal{R}(h) = \sum_{i=1}^{m} \phi\big(h(x_i), y_i\big), \qquad (4)$$

i.e., each loss term depends on $h$ through a single value $h(x_i)$. The idea is to estimate the gradient using regression at each step with uniform weighting, and then update. However, no convergence proof was given. Following his work, we consider an extension that is motivated in a more principled way and for which a convergence analysis can be obtained. We first rewrite (2) in the more general form

$$\mathcal{R}(h) = \mathcal{R}\big(h(\tilde{x}_1), \ldots, h(\tilde{x}_n)\big), \qquad (5)$$

where $\{\tilde{x}_1, \ldots, \tilde{x}_n\}$ are the distinct points among the $x_{i,j}$ appearing in (2).¹ Note that $\mathcal{R}$ depends on $h$ only through the function values $h(\tilde{x}_1), \ldots, h(\tilde{x}_n)$, and from now on we identify the function $h$ with the vector $(h(\tilde{x}_1), \ldots, h(\tilde{x}_n))$; the functional $\mathcal{R}$ is accordingly considered to be a function of $n$ variables.

Our main observation is that for a twice differentiable risk functional $\mathcal{R}$, at each tentative solution $h$ we can expand $\mathcal{R}(h + g)$ around $h$ using the Taylor expansion as

$$\mathcal{R}(h + g) = \mathcal{R}(h) + \nabla \mathcal{R}(h)^{\top} g + \tfrac{1}{2}\, g^{\top} \nabla^2 \mathcal{R}(\xi)\, g,$$

where $\xi$ lies between $h$ and $h + g$. The right-hand side is almost quadratic in $g$, and we can then replace it by a quadratic upper bound

$$\mathcal{R}(h + g) \le \mathcal{R}(h) + \nabla \mathcal{R}(h)^{\top} g + \tfrac{1}{2} \sum_{j=1}^{n} w_j\, g(\tilde{x}_j)^2, \qquad (6)$$

where $W = \operatorname{diag}(w_1, \ldots, w_n)$ is a diagonal matrix upper bounding the Hessian between $h$ and $h + g$. If we define $r_j = -\nabla_j \mathcal{R}(h)/w_j$, then $\sum_j w_j (g(\tilde{x}_j) - r_j)^2$ is equal to the above quadratic form (up to a constant), so the minimizing $g$ can be found by calling the regression weak learner $\mathcal{A}$. Since at each step we minimize an upper bound of $\mathcal{R}(h + g)$ that coincides with $\mathcal{R}(h)$ at $g = 0$, if we let $\bar{g}$ be the minimizer, it is clear that $\mathcal{R}(h + \bar{g}) \le \mathcal{R}(h)$. This means that by optimizing with respect to a problem that can be handled by $\mathcal{A}$, we also make progress with respect to optimizing $\mathcal{R}$. The algorithm based on this idea is listed in Algorithm 1 for the loss function in (5).

¹We consider that all $\tilde{x}_j$ are different, but some of the $x_{i,j}$ in (2) might have been identical; hence the inequality $n \le \sum_i m_i$.
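To make the reduction to weighted regression concrete, the following is a minimal sketch of this update loop; it is illustrative code, not the authors' implementation. The weak learner $\mathcal{A}$ is replaced by a hypothetical weighted linear least-squares fit, and the line search is a crude grid search.

```python
import numpy as np

def weak_learner(X, r, w):
    """Weighted least squares as a stand-in for the base regressor A;
    returns the fitted predictions g(x_j) on the n points."""
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ r)
    return X @ beta

def quadratic_boosting(X, R, grad_R, diag_bound, num_iters=100):
    """One run of the quadratic-upper-bound boosting loop.
    R(h), grad_R(h), diag_bound(h) act on the vector of function values h."""
    h = np.zeros(X.shape[0])                 # initial guess h_0 = 0
    for _ in range(num_iters):
        w = diag_bound(h)                    # diagonal upper bound w_j on the Hessian
        r = -grad_R(h) / w                   # targets r_j = -grad_j R(h) / w_j
        g = weak_learner(X, r, w)            # the call to the weak learner A
        steps = np.linspace(0.0, 1.0, 101)   # crude line search on R(h + s g)
        s = min(steps, key=lambda s: R(h + s * g))
        h = h + s * g
    return h

# Example with the squared loss R(h) = sum_j (h_j - y_j)^2, whose gradient is
# 2(h - y) and whose Hessian is exactly 2I (so the "upper bound" is tight):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0])
h = quadratic_boosting(X, R=lambda h: np.sum((h - y) ** 2),
                       grad_R=lambda h: 2 * (h - y),
                       diag_bound=lambda h: 2 * np.ones_like(h))
```

Only `grad_R` and `diag_bound` change when moving to a more complex risk, such as the ranking loss of Section 3; the regression call itself is untouched.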
Convergence analysis of this algorithm can be established using the idea summarized above; see the appendix for details. In practice, however, instead of the quadratic upper bound (for which a theoretical guarantee is easier to derive), one may also consider minimizing the quadratic approximation given by the Taylor expansion itself, which would be closer to a Newton-type method.

Algorithm 1 Greedy Algorithm with Quadratic Approximation

Input: tolerance $\epsilon$, initial guess $h_0$
let $h = h_0$
for $k = 1, 2, \ldots$
    pick weights $w_j$, $j = 1, \ldots, n$, with either
        $w_j = \nabla^2_{jj} \mathcal{R}(h)$    % Newton-type method with diagonal Hessian
    or
        $w_j$ = a global diagonal upper bound on the Hessian    % upper-bound minimization
    let $r_j = -\nabla_j \mathcal{R}(h) / w_j$
    let $g_k = \mathcal{A}(\{(\tilde{x}_j, r_j, w_j)\}, \epsilon)$
    pick a step size $s_k$, typically by line search on $\mathcal{R}(h + s\, g_k)$
    let $h = h + s_k g_k$
end

The main conceptual difference between our view and that of Friedman is that he views regression as a "reasonable" approximation of the first-order gradient, while our work views it as a natural consequence of a second-order approximation of the objective function (in which the quadratic term serves as an upper bound of the Hessian, either locally or globally). This leads to algorithmic differences. In our approach, a good choice of the second-order upper bound (leading to a tighter bound) may require non-uniform weights $w_j$. This is in line with earlier boosting work in which sample reweighting was a central idea. In our framework, the reweighting occurs naturally when we choose a tight second-order approximation, and different reweightings can affect the rate of convergence in our analysis. The other main difference from Friedman is that he only considered objective functions of the form (4); we propose a natural extension to those of the form (5).

3 Learning Ranking Functions

We now apply Algorithm 1 to the problem of learning ranking functions. We use preference data as well as labeled data for training the ranking function. For preference data, we use $x \succ y$ to mean that $x$ is preferred over $y$, i.e., $x$ should be ranked higher than $y$, where $x$ and $y$ are the feature vectors of the corresponding items to be ranked. We denote the set of available preferences as

$$S = \{\langle x_i, y_i \rangle \mid x_i \succ y_i,\; i = 1, \ldots, N\}.$$

In addition to the preference data, there are also labeled data $\{(z_i, l_i)\}$, where $z_i$ is the feature vector of an item and $l_i$ is the corresponding numerically coded label.² We formulate the ranking problem as computing a ranking function $h \in \mathcal{H}$ such that $h$ satisfies as much as possible the set of preferences, i.e., $h(x_i) \ge h(y_i)$ if $x_i \succ y_i$, while at the same time $h(z_i)$ matches the label $l_i$ in a sense to be detailed below.

²Some may argue that absolute relevance judgments can also be converted to relative relevance judgments. For example, suppose that for a query we have three documents $d_1$, $d_2$, and $d_3$ labeled as perfect, good, and bad, respectively. We can obtain the following relative relevance judgments: $d_1$ is preferred over $d_2$, $d_1$ is preferred over $d_3$, and $d_2$ is preferred over $d_3$. However, it is often the case in Web search that for many queries there only exist documents with a single label, and for such queries no preference data can be constructed.
THE OBJECTIVE FUNCTION. We use the following objective function to measure the empirical risk of a ranking function $h$:

$$\mathcal{R}(h) = \frac{1}{2} \left[ \sum_{\langle x_i, y_i \rangle \in S} w_i \big(\max\{0,\; h(y_i) - h(x_i) + \tau\}\big)^2 + \lambda \sum_i \big(h(z_i) - l_i\big)^2 \right].$$

The objective function consists of two parts: 1) for the preference data part, we introduce a margin parameter $\tau$ and would like to enforce that $h(x_i) \ge h(y_i) + \tau$; if not, the difference is quadratically penalized; and 2) for the labeled data part, we simply minimize the squared errors. The parameter $\lambda$ is the relative weight of the labeled data with respect to the preference data and could typically be found by cross-validation. The optimization problem we seek to solve is $\arg\min_{h \in \mathcal{H}} \mathcal{R}(h)$, where $\mathcal{H}$ is some given function class. Note that $\mathcal{R}(h)$ depends only on the values of $h$ at the points $x_i$, $y_i$, and $z_i$, so we can optimize it using the general boosting framework discussed in Section 2.

QUADRATIC APPROXIMATION. To this end, consider the quadratic approximation (6) for $\mathcal{R}(h)$. For simplicity, let us assume that each feature vector $x_i$, $y_i$, and $z_i$ appears only once; otherwise we need to compute appropriately formed averages. We consider the values $h(x_i)$, $h(y_i)$, and $h(z_i)$ as the unknowns and compute the gradient of $\mathcal{R}$ with respect to those unknowns. The component of the negative gradient corresponding to $h(z_i)$ is just $\lambda\,(l_i - h(z_i))$. The components of the negative gradient corresponding to $h(x_i)$ and $h(y_i)$, respectively, are

$$w_i \max\{0,\; h(y_i) - h(x_i) + \tau\} \quad\text{and}\quad -\,w_i \max\{0,\; h(y_i) - h(x_i) + \tau\}.$$

Both of the above equal zero when $h(x_i) - h(y_i) \ge \tau$. For the second-order term, it can be readily verified that the Hessian of $\mathcal{R}$ is block-diagonal, with 2-by-2 blocks corresponding to the pairs $(h(x_i), h(y_i))$ and 1-by-1 blocks for the $h(z_i)$. In particular, if we evaluate the Hessian at the current $h$, the 2-by-2 block equals

$$w_i \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \quad\text{or}\quad \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$$

for pairs with $h(y_i) - h(x_i) + \tau > 0$ and $h(y_i) - h(x_i) + \tau \le 0$, respectively, and the 1-by-1 blocks equal $\lambda$. We can upper bound the first matrix by the diagonal matrix $2 w_i I$, leading to a quadratic upper bound. We summarize the above derivations in the following algorithm.

Algorithm 2 Boosted Ranking using Successive Quadratic Approximation (QBrank)

Start with an initial guess $h_0$; for $k = 1, 2, \ldots$

1) construct a training set for fitting $g_k$ by adding, for each $\langle x_i, y_i \rangle \in S$, the pairs

$$\big(x_i,\; \max\{0,\; h_{k-1}(y_i) - h_{k-1}(x_i) + \tau\}\big) \quad\text{and}\quad \big(y_i,\; -\max\{0,\; h_{k-1}(y_i) - h_{k-1}(x_i) + \tau\}\big),$$

and $(z_i,\; l_i - h_{k-1}(z_i))$ for the labeled data. The fitting of $g_k$ is done by a base regressor with the above training set; we weigh the preference data by $w_i$ and the labeled data by $\lambda$, respectively.

2) form $h_k = h_{k-1} + \eta\, s_k\, g_k$, where $s_k$ is found by line search to minimize the objective function and $\eta$ is a shrinkage factor. The shrinkage factor is 1 by default, but Friedman [8] reported better results (coming from better regularization) with $\eta < 1$. In general, we choose $\eta$ and $\lambda$ by cross-validation. The weight $w_i$ can be the degree of preference if that information is available, e.g., the absolute grade difference for each preference if it is converted from labeled data; otherwise, we simply set it to 1.0.

When there is no preference data and the weak regression learner produces a regression tree, QBrank is identical to Gradient Boosting Trees (GBT) as proposed in [8].

REMARK. A feature vector can appear multiple times in the training set constructed in Step 1); in this case we use the average of the gradient-based target values as the single target value for that distinct vector.
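As an illustration of Step 1), here is a hypothetical sketch of the training-set construction for one QBrank iteration (illustrative code, not the authors' implementation; the constant factor from the Hessian bound is absorbed into the line-search step $s_k$):

```python
import numpy as np

def qbrank_training_set(h, prefs, pref_w, labeled, lam, tau):
    """Build the weighted regression set for fitting g_k in one QBrank step.

    h       -- current model h_{k-1}, callable on a feature vector
    prefs   -- list of (x_i, y_i) pairs with x_i preferred over y_i
    pref_w  -- preference weights w_i (e.g., absolute grade differences)
    labeled -- list of (z_i, l_i) feature-vector / numeric-label pairs
    lam     -- relative weight of the labeled data
    tau     -- margin parameter
    """
    feats, targets, weights = [], [], []
    for (x, y), w in zip(prefs, pref_w):
        viol = max(0.0, h(y) - h(x) + tau)   # margin violation; 0 if h(x) >= h(y) + tau
        feats += [x, y]
        targets += [viol, -viol]             # push h(x_i) up and h(y_i) down
        weights += [w, w]
    for z, l in labeled:
        feats.append(z)
        targets.append(l - h(z))             # residual for the labeled part
        weights.append(lam)
    return np.array(feats), np.array(targets), np.array(weights)
```

The base regressor (e.g., a small regression tree) is then fit to these features, targets, and weights, and the model is updated as $h_k = h_{k-1} + \eta\, s_k\, g_k$ with $s_k$ from a line search on the objective.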
4 Experiment Results

We carried out several experiments illustrating the properties and effectiveness of QBrank using combined preference data and labeled data in the context of learning ranking functions for Web search [3]. We also compared its performance with QBrank using preference data only and with several existing algorithms, such as Gradient Boosting Trees (GBT) [8] and RankSVM [11, 12]. RankSVM is a preference learning method which learns pairwise preferences based on the SVM approach.

DATA COLLECTION. We first describe how the data used in the experiments were collected. For each query-document pair we extracted a set of features to form a feature vector, which consists of three parts: 1) the query-feature vector, comprising features that depend on the query $q$ only and have constant values across all the documents in the document set, for example, the number of terms in the query, whether or not the query is a person name, etc.; 2) the document-feature vector, comprising features that depend on the document $d$ only and have constant values across all the queries in the query set, for example, the number of inbound links pointing to the document, the amount of anchor-text in bytes for the document, and the language identity of the document, etc.; and 3) the query-document feature vector, comprising features that depend on the relation of the query $q$ to the document $d$, for example, the number of times each term in the query appears in the document, the number of times each term in the query appears in the anchor-text of the document, etc.

We sampled a set of queries from the query logs of a commercial search engine and generated a certain number of query-document pairs for each of the queries. A five-level numerical grade was assigned to each query-document pair based on the degree of relevance. In total we have 4,898 queries and 105,243 query-document pairs. We split the data into three subsets as follows: 1) we extract all the queries which have only documents with a single label; the feature vectors and the corresponding labels form training set L1, which contains around 2,000 queries giving rise to 20,000 query-document pairs. (Some of the single-labeled data come from an editorial database, where each query has a few ideal results with the same label; the others are bad ranking cases submitted internally, in which all the documents for a query are labeled as bad. As we will see, these types of single-labeled data are very useful for learning ranking functions.) 2) We then randomly split the remaining data by queries and construct a training set L2 containing about 1,300 queries and 40,000 query-document pairs and a test set L3 with about 1,400 queries and 44,000 query-document pairs.

We use L2 or L3 to generate a set of preference data as follows: given a query and two documents $d_x$ and $d_y$, let the feature vectors for $d_x$ and $d_y$ be $x$ and $y$, respectively. If $d_x$ has a higher grade than $d_y$, we include the preference $x \succ y$, while if $d_y$ has a higher grade than $d_x$, we include the preference $y \succ x$. For each query, we consider all pairs of documents within the search results for that query except those with equal grades. This way, we generate around 500,000 preference pairs in total. We denote the preference data generated from L2 and L3 as P2 and P3, respectively.
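For concreteness, the pair-generation step just described can be sketched as follows (hypothetical code; the `docs_by_query` input format is an assumption, not part of the original system):

```python
from itertools import combinations

def preference_pairs(docs_by_query):
    """docs_by_query: dict mapping a query id to a list of
    (feature_vector, grade) tuples for its judged documents."""
    prefs = []
    for docs in docs_by_query.values():
        for (x, gx), (y, gy) in combinations(docs, 2):
            if gx > gy:
                prefs.append((x, y))     # x is preferred over y
            elif gy > gx:
                prefs.append((y, x))     # y is preferred over x
            # pairs with equal grades are skipped
    return prefs
```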
EVALUATION METRICS. The output of QBrank is a ranking function $h$, which is used to rank the documents according to $h(x)$: document $d_x$ is ranked higher than $d_y$ by the ranking function if $h(x) > h(y)$, and we call this the predicted preference. We propose the following two metrics to evaluate the performance of a ranking function with respect to a given set of preferences, which we consider as the true preferences.

1) Precision at K%: for two documents $d_x$ and $d_y$ (with respect to the same query), it is reasonable to assume that it is easy to compare $d_x$ and $d_y$ if $|h(x) - h(y)|$ is large, and that $d_x$ and $d_y$ should have about the same rank if $|h(x) - h(y)|$ is close to zero. Based on this, we sort all the document pairs in decreasing order of $|h(x) - h(y)|$. We call precision at K% the fraction of non-contradicting pairs in the top K% of the sorted list. Precision at 100% can be considered an overall performance measure of a ranking function.

2) Discounted Cumulative Gain (DCG): DCG has been widely used to assess relevance in the context of search engines [10]. For a ranked list of $N$ documents ($N$ is set to 5 in our experiments), we use the following variation of DCG:

$$\mathrm{DCG}_N = \sum_{i=1}^{N} \frac{G_i}{\log_2(i+1)},$$

where $G_i$ represents the weight assigned to the label of the document at position $i$. Higher degrees of relevance correspond to larger values of the weight. We use the symbol DCG-5 to indicate the average of this value over a set of queries in our experiments.
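As a reference point, a minimal sketch of this DCG variant (the $\log_2(1+i)$ position discount is the usual choice for this variation of [10]; the grade-to-weight mapping $G_i$ is not specified in the text, so the example values below are placeholders):

```python
import math

def dcg(grade_weights, n=5):
    """grade_weights: label weights G_i of the documents in ranked order, best first."""
    return sum(g / math.log2(i + 1)
               for i, g in enumerate(grade_weights[:n], start=1))

# DCG-5 for a single query with placeholder label weights; the reported metric
# averages this value over all test queries.
print(dcg([10, 7, 3, 0.5, 0]))
```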
PARAMETERS. There are three parameters in QBrank: $\tau$, $\eta$, and $w_i$. In our experiments, $w_i$ is the absolute grade difference between the documents in each pair $\langle x_i, y_i \rangle$. We simply set $\tau$ to 0.05 and $\eta$ to 0.5 without detailed tuning. For a fair comparison, we used a single regression tree with 20 leaf nodes as the base regressor of both GBT and QBrank in our experiments.

EXPERIMENTS AND RESULTS. We are interested in the following questions: 1) How does GBT using labeled data compare with QBrank or RankSVM using the preference data extracted from the same labeled data? and 2) Is it useful to include the single-labeled data L1 in GBT and QBrank training? To this end, we considered the following six experiments for comparison: 1) GBT using L2, 2) GBT using L1, 3) GBT using L1+L2, 4) RankSVM using P2, 5) QBrank using P2, and 6) QBrank using P2+L1.

Table 1 presents the precision at K% on the test preferences P3 for the ranking functions learned by GBT with labeled training data L2 and by QBrank and RankSVM with the corresponding preference data P2. It shows that QBrank outperforms both GBT and RankSVM with respect to the precision at K% metric.

Table 1: Precision at K% for QBrank, GBT, and RankSVM

    K%        10%     20%     30%     40%     50%     60%     70%     80%     90%     100%
    QBrank    0.9446  0.903   0.8611  0.8246  0.7938  0.7673  0.7435  0.7218  0.7015  0.6834
    GBT       0.9328  0.8939  0.8557  0.8199  0.7899  0.7637  0.7399  0.7176  0.6977  0.6803
    RankSVM   0.8524  0.8152  0.7839  0.7578  0.7357  0.7151  0.6957  0.6779  0.6615  0.6465

The DCG-5 for RankSVM using P2 is 6.181, while the DCG-5 curves for the other five methods are shown in Figure 1, from which we can see that it is useful to include the single-labeled data L1 in GBT training. In the case of preference learning, no preference pairs can be extracted from single-labeled data; therefore, existing methods such as RankSVM, RankNet, and RankBoost, which are formulated for preference data only, cannot take advantage of such data. The QBrank framework can combine preference data and labeled data in a natural way. From Figure 1, we can see that QBrank using combined preference data and labeled data outperforms both QBrank and RankSVM using preference data only, which indicates that single-labeled data are also useful for QBrank training. Another observation is that GBT using labeled data is significantly worse than QBrank using preference data extracted from the same labeled data.³ We suspect the deeper reason lies in the fundamental difference between a ranking problem and a regression problem. GBT is designed for regression problems and is therefore not necessarily the optimal choice for ranking problems: ranking Web search results is fundamentally a preference learning problem rather than a regression problem, i.e., we just want to rank more relevant results higher than less relevant ones; we do not need to predict the grades of the documents accurately. Notice that we exclude all tied data (pairs of documents with the same grades) when converting the absolute relevance judgments to preference data, which can be a significant loss of information; including tied data might further improve performance. We further explore the effectiveness of including single-labeled data in QBrank by plotting the DCG against the iteration number in Figure 1. The results indicate the usefulness of single-labeled data in QBrank; the clear convergence trend of QBrank is also demonstrated.

³A 1% DCG gain is considered significant on this data set for commercial search engines.

[Figure 1: DCG-5 versus number of iterations (number of trees) for GBT using L2, GBT using L1, GBT using L1+L2, QBrank using P2, and QBrank using P2+L1. The DCG-5 for RankSVM using P2 is 6.181.]

5 Conclusions and Future Work

We proposed a general boosting method for optimizing complex loss functions. We also applied the general framework to the problem of learning ranking functions. Experimental results using data from a commercial search engine show that our approach leads to significant improvements. In future work, 1) we will add regularization to the preference part of the objective function; 2) we plan to apply our general boosting method to other structured learning problems; and 3) we will also explore other applications where both preference and labeled data are available for training ranking functions.

Appendix: Convergence results

We introduce a few definitions (Definitions 1-4): the 1-norm of a function with respect to the class $\mathcal{H}$, scale-invariance of $\mathcal{H}$, and local and global diagonal upper bounds on the Hessian of $\mathcal{R}$; their formal statements will appear, together with the full proofs, in a separate publication. Although we only consider global upper bounds here, it is easy to see that results with respect to local upper bounds can also be established.
Theorem 1 Consider Algorithm 1, where $\mathcal{R}$ is a convex function of $h$. Let $W$ be a global diagonal upper bound on the Hessian of $\mathcal{R}$, and assume that $\mathcal{H}$ is scale-invariant. If the normalized step sizes are chosen appropriately, then $\mathcal{R}(h_k)$ converges to a risk no worse than that of any target function $\bar{h}$, at a rate governed by the step-size sequence and by the 1-norm of $\bar{h}$.

The proof is a systematic application of the idea outlined earlier and will be detailed in a separate publication. In practice, one often sets the step size $s_k$ to a small constant. The convergence results show that in order to achieve a risk not much worse than that of any target function $\bar{h}$, the approximation function does not need to be very complex, when complexity is measured by its 1-norm. It is also important to note that the quantities appearing in the analysis do not depend on the number of samples. These results imply that, statistically, Algorithm 1 (with a small step size) has an implicit regularization effect that prevents the procedure from overfitting the data. Standard empirical process techniques can then be applied to obtain generalization bounds for Algorithm 1.

References

[1] Balcan, N., Beygelzimer, A., Langford, J., and Sorkin, G. Robust reductions from ranking to classification. Manuscript, 2007.

[2] Bertsekas, D. Nonlinear Programming. Athena Scientific, second edition, 1999.

[3] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. Learning to rank using gradient descent. Proc. of Intl. Conf. on Machine Learning (ICML) (2005).

[4] Dietterich, T. G., Ashenfelter, A., and Bulatov, Y. Training conditional random fields via gradient tree boosting. Proc. of Intl. Conf. on Machine Learning (ICML) (2004).

[5] Clemencon, S., Lugosi, G., and Vayatis, N. Ranking and scoring using empirical risk minimization. Proc. of COLT (2005).

[6] Cohen, W. W., Schapire, R. E., and Singer, Y. Learning to order things. Journal of Artificial Intelligence Research 10 (1999), 243-270.

[7] Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4 (2003), 933-969.

[8] Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 5 (2001), 1189-1232.

[9] Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers (2000), MIT Press, 115-132.

[10] Jarvelin, K., and Kekalainen, J. IR evaluation methods for retrieving highly relevant documents. Proc. of ACM SIGIR Conference (2000).

[11] Joachims, T. Optimizing search engines using clickthrough data. Proc. of ACM SIGKDD Conference (2002).

[12] Joachims, T., Granka, L., Pan, B., and Gay, G.
Accurately interpreting clickthrough data as implicit feedback. Proc. of ACM SIGIR Conference (2005).

[13] Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6 (2005), 1453-1484.