SIGIR 2007 Proceedings, Session 16: Learning to Rank II

FRank: A Ranking Method with Fidelity Loss

Ming-Feng Tsai (1), Tie-Yan Liu (2), Tao Qin (3), Hsin-Hsi Chen (1), Wei-Ying Ma (2)

(1) Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan; mftsai@nlg.csie.ntu.edu.tw, hhchen@ntu.edu.tw
(2) Microsoft Research Asia, 4F, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, China; {tyliu, wyma}@microsoft.com
(3) Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; tsintao@gmail.com

This work was conducted while the first author was a visiting student at Microsoft Research Asia.

ABSTRACT

The ranking problem is becoming important in many fields, especially in information retrieval (IR). Many machine learning techniques have been proposed for it, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is based on a probabilistic ranking framework, has led to promising results and has been applied in a commercial Web search engine. In this paper we study the probabilistic ranking framework further and propose a novel loss function, named fidelity loss, for measuring ranking loss. The fidelity loss not only inherits the effective properties of the probabilistic ranking framework in RankNet, but also possesses new properties that are helpful for ranking: it can reach zero for each document pair, and it has a finite upper bound, which is necessary for conducting query-level normalization. We also propose an algorithm named FRank, based on a generalized additive model, for minimizing the fidelity loss and learning an effective ranking function. We evaluated the proposed algorithm on two datasets: a TREC dataset and a real Web search dataset. The experimental results show that FRank outperforms other learning-based ranking methods on both the conventional IR problem and Web search.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models, Search process

General Terms: Algorithms, Experimentation

Keywords: FRank, Fidelity Loss, Learning to Rank

1. INTRODUCTION

In this era of information explosion, accurately locating relevant information based on user information needs has become important. The IR problem can be formulated as a ranking problem: given a query and a set of documents, an IR system should return a ranked list of documents. Several methods have been proposed for this problem in the literature, such as the Boolean model, the vector space model, the probabilistic model, and the language model; these can be regarded as empirical IR methods. In addition to traditional IR approaches, machine learning techniques are becoming more widely used for the ranking problem of IR; learning-based methods aim to use labeled data to learn an effective ranking function. There are several ways to apply learning-based methods to IR. One may regard IR as a binary classification problem in which documents are categorized as relevant or irrelevant. However, for real Web search engines, pages cannot simply be classified into two types relative to user information needs; they should have multiple grades, such as highly relevant, partially relevant, definitely irrelevant, and so on. Therefore, several learning-based methods, such as RankBoost [8], RankSVM [12], and RankNet [4], view the problem as a ranking problem. These methods mainly construct pairs of documents and use machine learning techniques to minimize the number of misordered pairs.
According to [4], the probabilistic ranking framework has many effective properties for ranking, such as a pair-wise differentiable loss, and can better model multiple relevance levels; the RankNet algorithm built on it performs well in practice and has been used in a commercial Web search engine. However, the loss function in RankNet still has some problems: for instance, it has no real minimal loss and no appropriate upper bound. These problems may make the trained ranking function inaccurate and may bias the training process toward some hard pairs. Therefore, we conduct further study on the probabilistic ranking framework and propose a novel loss function named fidelity loss. The fidelity loss is inspired by the concept of fidelity [14], which is widely used in physics for measuring the difference between two states of a quantum system [2]. The fidelity loss function inherits the properties of the probabilistic ranking framework and has several additional characteristics that are useful for ranking. For instance, the fidelity loss can reach zero and is bounded between 0 and 1 for each document pair; this property is necessary for conducting query-level normalization. To minimize the fidelity loss efficiently, we further propose a ranking algorithm based on a generalized additive model. Our algorithm, named Fidelity Rank (FRank), combines the probabilistic ranking framework with the generalized additive model. We evaluated FRank on both a TREC dataset and a large-scale Web search dataset that comprises 19,600 queries collected from a commercial Web search engine. The experimental results show that the proposed approach outperforms other ranking methods on both the conventional IR problem and Web search, and that the improvements are statistically significant.

The remainder of this paper is organized as follows. We briefly review previous research in Section 2. In Section 3, we revisit the probabilistic ranking framework and the fidelity measure, and describe the FRank algorithm. Section 4 reports experimental results and discusses the relationship between different methods. We conclude the paper and discuss future work in Section 5.

2. RELATED WORK

Many machine learning techniques [1, 4, 5, 6, 8, 12, 13] have been studied for IR. In [13], Nallapati treats IR as a binary classification problem that directly classifies a document as relevant or irrelevant with respect to the given query. However, in real Web search, a page has multiple relevance levels, such as highly relevant, partially relevant, and definitely irrelevant. Therefore, some studies regard IR as a ranking problem: highly relevant web pages are ranked higher than partially relevant ones, and partially relevant ones above irrelevant ones. In the literature, many learning-based ranking algorithms have been proposed; typical examples include RankBoost, RankSVM, and RankNet.

Freund et al. [8] use the Boosting approach to learn a ranking function and proposed RankBoost to solve the problem of combining preferences. Their method aims to minimize the weighted number of pairs of instances that are misordered by the final ranking function. The algorithm can be summarized as follows. In each round t, RankBoost chooses alpha_t and the weak learner h_t so as to minimize the pair-wise loss

$$Z_t = \sum_{(x_i, x_j)} D_t(x_i, x_j)\, \exp\bigl(\alpha_t\,(h_t(x_i) - h_t(x_j))\bigr),$$

where x_i is the instance that should be ranked higher than x_j, and D_t(x_i, x_j) is the weight of the pair (x_i, x_j). After the optimal weak learner h_t is selected, the weight D_t(x_i, x_j) is adjusted with respect to alpha_t and h_t: the weight of a pair is decreased if h_t gives it a correct ranking (h_t(x_i) > h_t(x_j)) and increased otherwise. Finally, the algorithm outputs the ranking function as a combination of the selected weak learners. Owing to the greedy search nature of Boosting, RankBoost can be implemented in parallel and thus scales up to large datasets.
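To make the weighted pair-wise loss concrete, the following minimal Python sketch evaluates Z_t for a candidate weak learner and selects the best (alpha_t, h_t) over a hypothetical grid of coefficients. The data layout (lists of pairs and aligned weights) and the grid search are assumptions for illustration, not the authors' implementation.

```python
import math

def pair_loss_z(pairs, weights, h, alpha):
    """Weighted pair-wise loss Z_t described above.

    pairs   : list of (x_i, x_j), where x_i should rank above x_j
    weights : list of current pair weights D_t(x_i, x_j), aligned with pairs
    h       : candidate weak learner, h(x) -> real value
    alpha   : candidate combination coefficient alpha_t
    """
    return sum(d * math.exp(alpha * (h(xi) - h(xj)))
               for d, (xi, xj) in zip(weights, pairs))

def choose_round(pairs, weights, candidates, alphas):
    """Pick the (alpha_t, h_t) combination minimizing Z_t (grid search assumed)."""
    return min(((a, h) for h in candidates for a in alphas),
               key=lambda ah: pair_loss_z(pairs, weights, ah[1], ah[0]))
```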
Joachims [12] proposes the RankSVM algorithm, which uses Support Vector Machines to optimize search performance using click-through data. RankSVM aims to minimize the number of discordant pairs, which is similar to RankBoost, and additionally to maximize the margin of pairs. The margin maximization is equivalent to minimizing the L2-norm of the hyperplane parameter omega. Given a query q, if the ground truth asserts that document d_i is more relevant than d_j, the constraint of RankSVM is omega . Phi(q, d_i) > omega . Phi(q, d_j), where Phi(q, d) is the feature vector calculated from document d relative to query q. The ranking problem can then be expressed as the following constrained optimization problem:

$$\min:\; V(\omega, \xi) = \frac{1}{2}\,\omega \cdot \omega + C \sum \xi_{i,j,k}$$

$$\text{s.t.}\quad \forall (d_i, d_j) \in r_1:\; \omega \cdot \Phi(q_1, d_i) \ge \omega \cdot \Phi(q_1, d_j) + 1 - \xi_{i,j,1}$$
$$\cdots$$
$$\forall (d_i, d_j) \in r_n:\; \omega \cdot \Phi(q_n, d_i) \ge \omega \cdot \Phi(q_n, d_j) + 1 - \xi_{i,j,n}$$
$$\forall i\, \forall j\, \forall k:\; \xi_{i,j,k} \ge 0.$$

From the theoretical perspective, RankSVM is well founded in the structural risk minimization framework and has been verified in controlled experiments. Moreover, an advantage of RankSVM is that it needs no human-labeled data: it can automatically learn the ranking function from click-through data.

Burges et al. [4] propose a probabilistic ranking framework and use a neural network to minimize a pair-wise differentiable loss in a method named RankNet. Assume that the ranking model is f: R^d -> R, so that the rank order of a set of samples is specified by the real values the model outputs; f(d_i) > f(d_j) is taken to mean that the model asserts d_i > d_j, i.e., that d_i should be ranked higher than d_j. Given a set of pairs of training samples (d_i, d_j) together with the target probability P*_ij of d_i being ranked higher than d_j, a logistic function can be used to map the output of the ranking function to a probability:

$$P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}},$$

where o_ij = f(d_i) - f(d_j) and P_ij = P(d_i ranked above d_j). Then cross entropy is adopted as the loss function for training:

$$C_{ij} \equiv C(o_{ij}) = -P^*_{ij} \log P_{ij} - (1 - P^*_{ij}) \log(1 - P_{ij}) = -P^*_{ij}\, o_{ij} + \log(1 + e^{o_{ij}}),$$

where C_ij is the cross entropy loss of a pair (d_i, d_j), P*_ij is the desired probability, and P_ij is the modeled probability. RankNet uses a back-propagation neural network to minimize this criterion during training and thereby learns a ranking function. Compared with RankBoost and RankSVM, the loss function in RankNet is pair-wise differentiable, which can be regarded as an advantage in cases where ground truths come from several annotators who may disagree. RankNet performs well in practice and has been successfully applied in a commercial search engine.
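As a concrete reference for the two formulas above, here is a minimal Python sketch of the pair-wise cross entropy loss. It is an illustration only and omits the numerical-overflow guards a real implementation would need for large |o_ij|.

```python
import math

def cross_entropy_loss(f_di, f_dj, p_target):
    """RankNet-style pair loss C(o_ij) from the formulas above.

    f_di, f_dj : ranking-function outputs f(d_i) and f(d_j)
    p_target   : target probability P*_ij that d_i ranks above d_j
    """
    o_ij = f_di - f_dj
    # Closed form of the cross entropy: -P*_ij * o_ij + log(1 + e^{o_ij})
    return -p_target * o_ij + math.log(1.0 + math.exp(o_ij))
```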
3. FIDELITY RANK

Inspired by the success of RankNet, we investigate the probabilistic ranking framework and propose a novel loss function named fidelity loss. In this section we briefly describe the probabilistic ranking framework, discuss the fidelity loss and its useful properties, and present a ranking algorithm named FRank for minimizing this loss and learning an effective ranking function.

3.1 Probabilistic Ranking Framework

According to [4], the mapping from the output of the ranking function to a probability under the probabilistic ranking framework has the following properties:

1. The ranking model puts consistency requirements on the P_ij (for example, if P(d_i > d_j) = 0.5 and P(d_j > d_k) = 0.5, then P(d_i > d_k) = 0.5);
2. Any set of adjacency posteriors can uniquely identify a target posterior 0 <= P*_ij <= 1 for each pair (d_i, d_j);
3. The expected strengthening (or weakening) of confidence in the ordering of a given pair holds (for instance, if P(d_i > d_j) = 0.6 and P(d_j > d_k) = 0.6, then P(d_i > d_k) > 0.6).

When the outputs are mapped to probabilities, general measures of probability distributions can be applied as the training criterion, such as cross entropy, KL-divergence, and information radius. In RankNet [4], cross entropy is adopted for this purpose; the loss of a pair can be represented by C_ij = -P*_ij o_ij + log(1 + e^{o_ij}), which is shown in Figure 1(a). The cross entropy loss function provides a principled way to make the samples have the same rank as the ground truths. Previous work shows that the loss function is effective and that the corresponding RankNet algorithm performs well in practice. However, if we carefully investigate Figure 1(a), we find there are still some problems with this loss function.

· The cross entropy loss cannot achieve the real minimal loss, zero, except for pairs whose posterior probability is 0 or 1; this may make the trained ranking function inaccurate.
· The cross entropy loss of a pair has no upper bound; this may cause the training procedure to become biased by some hard pairs.
· Query-level normalization is hard to define on the cross entropy loss; this may cause the whole model to be dominated by queries with a large number of document pairs.

Considering these problems, it is worthwhile to investigate other measures of probability distributions that provide better properties for ranking within the probabilistic ranking framework.

[Figure 1: Loss Function: (a) Cross Entropy Loss Function; (b) Fidelity Loss Function. Both panels plot the loss against o_ij for P* = 0.0, 0.5, and 1.0.]

3.2 Fidelity Loss Function

After investigating many different measures of probability distributions, such as KL-divergence and information radius, we find that an ideal candidate criterion for the ranking problem is fidelity [14]. Fidelity is originally a distance metric in physics for measuring the difference between two states of a quantum system. The fidelity of two probability distributions is defined by

$$F(p_x, q_x) \equiv \sum_x \sqrt{p_x q_x},$$

where {p_x} and {q_x} are the probability distributions. Note that when the distributions {p_x} and {q_x} are identical, F(p_x, q_x) = sum_x p_x = 1. A nice geometric interpretation is that the fidelity is the inner product between vectors with components sqrt(p_x) and sqrt(q_x), which lie on a unit sphere. Besides being a useful measurement in mathematics and physics, fidelity is a meaningful candidate for the loss function of the probabilistic ranking framework. We adapt the fidelity measure so that the loss of a pair is measured by

$$F_{ij} \equiv F(o_{ij}) = 1 - \left( \sqrt{P^*_{ij} \cdot P_{ij}} + \sqrt{(1 - P^*_{ij})(1 - P_{ij})} \right),$$

where P*_ij is the given target value for the posterior probability and P_ij is the modeled probability. After the logistic function is adopted, F_ij becomes

$$F_{ij} = 1 - \left\{ \left( P^*_{ij} \cdot \frac{e^{o_{ij}}}{1 + e^{o_{ij}}} \right)^{1/2} + \left( (1 - P^*_{ij}) \cdot \frac{1}{1 + e^{o_{ij}}} \right)^{1/2} \right\}.$$

Figure 1(b) plots F_ij as a function of o_ij for the three values P*_ij in {0, 0.5, 1}. If P*_ij = 0.5, meaning that no information is available as to the relative rank of the two samples, F_ij has its minimum at the origin, and the loss of such a pair can reach zero. This gives us a principled way of training samples so that they attain the same rank as the ground truth.
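The following small Python sketch transcribes the pair-level fidelity loss above (illustrative only). Note that it returns exactly zero whenever the modeled probability matches the target and never exceeds 1; for example, with p_target = 0.5 the loss vanishes at o_ij = 0, matching Figure 1(b).

```python
import math

def fidelity_loss(f_di, f_dj, p_target):
    """Fidelity loss F_ij of a pair, following the formula above.

    f_di, f_dj : ranking-function outputs for documents d_i and d_j
    p_target   : target probability P*_ij
    """
    o_ij = f_di - f_dj
    p_ij = 1.0 / (1.0 + math.exp(-o_ij))   # logistic mapping of o_ij
    return 1.0 - (math.sqrt(p_target * p_ij)
                  + math.sqrt((1.0 - p_target) * (1.0 - p_ij)))
```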
The fidelity loss inherits all three properties of the probabilistic ranking framework in RankNet. In addition, it possesses new properties that are helpful for ranking. We describe these properties in detail as follows.

1. The fidelity loss can attain the real minimal loss, zero, for each desired probability P*_ij of a pair. Unlike the cross entropy, the fidelity loss function has zero loss for each document pair. This characteristic makes the trained ranking function more accurate.
2. The fidelity loss of a pair is bounded between 0 and 1. If the loss of a pair had no appropriate upper bound, the loss of some hard pairs could grow without limit as they continue to be placed in the wrong position; this may degrade performance because the whole model becomes biased by these hard pairs. In this sense, the fidelity loss is superior to cross entropy.
3. The fidelity loss of a query can easily be normalized by dividing by the number of pairs in the given query. This query-level normalization is more consistent with IR evaluation measures such as MAP and NDCG, because they are mostly computed at the query level.

The fidelity loss over all queries can therefore be defined as

$$\sum_q \frac{1}{|\#q|} \sum_{ij} F_{ij},$$

where |#q| is the number of pairs for query q and F_ij is the fidelity loss of a pair. In this way, each query contributes equally to the total loss in the training process, so the model is not dominated by queries with a large number of document pairs. Although this third, query-level property looks natural, the analogous extension is non-trivial for the previous methods: because their loss functions have no appropriate upper bound, each query is not guaranteed to contribute equally to the total loss. With these new properties, we regard the proposed fidelity loss as a more suitable measure of ranking loss.
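Reusing the fidelity_loss helper sketched above, the query-level normalization can be written as a direct transcription of the summation; the container layout here is an assumed illustration, not the authors' data structures.

```python
def total_fidelity_loss(pairs_by_query, score, target):
    """Sum over queries of (1/|#q|) * (sum of pair losses F_ij).

    pairs_by_query : dict mapping query id -> list of (doc_i, doc_j) pairs
    score          : function doc -> current ranking score f(doc)
    target         : function (doc_i, doc_j) -> target probability P*_ij
    Requires the fidelity_loss function from the previous sketch.
    """
    total = 0.0
    for q, pairs in pairs_by_query.items():
        if not pairs:
            continue
        q_loss = sum(fidelity_loss(score(di), score(dj), target(di, dj))
                     for di, dj in pairs)
        total += q_loss / len(pairs)   # divide by |#q|: every query counts equally
    return total
```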
In the next subsection, we discuss how to efficiently minimize this loss function so as to learn an effective ranking function.

3.3 Algorithm Derivation

Considering efficiency and scalability on large-scale datasets, we choose a generalized additive model, similar to Boosting, for minimizing the fidelity loss. The primary consideration is that the additive model is easier to implement in parallel, since the examination of features is independent in the training process. In the additive model, the final ranking function is defined as

$$H(x) = \sum_t \alpha_t h_t(x),$$

where h_t(x) is a weak learner, alpha_t is the combination coefficient of the weak learner, and t is the index of iterations. Consider the ranking function after the k-th iteration,

$$H_k(x) = \sum_{t=1}^{k} \alpha_t h_t(x), \quad \text{or} \quad H_k(x) = H_{k-1}(x) + \alpha_k h_k(x).$$

Then the fidelity loss of H_k(x) over all queries is

$$J(H_k) = \sum_q \frac{1}{|\#q|} \sum_{ij} F^{(k)}_{ij},$$

where

$$F^{(k)}_{ij} = 1 - \left( P^*_{ij} \cdot \frac{e^{H_k(x_i) - H_k(x_j)}}{1 + e^{H_k(x_i) - H_k(x_j)}} \right)^{1/2} - \left( (1 - P^*_{ij}) \cdot \frac{1}{1 + e^{H_k(x_i) - H_k(x_j)}} \right)^{1/2}.$$

Given H_k(x) = H_{k-1}(x) + alpha_k h_k(x) and denoting

$$D(i, j) = \frac{1}{|\#q|}, \tag{1}$$

the fidelity loss J(H_k) can be formulated as

$$J(H_k) = \sum_{i,j} D(i, j) \left\{ 1 - \left( P^*_{ij} \cdot \frac{e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}}{1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}} \right)^{1/2} - \left( (1 - P^*_{ij}) \cdot \frac{1}{1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}} \right)^{1/2} \right\}, \tag{2}$$

where H^{i,j}_{k-1} \equiv H_{k-1}(x_i) - H_{k-1}(x_j) and h^{i,j}_k \equiv h_k(x_i) - h_k(x_j).

Setting the derivative of Equation (2) with respect to alpha_k to zero, we have

$$\sum_{i,j} D(i, j) \left\{ -\frac{1}{2} \left(P^*_{ij}\right)^{1/2} \left( \frac{e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}}{1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}} \right)^{-1/2} \cdot \frac{h^{i,j}_k \, e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}}{\left(1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}\right)^2} - \frac{1}{2} \left(1 - P^*_{ij}\right)^{1/2} \left( \frac{1}{1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}} \right)^{-1/2} \cdot \frac{-h^{i,j}_k \, e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}}{\left(1 + e^{H^{i,j}_{k-1} + \alpha_k h^{i,j}_k}\right)^2} \right\} = 0.$$

For a general weak learner h_k(x), solving alpha_k in closed form from this equation is quite difficult. For simplicity, we adopt the same choice of weak learners as in [8], namely binary weak learners, to construct the final ranking function. When a binary weak learner is introduced, the equation can be simplified because h^{i,j}_k only takes the values -1, 0, and 1. The equation can then be expressed as

$$\sum_{h^{i,j}_k = 1} D(i, j) \left[ \frac{\left(P^*_{ij}\, e^{H^{i,j}_{k-1}}\right)^{1/2} e^{-\alpha_k}}{\left(e^{-\alpha_k} + e^{H^{i,j}_{k-1}}\right)^{3/2}} - \frac{e^{H^{i,j}_{k-1}} \left(1 - P^*_{ij}\right)^{1/2} e^{-\alpha_k}}{\left(e^{-\alpha_k} \left(e^{-\alpha_k} + e^{H^{i,j}_{k-1}}\right)^{3}\right)^{1/2}} \right] = \sum_{h^{i,j}_k = -1} D(i, j) \left[ \frac{\left(P^*_{ij}\, e^{H^{i,j}_{k-1}}\right)^{1/2} e^{\alpha_k}}{\left(e^{\alpha_k} + e^{H^{i,j}_{k-1}}\right)^{3/2}} - \frac{e^{H^{i,j}_{k-1}} \left(1 - P^*_{ij}\right)^{1/2} e^{\alpha_k}}{\left(e^{\alpha_k} \left(e^{\alpha_k} + e^{H^{i,j}_{k-1}}\right)^{3}\right)^{1/2}} \right].$$

With further derivations and relaxations, we eventually obtain

$$\alpha_k = \frac{1}{2} \ln \frac{\sum_{h^{i,j}_k = 1} W_{i,j}}{\sum_{h^{i,j}_k = -1} W_{i,j}}, \tag{3}$$

where

$$W_{i,j} = D(i, j)\, \frac{\left(P^*_{ij}\, e^{H^{i,j}_{k-1}}\right)^{1/2} - e^{H^{i,j}_{k-1}} \left(1 - P^*_{ij}\right)^{1/2}}{\left(1 + e^{H^{i,j}_{k-1}}\right)^{3/2}}. \tag{4}$$

Consequently, the algorithm operates as follows. In the k-th iteration, a new weak learner h_k(x) with the minimal fidelity loss J(H_k) is selected according to the current pair weights W_{i,j}, and it is combined with the previous weak learners using the combination coefficient alpha_k. In this stepwise manner, we eventually obtain the final ranking function H(x). We name this algorithm Fidelity Rank (FRank); its details are summarized in Algorithm 1.

Algorithm 1 Learning Algorithm of FRank
Input: pairs over all training queries, and weak learner candidates h_c(x), 1 <= c <= m, where m is the number of features
Initialization: pair weights by Eq. (1), and the number of weak learners k
for t = 1 to k do
  for c = 1 to m do
    Calculate alpha_{t,c} by Eq. (3)
    Calculate the fidelity loss over all queries by Eq. (2)
  end for
  h_t(x) <- the h_{t,c}(x) with the minimal fidelity loss
  alpha_t <- the corresponding alpha_{t,c}
  Update the pair weights by Eq. (4)
end for
Output: H(x) = sum_{t=1}^{k} alpha_t h_t(x)
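To illustrate how Equations (1) through (4) fit together in Algorithm 1, here is a compact Python sketch of the training loop with binary weak learners. It is a simplified illustration under stated assumptions (the per-query weights D(i, j) = 1/|#q| are precomputed into one flat list, and a small epsilon plus absolute values guard the logarithm), not the authors' implementation.

```python
import math

def frank_train(pairs, p_targets, weak_learners, n_rounds, D):
    """Sketch of the FRank loop (Algorithm 1).

    pairs         : list of (x_i, x_j) feature vectors over all queries
    p_targets     : target probabilities P*_ij, aligned with pairs
    weak_learners : candidate binary functions h_c(x) -> {0, 1}
    D             : pair weights D(i, j) = 1/|#q|, per Eq. (1)
    """
    eps = 1e-12
    model = []  # [(alpha_t, h_t), ...] defining H(x)

    def H(x):                                   # current ensemble score
        return sum(a * h(x) for a, h in model)

    def pair_fidelity(o, p_star):               # F_ij for margin o (Section 3.2)
        p = 1.0 / (1.0 + math.exp(-o))
        return 1.0 - (math.sqrt(p_star * p) + math.sqrt((1 - p_star) * (1 - p)))

    for _ in range(n_rounds):
        margins = [H(xi) - H(xj) for xi, xj in pairs]        # H^{i,j}_{k-1}
        # Pair weights W_{i,j} from Eq. (4)
        W = [d * (math.sqrt(p * math.exp(m))
                  - math.exp(m) * math.sqrt(1.0 - p)) / (1.0 + math.exp(m)) ** 1.5
             for d, p, m in zip(D, p_targets, margins)]
        best = None
        for h in weak_learners:
            diffs = [h(xi) - h(xj) for xi, xj in pairs]      # h^{i,j}_k in {-1,0,1}
            w_pos = sum(w for w, s in zip(W, diffs) if s == 1)
            w_neg = sum(w for w, s in zip(W, diffs) if s == -1)
            alpha = 0.5 * math.log((abs(w_pos) + eps) / (abs(w_neg) + eps))  # Eq. (3)
            loss = sum(d * pair_fidelity(m + alpha * s, p)   # Eq. (2)
                       for d, m, s, p in zip(D, margins, diffs, p_targets))
            if best is None or loss < best[0]:
                best = (loss, alpha, h)
        model.append((best[1], best[2]))
    return model
```

Because each candidate feature is examined independently inside the inner loop, this is also the step that the paper suggests can be parallelized.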
4. EXPERIMENTS

4.1 Evaluation Measures

Precision is an IR performance measure that quantifies the fraction of retrieved documents known to be relevant. P@n measures the accuracy within the top n results of the returned ranked list for a query:

$$P@n = \frac{\text{number of relevant docs in top } n \text{ results}}{n}.$$

In general, most popular Web search engines display ten documents per result page, and many users browse only the results on the first page. Therefore, we use precision within ten returned documents to evaluate the performance of each ranking algorithm.

In addition, MAP [17, 18] is used as an evaluation metric. Given a query q_i, its average precision (AP) is calculated as

$$AP_i = \frac{\sum_{j=1}^{N} P(j) \times \mathrm{pos}(j)}{\text{number of relevant docs in } q_i},$$

where N is the number of documents retrieved, P(j) is the precision value at position j, and pos(j) is a binary function indicating whether the document at position j is relevant. MAP is then obtained by averaging the AP values of all the queries.

Considering that the Web search dataset has multiple rating grades, we also use the normalized discounted cumulative gain (NDCG) [9, 10] to evaluate the performance of the ranking algorithms. For a given query, the NDCG value for a ranked document at position i is computed as follows:

1. Compute the gain 2^{r(j)} - 1 of each document j;
2. Discount the gain of each document j by its ranked position, giving (2^{r(j)} - 1) / log_2(1 + j);
3. Cumulate the discounted gains up to position i, giving sum_{j=1}^{i} (2^{r(j)} - 1) / log_2(1 + j);
4. Normalize the discounted cumulative gain at position i by the constant n_i, giving NDCG@i = n_i * sum_{j=1}^{i} (2^{r(j)} - 1) / log_2(1 + j),

where r(j) is the rating of the j-th document in the ordered list, and the normalization constant n_i is chosen so that a perfect ordering gets an NDCG value of 1. We use NDCG within ten returned documents to evaluate performance in the following experiments.
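The three measures above are easy to transcribe. The following Python sketch is a plain re-statement of the formulas, assuming binary labels for P@n and AP (with all relevant documents appearing in the ranked list) and graded ratings for NDCG; it is illustrative, not an official evaluation script.

```python
import math

def precision_at_n(relevance, n):
    """P@n: fraction of the top-n results that are relevant (binary labels)."""
    return sum(relevance[:n]) / float(n)

def average_precision(relevance):
    """AP of one query; `relevance` is the binary label list in ranked order."""
    hits, total = 0, 0.0
    for j, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / float(j)        # P(j) * pos(j)
    return total / hits if hits else 0.0    # denominator: number of relevant docs

def ndcg_at_n(ratings, n):
    """NDCG@n with gain 2^r - 1 and discount 1/log2(1 + j)."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rs, start=1))
    ideal = dcg(sorted(ratings, reverse=True)[:n])   # perfect ordering gives n_i
    return dcg(ratings[:n]) / ideal if ideal > 0 else 0.0
```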
4.2 Comparison Methods

Three learning-based ranking algorithms are compared in this study: RankBoost [8], RankNet [4], and RankSVM [12]. For RankBoost, we implemented binary weak learners whose outputs are 0 and 1. For RankNet, we use linear and two-layer perceptrons, denoted as RankNet Linear and RankNet TwoLayer. For RankSVM, we directly use the binary code of SVMlight [11]. In addition, the non-learning-based IR model BM25 [18] is used as a baseline for comparison with the conventional IR approach.

4.3 Experiment on TREC Dataset

The TREC dataset was crawled from the .gov domain in early 2002 and has been used as the data collection of the Web Track since TREC 2002. It contains 1,053,110 pages with 11,164,829 hyperlinks in total. When conducting experiments on this corpus, we used the topic distillation task in the Web Track of TREC 2003 [7] as the query set, which has 50 queries in total. The ground truths of this task are provided by the TREC committee with binary judgments: relevant and irrelevant. The number of relevant pages per query spans from 1 to 86. Fourteen features were extracted from the TREC dataset for training, as listed in Table 1.

Table 1: The extracted features of the TREC dataset
  Feature Name                 Number of Features
  BM25 [18]                    1
  MSRA1000 [19]                1
  PageRank [15]                1
  HostRank [20]                1
  Relevance Propagation [16]   10

We use four-fold cross-validation to evaluate performance, owing to the size of the TREC dataset. The experimental results are plotted in Figure 2, in which the values are averaged scores over the four trials.

[Figure 2: Experimental Results on TREC Dataset: MAP, P@1, P@3, P@5, NDCG@1, NDCG@3, and NDCG@5 for BM25 (non-learning), RankSVM, RankBoost, RankNet_Linear, RankNet_TwoLayer, and FRank.]

The proposed FRank algorithm is a relatively effective ranking algorithm, as observed from the comparison with the other ranking algorithms in Figure 2. All learning-based methods outperform BM25; FRank even obtains about 40% improvement in MAP over the non-learning-based method BM25. This indicates that learning to rank is a promising direction for IR. FRank gains more improvement at the top positions of the retrieved results, especially precision at position 1, as indicated in Figure 2. This result suggests that the proposed method is also suitable for Web search, which emphasizes the top positions of the results. Therefore, an experiment on real Web search is conducted in the following subsection.

4.4 Experiment on Web Search Dataset

4.4.1 Web Search Dataset

The dataset consists of 19,600 queries and more than 3,300,000 web pages from a commercial Web search engine. These web pages are partially labeled with ratings from 1 (irrelevant) to 5 (highly relevant) by annotators. Because the dataset is too large to label completely, the unlabeled pages are given the rating 0. Given a query, a page is represented by query-dependent (e.g., term frequency) and query-independent (e.g., PageRank) features extracted from the page itself and the preprocessed indices. The total number of features is 619. Considering that these features reflect business secrets of the search engine company, we do not describe them in detail; this does not influence our experimental study, since all the methods under investigation operate on the same feature set. For ease of experimental study, we divide the dataset into three sets: 12,000 queries for training, 3,800 queries for validation, and 3,800 queries for testing. Some unlabeled pages may in fact be relevant, so performance would degrade if we used all unlabeled pages for training; therefore, 20 unlabeled documents are randomly selected as the documents with rating 0 for training. Our training dataset thus has 385,293 documents and contains 2,635,976 pairs in total for training. Table 2 summarizes the details of the training, validation, and testing datasets.

Table 2: Details of the Web search datasets
  Dataset      Number of queries   Number of docs
  Training     12,000              385,293
  Validation   3,800               663,457
  Testing      3,800               693,180
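The paper reports 2,635,976 training pairs but does not spell out the pair-generation step. A plausible sketch, under the usual convention that any two documents with different grades for the same query form one ordered pair (an assumption, not the authors' stated procedure), looks like this:

```python
from itertools import combinations

def build_pairs(docs_with_ratings):
    """Generate (higher, lower) document pairs for one query.

    docs_with_ratings : list of (doc_id, rating), ratings 0..5 as described above
    Returns triples (doc_hi, doc_lo, p_target) with P*_ij = 1.0, meaning
    doc_hi is desired to rank strictly above doc_lo.
    """
    pairs = []
    for (d1, r1), (d2, r2) in combinations(docs_with_ratings, 2):
        if r1 == r2:
            continue                      # equal grades yield no pair here
        hi, lo = (d1, d2) if r1 > r2 else (d2, d1)
        pairs.append((hi, lo, 1.0))
    return pairs
```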
4.4.2 Training Process of FRank

The training process is shown in Figure 3, in which the number of weak learners starts from 5. The proposed FRank algorithm indeed decreases the fidelity loss as the number of weak learners increases, and at the same time it increases the value of NDCG@10, as illustrated in Figure 3. These values change dramatically over the first 30 selected weak learners and then vary smoothly for the following weak learners.

[Figure 3: Training Phase of FRank on Web Search Training Dataset: fidelity loss and NDCG@10 versus the number of weak learners.]

4.4.3 Performance Comparisons

Only the experimental results of FRank, RankBoost, and RankNet on the Web search dataset are reported, since RankSVM runs out of memory on this dataset. After the training phase, the validation dataset is used to select the best parameter setting for each method, such as the number of weak learners for FRank and RankBoost, and the number of epochs for RankNet. This procedure is taken to guarantee better generalization of these ranking methods. Here, we use NDCG@10 as the criterion to select the parameter setting. Figure 4 plots the NDCG@10 curves of FRank, RankBoost, and RankNet on the validation data. FRank outperforms RankBoost at each number of weak learners, as demonstrated in Figure 4(a), where the number of weak learners starts from 10. RankBoost performs worse when the number of weak learners is fewer than 20; when it is over 20, RankBoost eventually obtains better performance. This is because the representational power of RankBoost is quite limited with few weak learners. Accordingly, we set the number of weak learners to 224 for FRank and 271 for RankBoost, as observed from the middle of the flat tail in Figure 4(a). In Figure 4(b), RankNet TwoLayer is observed to be more effective than RankNet Linear; however, its performance seems sensitive to the validation dataset. The performance of RankNet TwoLayer dropped when the number of epochs exceeded 10, which is likely a problem of overfitting. In contrast, ranking algorithms based on the generalized additive model, such as FRank and RankBoost, are more robust against overfitting; this robustness is an essential characteristic of the generalized additive model. We set the number of epochs to 25 for RankNet Linear and 9 for RankNet TwoLayer, as observed from Figure 4(b).

After setting the parameters of the models, we use these settings to examine the performance of the ranking algorithms on the testing dataset. A conventional IR model [3, 17, 18], BM25, is also evaluated for a comparison between learning-based and non-learning-based methods. In addition, an empirical approach using a linear combination of BM25 and PageRank is performed, with weight 0.8 for BM25 and 0.2 for PageRank (i.e., 0.8 x BM25 + 0.2 x PageRank); these weights were obtained by tuning. Figure 5 plots the NDCG values from position 1 to 10 of the ranking algorithms on the testing dataset. Simply using a conventional IR model is inadequate for Web search, as indicated by the comparison of BM25 with the learning-based methods; the learning-based methods are capable of leveraging the various features in a large-scale Web search dataset. The proposed FRank algorithm is more effective than the other learning-based ranking algorithms from NDCG@1 to NDCG@10; RankNet TwoLayer also performs well on this large-scale dataset. The results indicate that the probabilistic ranking framework is a suitable framework for learning to rank. Although both FRank and RankNet are based on this framework, FRank is capable of ranking more accurately than RankNet; this is consistent with the discussion of the superiority of the fidelity loss in Section 3.2. Moreover, the FRank algorithm performs well at the top positions of the ranking lists, as observed from the comparison of NDCG@1 with RankNet TwoLayer in Figure 5.
A significance test for FRank and RankNet TwoLayer at a confidence level of 98% is performed to verify whether the improvement is statistically significant. The corresponding p-values are 0.0114 for NDCG@1, 0.007 for NDCG@5, and 0.0056 for NDCG@10. This result indicates that, for Web search, FRank is significantly better than RankNet, and also significantly outperforms the other ranking algorithms.

[Figure 4: Experimental Results on Web Search Validation Dataset: (a) FRank and RankBoost (NDCG@10 versus the number of weak learners); (b) RankNet with Linear and Two-Layer Perceptrons (NDCG@10 versus the number of epochs).]

[Figure 5: Experimental Results of Ranking Algorithms on Web Search Testing Dataset: NDCG@1 through NDCG@10 for FRank, RankNet_TwoLayer, RankBoost, RankNet_Linear, BM25 + PageRank (non-learning), and BM25 (non-learning).]

4.4.4 Experiment on Different Sizes of Training Dataset

Considering that RankSVM ran out of memory on the full Web search training dataset, we further conducted several experiments on smaller-scale training datasets to provide more insight into RankSVM. In addition, investigating how the number of training queries affects performance is also essential for Web search. For this purpose, we separately trained the referenced ranking algorithms on 1,000, 2,000, 4,000, 8,000, and 12,000 queries. However, RankSVM ran out of memory with 8,000 queries in these experiments. This is mainly because RankSVM has to construct as many constraints as there are document pairs, so the number of variables in the dual problem becomes enormous when training on a large-scale dataset.

Table 3: NDCG@10 on the Web search testing dataset using different sizes of training dataset ("None" indicates that RankSVM ran out of memory)
  Number of Training Queries   1,000   2,000   4,000   8,000   12,000
  FRank                        0.723   0.725   0.730   0.732   0.734
  RankNet TwoLayer             0.699   0.711   0.720   0.727   0.729
  RankNet Linear               0.696   0.705   0.712   0.715   0.717
  RankBoost                    0.698   0.706   0.716   0.724   0.723
  RankSVM                      0.704   0.711   0.719   None    None
A linear kernel is used in the cases where RankSVM can operate, and the parameter c of RankSVM is tuned on the validation dataset, in a procedure similar to that of Section 4.4.3. Figure 6 plots the NDCG@10 results for each ranking algorithm trained on different numbers of queries; Table 3 also summarizes these results. We list several observations as follows.

[Figure 6: NDCG@10 on Web Search Testing Dataset using Different Sizes of Training Dataset: FRank, RankNet_TwoLayer, RankBoost, RankNet_Linear, and RankSVM versus the number of training queries.]

· The FRank algorithm performs effectively on small training datasets, since it introduces query-level normalization in the fidelity loss function. In contrast, the performance of the other ranking algorithms is relatively worse when the number of training queries is 1,000; this is because those models are statistically incomplete with few training queries.
· Using more training data does not guarantee improved performance. As illustrated in Figure 6, performance increases as training queries are added; however, when the number of queries exceeds 8,000, performance improves only slightly and sometimes even decreases (e.g., RankBoost).
· RankSVM performs as well as RankNet TwoLayer when the number of training queries is small. We therefore consider that if the scalability issue of RankSVM can be fixed, it may be an effective candidate for learning to rank in Web search.
· The probabilistic ranking framework performs well when the amount of training data is large. This is because more pair-wise information makes the trained ranking function more accurate.
· FRank outperforms the other ranking algorithms in all cases, and it is also more stable with respect to the number of training queries. These results suggest that FRank is more suitable for learning to rank in Web search.

5. CONCLUSIONS

This paper presents an approach for learning to rank with the goal of improving the accuracy of conventional IR and Web search. On the basis of the probabilistic ranking framework, we propose a novel loss function named fidelity loss to measure the loss of ranking, and accordingly present a ranking algorithm named FRank based on a generalized additive model. Experiments with significance tests show that the FRank algorithm performs well in practice, on both a conventional TREC dataset and a real Web search dataset. Several research directions remain for future work:

· On the theoretical side, we hope to investigate how to prove a generalization bound based on the probabilistic ranking framework.
· Considering that many approaches can be applied to minimize the fidelity loss, we would like to study whether it is more effective to combine the fidelity loss with other machine learning techniques, such as kernel methods.
· On scalability, we plan to implement a parallel version of FRank that can handle even larger training datasets.

6. ACKNOWLEDGMENTS

Research of this paper was partially supported by the National Science Council, Taiwan, under contract NSC95-2752-E-001-001-PAE.

7. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19-26, 2006.
[2] N. Birrell and P. Davies. Quantum Fields in Curved Space. Cambridge University Press, New York, 1982.
[3] A. Bookstein. Outline of a general probabilistic retrieval model. Journal of Documentation, 39(2):63-72, 1983.
[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[5] Y. Cao, J. Xu, T. Liu, H. Li, Y. Huang, and H. Hon. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186-193, 2006.
[6] K. Crammer and Y. Singer. Pranking with ranking. Advances in Neural Information Processing Systems, 14:641-647, 2002.
[7] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 Web Track. In Proceedings of TREC 2003, 2003.
[8] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(6):933-969, 2004.
[9] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-48, 2000.
[10] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422-446, 2002.
[11] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1999.
[12] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133-142, 2002.
[13] R. Nallapati. Discriminative models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64-71, 2004.
[14] M. Nielsen and I. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[16] T. Qin, T. Liu, X. Zhang, Z. Chen, and W. Ma. A study of relevance propagation for web search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 408-415, 2005.
[17] S. Robertson. The probability ranking principle in IR. Journal of Documentation, 33(4):294-304, 1977.
[18] S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232-241, 1994.
[19] R. Song, J. Wen, S. Shi, G. Xin, T. Liu, T. Qin, X. Zheng, J. Zhang, G. Xue, and W. Ma. Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), 2004.
[20] G. Xue, Q. Yang, H. Zeng, Y. Yu, and Z. Chen. Exploiting the hierarchical structure for link analysis. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186-193, 2005.