High Accuracy Retrieval with Multiple Nested Ranker

Irina Matveeva, University of Chicago, 5801 S. Ellis Ave, Chicago, IL 60637, matveeva@uchicago.edu
Chris Burges, Microsoft Research, One Microsoft Way, Redmond, WA 98052, cburges@microsoft.com
Timo Burkard, MSN Search, One Microsoft Way, Redmond, WA 98052, tburkard@microsoft.com
Andy Laucius, MSN Search, One Microsoft Way, Redmond, WA 98052, andrewla@microsoft.com
Leon Wong, MSN Search, One Microsoft Way, Redmond, WA 98052, leonw@microsoft.com

(Work performed while visiting Microsoft Research.)

ABSTRACT

High precision at the top ranks has become a new focus of research in information retrieval. This paper presents the multiple nested ranker approach, which improves accuracy at the top ranks by iteratively re-ranking the top scoring documents. At each iteration, this approach uses the RankNet learning algorithm to re-rank a subset of the results. This splits the problem into smaller and easier tasks and generates a new distribution of the results to be learned by the algorithm. We evaluate this approach using different settings on a data set labeled with several degrees of relevance. We use the normalized discounted cumulative gain (NDCG) to measure performance because it depends not only on the position but also on the relevance score of the document in the ranked list. Our experiments show that making the learning algorithm concentrate on the top scoring results improves precision at the top ten documents in terms of the NDCG score.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation, Performance

Keywords: Ad-hoc retrieval, high accuracy retrieval, re-ranking

1. INTRODUCTION

Traditionally, the goal of ad-hoc information retrieval was to achieve good performance in terms of both precision and recall. Recently, the focus has shifted to high precision at the top of the results list. With the growing size of Web collections, users are now primarily interested in high accuracy, defined as high precision at the top ranks [16, 13, 10, 18]. User studies showed that users typically look at very few results, mostly at the very top of the list returned by the search engine; see [9, 11, 8]. Recognizing this trend, TREC introduced the High Accuracy Retrieval from Documents (HARD) track, which includes user-specific information to improve retrieval accuracy [6, 7]. Jarvelin et al. proposed to base the evaluation of IR methods on the retrieval of highly relevant documents [10] and presented the normalized discounted cumulative gain (NDCG) as a new measure. NDCG was then applied to analyse the TREC web track results [18]. More recently, Shah et al. [16] integrated some question answering techniques into ad-hoc retrieval to improve precision at the top of the results list. They also addressed the issue of performance evaluation in terms of precision only and used the mean reciprocal rank (MRR) as a performance measure.

We pose the problem of achieving high accuracy as learning to re-rank the results at the top of the results list. We propose the multiple nested ranker approach, which is applied to the list of results returned by the search engine. This approach re-ranks the documents on the results list in stages, at each stage applying the RankNet [2] algorithm to learn a new ranking. Typically, ranking methods are applied to the full set of per-query results. Even when the ranked list is generated iteratively, for example when using relevance feedback or meta-data, at each iteration the retrieval algorithm is applied to the full set of documents. Boosting [3], and in particular RankBoost [4], also performs learning in stages.
But it uses the whole training set as input at each stage, with more weight put on difficult examples. It is, however, a very difficult task to learn how to rank a very large number of documents for any possible query. We adapt the problem of learning the ranking of the retrieved results to the high accuracy task in the following way. We start with the assumption that the results list returned by the search engine already provides a sufficiently good ranking, so that some relevant documents are placed somewhere near the top of the ranked list. Given the success of many retrieval systems and commercial search engines, this assumption seems very reasonable; see for example [8]. Since we are interested in high accuracy as opposed to recall, we concentrate on improving the ranks of the relevant documents at the top of the list.

The multiple nested ranker performs re-ranking of the results in stages, at each stage generating a new distribution of the results. The training set for each subsequent stage is pruned to include only the results that are ranked high by the previous ranker. We will refer to this pruning procedure as telescoping. Telescoping splits the problem into smaller and, hopefully, easier sub-tasks, so that the ranking for each stage is learned separately. At the last stage, only a few (e.g. 10) documents are re-ranked to make sure that the most relevant among them are placed at the top of the list.

Since in real life the relevance assignment is often not binary, but reflects the degree of relevance of each of the results [10, 18], we evaluate the performance of our approach using the normalized discounted cumulative gain (NDCG) [10]. The NDCG score depends not only on the position but also on the relevance score of the document in the ranked list.

The rest of the paper is organized as follows. Section 2 describes the multiple nested ranker algorithm and outlines the RankNet algorithm. Section 3 describes the NDCG measure, and Section 4 contains the details about the data set. Sections 5 and 6 describe our experiments, and Section 7 contains the analysis of the experimental results. Section 8 discusses related approaches, and we conclude with Section 9.
2. MULTIPLE NESTED RANKER

We propose to use the multiple nested ranker as the second part of the retrieval process. A search engine retrieves documents in response to the query and ranks them using some ranking algorithm, which we refer to as the "basic ranker". We make the assumption that the basic ranker already produces a good ranking and that a number of relevant documents are placed somewhere near the top of the ranked list. Telescoping is applied to the first few thousand results returned by the search engine to learn a better ranking for the relevant results.

The multiple nested ranker algorithm has two components: the telescoping procedure and a re-ranking algorithm. The re-ranking algorithm learns how to score documents so that the relevant documents receive higher scores. It uses the set of training queries Q = (q_1, ..., q_{|Q|}) to learn the scoring function. For each query q_i we have a set of documents that were ranked among the top N_1 results by the basic ranker used in the search engine, D_i = (d_{i1}, ..., d_{iN_1}). Some of these documents have manually assigned relevance labels; the rest are unlabeled. The training set for the re-ranking algorithm contains all documents returned for the training queries, D = (D_1, ..., D_{|Q|}).

The multiple nested ranker approach uses the RankNet algorithm [2], discussed below. RankNet learns a neural net that assigns scores to documents. One net is learned for all training documents. The sets of documents corresponding to individual queries are sorted by the net output to produce their ranking.

In the training phase, telescoping determines the subset of the data used to train the RankNet algorithm. Figure 1 illustrates how telescoping is applied to the results set for each query. At each stage the RankNet is presented with a new distribution of the per-query results, containing subsets of the highly ranked documents. At the first stage the RankNet is trained on the whole set of the top N_1 per-query results; in our experiments we used N_1 = 2500 documents per query. The training procedure computes the first net, Net_1. We sort each set of documents D_i by decreasing score according to Net_1. After that, the training set is modified so that only the top N_2 documents that receive the highest scores according to Net_1 remain for each query, i.e. D_i = (d_{i1}, ..., d_{iN_2}), and the next training set is D = (D_1, ..., D_{|Q|}). At the second stage the RankNet is trained on these sets of top N_2 documents. The second stage produces Net_2, and only the N_3 top scoring documents per query are kept for the next training set.

Figure 1: Training procedure for the multiple nested ranker. Net_1 is trained on the sets of 2500 documents per query, Net_2 on the sets of the top 1000 documents per query, Net_3 on the top 100 documents, and Net_4 on the top 10 documents.

Telescoping is also applied in the test phase. The re-ranking is done using the same number of stages as during training. At the first stage Net_1 is applied to all N_1 = 2500 documents per test query. Then Net_2 is applied to the top N_2 documents that receive the highest scores according to Net_1, and so on. This amounts to fixing, after the first stage, the Net_1 ranks of the documents below the top N_2, re-ranking the top N_2 documents with Net_2, then fixing the ranks of the documents below the top N_3 after the second stage, re-ranking the top N_3 results with Net_3, and so on. Thus, after each telescoping stage we have a ranked list for all N_1 results per query, which we use for the evaluation, as can be seen in Figure 2.

Figure 2: Re-ranking procedure for the multiple nested ranker. Net_1 is applied to the sets of 2500 documents per query, Net_2 to the sets of the top 1000 documents per query, Net_3 to the top 100 documents, and Net_4 to the top 10 documents.

We used four stages with N_1 = 2500, N_2 = 1000, N_3 = 100, N_4 = 10, and also three stages with N_1 = 2500, N_2 = 100, N_3 = 10. The same telescoping procedure was applied to the validation set.
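To make the telescoping bookkeeping concrete, the sketch below shows one way the training and re-ranking phases just described could be organized. It is only an illustration: `train_ranknet` and `net.score` are placeholder names for the RankNet training and scoring routines outlined in Section 2.1, and the stage sizes correspond to our four-stage setting.

```python
# Hypothetical sketch of the telescoping procedure; train_ranknet() and
# net.score() stand in for the RankNet routines described in Section 2.1.

STAGE_SIZES = [2500, 1000, 100, 10]  # N1, N2, N3, N4

def train_nested_ranker(train_queries, stage_sizes=STAGE_SIZES):
    """train_queries maps a query id to its list of candidate documents,
    already truncated to the top N1 results of the basic ranker."""
    nets = []
    current = {q: docs[:stage_sizes[0]] for q, docs in train_queries.items()}
    for next_size in stage_sizes[1:] + [None]:
        net = train_ranknet(current)          # learn a net on the current distribution
        nets.append(net)
        if next_size is None:
            break
        # Telescoping: keep only the top-scoring documents per query.
        current = {
            q: sorted(docs, key=net.score, reverse=True)[:next_size]
            for q, docs in current.items()
        }
    return nets

def rerank(nets, docs, stage_sizes=STAGE_SIZES):
    """Re-rank the documents of one test query.  After each stage the documents
    below the cut-off keep their ranks; only the top block is re-ranked."""
    ranked = sorted(docs[:stage_sizes[0]], key=nets[0].score, reverse=True)
    for net, size in zip(nets[1:], stage_sizes[1:]):
        head = sorted(ranked[:size], key=net.score, reverse=True)
        ranked = head + ranked[size:]         # ranks below the cut stay fixed
    return ranked
```

Note that in the re-ranking phase only the head of the list is re-sorted at each stage; documents below the cut keep the ranks assigned by the previous net, which is what makes the evaluation over all N_1 results possible.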
As opposed to boosting, this approach splits the problem into smaller pieces so that each net has a smaller and simpler task. Telescoping removes presumably difficult relevant documents at the bottom of the ranked list from the training set and forces the algorithm to concentrate on the ranking of the high scoring relevant documents. In addition, as we decrease the size of the training set roughly exponentially, more sophisticated algorithms can be used to learn the ranking at later stages.

2.1 RankNet

For completeness, we provide a brief overview of the RankNet algorithm [2] to give the reader some intuition about the learning process. We omit the details because this algorithm is used as a black box within the multiple nested ranker. At each stage of the multiple nested ranker the RankNet algorithm learns how to rank the results so that the relevant documents appear at the top of the list. To achieve this, RankNet tries to learn the correct ordering of pairs of documents in the ranked lists of individual queries. The cost function of the RankNet algorithm depends on the difference of the outputs of pairs of consecutive training samples (x_1, x_2). The cost is minimized when the document x_1 with a higher relevance label receives a higher score, i.e. when f(x_1) > f(x_2).

Burges et al. [2] propose to learn ranking using a probabilistic cost function based on pairs of examples. They consider models where the learning algorithm is given a set of pairs of samples [A, B] in R^d together with the target probabilities \bar{P}_{AB} that sample A is to be ranked higher than sample B. With models of the form f : R^d \to R, the rank order of a set of examples is specified by the real values taken by f. More specifically, it is assumed that f(x_i) > f(x_j) means that the model ranks x_i higher than x_j. With the modeled posterior probabilities P_{ij} \equiv Prob("x_i is ranked higher than x_j") and their target probabilities \bar{P}_{ij}, Burges et al. [2] develop their framework using the cross entropy cost function

c(o_{ij}) = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij}),

where o_{ij} \equiv f(x_i) - f(x_j). The map from the outputs to probabilities is modeled using a logistic function [1],

P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}},

so that the final cost becomes

c(o_{ij}) = -\bar{P}_{ij} o_{ij} + \log(1 + e^{o_{ij}}).

The above cost function is very general. The RankNet algorithm uses it with neural network models [12] to learn the ranking. Burges et al. [2] use a two-layer net with a single output node. As a reminder, the net output for the ith sample is described using the transfer function g^l of the nodes in the lth layer, the weights w^{32}_j and w^{21}_{jk} on the connections between the nodes in different layers, and the corresponding offsets b^2_j and b^3; the upper indices index the node layers and the lower indices index the nodes within each layer. The output function of a two-layer net with one output node for the ith sample, f_i, is

f_i \equiv f(x_i) = g^3\Big( \sum_j w^{32}_j \, g^2\Big( \sum_k w^{21}_{jk} x_k + b^2_j \Big) + b^3 \Big).

The parameters \alpha_k of the neural net model are updated depending on their contribution to the cost function, measured by the derivative \partial c / \partial \alpha_k. The parameter value is updated using a positive learning rate \eta_k as

\alpha_k \rightarrow \alpha_k + \delta\alpha_k = \alpha_k - \eta_k \frac{\partial c}{\partial \alpha_k}.

Burges et al. [2] generalize the above derivations to the ranking problem in the following way. The cost function becomes a function of the difference of the outputs of two consecutive training samples, c(f_1 - f_2), assuming that the first sample has a higher or the same rank as the second sample. The gradient of the cost becomes

\frac{\partial c}{\partial \alpha_k} = \Big( \frac{\partial f_1}{\partial \alpha_k} - \frac{\partial f_2}{\partial \alpha_k} \Big) c',

where c' \equiv c'(f_1 - f_2) and the subscripts denote the index of the training sample. All other derivatives also take the form of the difference of a term depending on x_1 and a term depending on x_2, coupled by an overall multiplicative factor c' which depends on both.
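As a small illustration of the pair-wise cost just described (our own helper names, not code from [2]), the following computes the final cost c(o_ij) and its derivative with respect to o_ij; this derivative is the factor c' that multiplies the difference of the two samples' gradients.

```python
import math

def ranknet_pair_cost(f_i, f_j, p_target):
    """Cross-entropy cost for one document pair, following the formulas above.
    f_i, f_j are model outputs; p_target is the target probability that
    document i should be ranked higher than document j."""
    o_ij = f_i - f_j
    # c(o_ij) = -P̄_ij * o_ij + log(1 + exp(o_ij))
    return -p_target * o_ij + math.log1p(math.exp(o_ij))

def ranknet_pair_cost_grad(f_i, f_j, p_target):
    """Derivative c'(o_ij) of the pair cost with respect to o_ij = f_i - f_j.
    This is the multiplicative factor that couples the gradients of the two
    samples: dc/dα_k = (df_i/dα_k - df_j/dα_k) * c'."""
    o_ij = f_i - f_j
    p_model = math.exp(o_ij) / (1.0 + math.exp(o_ij))   # logistic map P_ij
    return p_model - p_target
```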
3. EVALUATION

3.1 NDCG score

We use the normalized discounted cumulative gain measure (NDCG) [10], averaged over the queries, to evaluate the performance of the multiple nested ranker algorithm. We chose this performance measure because it incorporates multiple relevance judgements and depends not only on the position but also on the relevance score of the document in the ranked list. Jarvelin et al. [10] showed that NDCG gives more credit to systems with high precision at top ranks than other evaluation measures.

Our data was labeled using 5 degrees of relevance, ranging from "excellent match" to "poor match". To compute the NDCG score, we map the relevance levels to numerical values, with 4 corresponding to the highest level of relevance and 0 to the lowest. Unlabeled documents were given rating 0. Jarvelin et al. [10] used a similar map, with labels ranging from 3 to 0. The labels can be seen as weights, or information gain, for the user [18]. The difference in gain values assigned to highly relevant and relevant documents changes the NDCG score: larger ratios put more weight on precision with respect to the highly relevant documents, see [18]. We used a relatively small gain ratio, which was sufficiently discriminative in our experiments.

The NDCG score is computed for the sorted list of results for the ith query q_i as follows:

NDCG_{q_i} = N_i \sum_{j=1}^{k} \frac{2^{label(j)} - 1}{\log_b(j + 1)},

where N_i is a normalization constant chosen so that a perfect ordering of the results for the query q_i receives a score of one, and label(j) is the gain value associated with the label of the document at the jth position of the ranked list. The sum computes the cumulative information gain to the user from the already inspected documents. The term \log_b(j + 1) is a discounting function that reduces the document's gain value as its rank increases; the base of the logarithm, b, controls the amount of the reduction. We used b = 2 in our experiments, which corresponds to a sharper discount. Unlabeled documents affect the NDCG score by changing the ranks of the labeled documents. Since some unlabeled documents may be very relevant, NDCG_{q_i} = 1 is hard to achieve even for a good ranker. We computed NDCG at the top k = 10 documents, since this is the number of results usually viewed by users. The NDCG score is computed for each query and then averaged.
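A minimal sketch of the NDCG computation defined by the formula above, with k = 10 and b = 2 as in our experiments; the function name and input convention (a list of gain labels in ranked order, with unlabeled documents already mapped to 0) are ours.

```python
import math

def ndcg_at_k(labels, k=10, b=2):
    """NDCG for one query.  `labels` are the gain labels (0-4, unlabeled
    documents treated as 0) of the returned documents in ranked order."""
    def dcg(ls):
        # position j is 1-indexed in the formula, hence log_b(j + 1) = log_b(idx + 2)
        return sum((2 ** l - 1) / math.log(j + 2, b) for j, l in enumerate(ls[:k]))
    ideal = dcg(sorted(labels, reverse=True))   # perfect ordering defines N_i
    return dcg(labels) / ideal if ideal > 0 else 0.0
```

For the perfect ranking discussed next, ndcg_at_k([4, 4, 3, 3, 2, 2, 2, 1, 1, 1]) returns 1.0.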
To obtain an intuition about how the NDCG score changes, consider the perfect ranking R = [4,4,3,3,2,2,2,1,1,1]; in this case, NDCG(R) = 1. When we swap the first result with each of the other positions in turn, we obtain the NDCG scores plotted in Figure 3. For example, swapping the first label 4 with the label 2 at position five gives a two percentage point decrease in the NDCG score.

Figure 3: The change in the NDCG score when swapping the first result in the perfect ranking with each of the other ranks.

4. DATA

In our experiments we used several thousand English queries from August 2005 provided by an Internet search engine. We had 26,744 queries, each with up to 2,500 returned documents. These documents are the top 2,500 results per query as produced by the basic ranking algorithm. The document vectors have query-dependent features extracted from the query and four document sources: the anchor text, the URL, the document title and the body of the text. They also have some query-independent features. The document vectors had around 400 features, many of which were weighted counts.

For each query there is a number of manually labeled results. As mentioned before, five degrees of relevance were used for labeling this data set, ranging from 4 (meaning "excellent match") to 0 (meaning "poor match"). Unlabeled documents were given label -1. Since the labels were originally produced for evaluating and comparing top ranked documents, some documents with label 0 are in fact quite relevant.

Burges et al. [2] found that adding randomly chosen unlabeled documents to the training set as additional examples of low relevance documents helps to improve performance. We took a similar approach in our experiments. At each stage of telescoping, we sampled the current training sets for individual queries in the following way: we used all labeled documents and added at random a certain number n of unlabeled results for the same query, where n is a multiple of the total number of labeled results; we used n = {1, 3, 10}. The unlabeled documents were rated 0 during training.

The number of unlabeled examples included in the training set had a noticeable impact on performance. We tried several values of the multiplicative factor n, ranging from 0 to 10. The performance with no unlabeled examples was not satisfactory; the other values gave similar performance. In all experiments described here we included in the training set three times as many unlabeled examples per query as there are labeled ones.

To speed up training, we also sampled the original validation set by keeping all labeled documents and adding 1,000 unlabeled documents per query at random. We did not change the distribution of the documents in the test set and used all labeled and unlabeled documents available for a given query.

Figure 4: NDCG score vs. pair-wise accuracy on the validation set for 30 training epochs at the first stage of telescoping.

Figure 5: The distribution of labels over ranks for one query between the training epochs 1 and 2.

Table 1: First data set. Number of queries and number of unlabeled results per query used for training, validation and testing. We used n = {1, 3, 10}.

              #Queries    #Unlabeled per query
Training      3,514       #Labeled * n
Validation    691         1,000
Test          688         2,500
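The per-query sampling of training documents described in this section could be sketched as follows; the data layout (lists of (document, label) pairs) and the function name are illustrative assumptions, not the actual pipeline.

```python
import random

def sample_training_documents(labeled, unlabeled, n=3, seed=None):
    """Per-query training sample described above: keep all labeled documents and
    add n times as many randomly chosen unlabeled documents, which are then
    treated as label 0 during training."""
    rng = random.Random(seed)
    k = min(n * len(labeled), len(unlabeled))
    sampled = rng.sample(unlabeled, k)
    # (document, gain) pairs; unlabeled documents are rated 0 for training
    return [(d, lbl) for d, lbl in labeled] + [(d, 0) for d in sampled]
```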
5. EXPERIMENTS ON A SUBSET OF THE DATA

5.1 Small Data Set

To validate our approach, we used a subset of the data in the first set of experiments, see Table 1.

5.2 Effect of the RankNet Cost Function

As outlined in Section 2.1, the cost function of the RankNet algorithm depends on the difference of the outputs of two consecutive training samples. Because of this form of the cost function, the RankNet algorithm tries to learn the correct pair-wise ordering of the documents regardless of their position in the ranked list. It is therefore possible that during training the net improves the pair-wise error by significantly moving up documents that are at the bottom of the list, even at the price of slightly moving down some of the relevant results at the top of the list. Telescoping is designed to alleviate this problem by removing the difficult documents at the low ranks and making the ranker concentrate on the top results.

First, we needed to verify this assumption. Averaged over all queries, the pair-wise error and the NDCG score are very well correlated. Figure 4 shows the pair-wise accuracy and the NDCG score on the validation set, averaged over the queries, after each of the training iterations at the first telescoping stage. In this example, the correlation coefficient is 0.946. However, there are cases where their changes are anti-correlated.
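To illustrate how pair-wise accuracy and NDCG at the top 10 can move in opposite directions, the following toy example (invented numbers, not data from our experiments) compares two rankings of 1000 documents: in the first, a label-1 document sits below a long block of label-0 documents; in the second, it has been moved above that block at the price of pushing the label-4 document down three ranks.

```python
# Toy illustration only: pair-wise accuracy and NDCG@10 can be anti-correlated.
import math
from itertools import combinations

def ndcg_at_10(labels):
    dcg = lambda ls: sum((2 ** l - 1) / math.log2(j + 2) for j, l in enumerate(ls[:10]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0

def pairwise_accuracy(labels):
    pairs = [(a, b) for a, b in combinations(labels, 2) if a != b]
    return sum(a > b for a, b in pairs) / len(pairs)

# Ranking 1: a label-1 document sits far below many label-0 documents.
r1 = [4, 3, 2] + [0] * 996 + [1]
# Ranking 2: the label-1 document is moved above the label-0 block,
# at the price of pushing the label-4 document down a few ranks.
r2 = [3, 2, 1, 4] + [0] * 996

print(pairwise_accuracy(r1), ndcg_at_10(r1))   # lower pair-wise accuracy, higher NDCG@10
print(pairwise_accuracy(r2), ndcg_at_10(r2))   # higher pair-wise accuracy, lower NDCG@10
```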
For single queries this effect can be quite striking, as illustrated in Figure 5, which shows the distribution of labels for a particular query over the ranks after the first and the second epochs of training. As before, label "4" stands for "excellent match", label "0" means "poor match", and label "-1" is used for unlabeled documents. Figure 5 clearly shows how some documents with label "1", which are a poor match, improve their ranks by over 1000 positions in the ranked list, moving from ranks between 2000 and 2500 to ranks between 500 and 1000. At the same time, the documents with labels "4", "3" and "2" that were at the top positions after the first epoch of training are moved down a few ranks after the second epoch. Thousands of pair-wise errors incurred by placing the documents with label "1" at positions 2000-2500, lower than many documents with label "0", are repaired, and only a few new errors are introduced by shifting the highly relevant documents to lower ranks. Therefore, the pair-wise accuracy increases. However, only the positions of these latter documents are important for the NDCG score, as well as for the user.

5.3 Results for the Small Data Set

First, we used 4 telescoping stages with the numbers of top results for telescoping being N_1 = 2500, N_2 = 1000, N_3 = 100, N_4 = 10. We included 3 unlabeled examples for each labeled result in the training set at each stage. Previous results with the RankNet algorithm showed that a two-layer net outperforms RankNet with a linear net and other related approaches on this task [2]. At the first stage of telescoping, the data sets are not changed, and the net is computed and used on all 2,500 results per query. Therefore, we use the performance of a two-layer net at the first stage as our baseline. We tried different numbers of hidden nodes (nH). As shown in Table 2, on this data set a two-layer net with 4 hidden nodes had the best performance on the validation and the test set after the first stage. Thus, for this data set our baseline is the average NDCG score of 0.451.

Table 2: First data set. Average NDCG score at the top 10 results for a linear net (nH = 0) and a two-layer net with different numbers of hidden nodes nH. Telescoping with 4 stages.

St  nH  av. NDCG            St  nH  av. NDCG
1   0   0.445 (± 0.023)     1   8   0.436 (± 0.022)
2   0   0.460 (± 0.022)     2   8   0.452 (± 0.022)
3   0   0.470 (± 0.022)     3   8   0.470 (± 0.022)
4   0   0.473 (± 0.022)     4   8   0.468 (± 0.022)
1   2   0.450 (± 0.022)     1   16  0.436 (± 0.023)
2   2   0.455 (± 0.022)     2   16  0.456 (± 0.022)
3   2   0.444 (± 0.023)     3   16  0.466 (± 0.022)
4   2   0.454 (± 0.023)     4   16  0.472 (± 0.022)
1   4   0.451 (± 0.022)     1   32  0.436 (± 0.022)
2   4   0.449 (± 0.022)     2   32  0.443 (± 0.022)
3   4   0.473 (± 0.022)     3   32  0.463 (± 0.022)
4   4   0.469 (± 0.022)     4   32  0.473 (± 0.022)

Using telescoping with a linear net improves the average NDCG score by over 2 percentage points, from 0.445 to 0.473. The multiple nested ranker with linear nets outperforms the baseline and also achieves the same or better performance than the multiple nested ranker with two-layer nets for the numbers of hidden nodes that we tried. Table 2 shows that the multiple nested ranker approach improves the performance for almost all neural net parameters that we tried. The two-layer net with two hidden nodes had the worst performance; however, this net performed similarly to the net with four hidden nodes when we used a different proportion of unlabeled results in the training set. The linear net and the two-layer nets with nH = {4, 8, 16, 32} hidden nodes achieved an over 2 percentage point increase in the NDCG score between the first and the last stages of telescoping. The largest increase was for nH = {16, 32}; however, the NDCG of these nets at the first stage of telescoping was lower than for the linear net. There was an insignificant decrease in the NDCG score between the third and the last stages of telescoping for nH = {4, 8}. These nets achieve an improvement of the NDCG score compared to the first stage of telescoping, showing that our approach is not sensitive to the choice of the net parameters.

Figure 6 illustrates the effect of telescoping on the distribution of relevant documents at the top ten positions in the ranked list. It shows the difference in the average numbers of results with a particular label between the subsequent stages of telescoping. The actual numerical values of the difference are small because for most queries there are only a few relevant results, and not all queries have results with the top 2 levels of relevance. There is no difference between the third and fourth stages because these numbers are computed for the top 10 results at the third stage, which become the training set for the last stage; therefore the last step is omitted in the figure.

Figure 6: Difference in the average numbers of results with a particular label in the top ten results between the first and the second (Diff(1000, 2500)) and between the second and the third (Diff(100, 1000)) stages.

Compared to the first stage, the number of unlabeled examples at the top ten ranks at the second stage decreased and the number of labeled relevant examples increased. Unlabeled examples and low relevance examples with label 0 do not contribute to the NDCG score directly, but they affect the ranks and thus the contribution of the other labeled examples. At the third stage, the number of unlabeled results decreased, and the largest increase was for the results with label 0, corresponding to the label "poor match".
However, since the number of relevant labeled examples increased as well, the overall NDCG score improved. During training, unlabeled examples are used as examples with label zero. It is therefore interesting to see that the RankNet prefers to improve the ranks of documents with the label "poor match" but not the ranks of unlabeled documents. This may be attributed to the aforementioned property of the labeling: many of the results with label 0 that are placed among the first 1000 by the first net and among the first 100 by the second net may in fact be quite relevant, whereas the randomly chosen unlabeled documents are probably not relevant.

Figure 7 shows the average number of results with the two highest levels of relevance at each of the top ten positions in the ranked list. We plotted the results for the test set and for all stages of telescoping; the right-most columns correspond to the last stage of telescoping. In five out of ten cases, the top-10 stage has the most documents labeled "excellent match" or "good match". In seven out of ten cases, there are more results with label "excellent match" or "good match" in the final ranking than at the first stage, which is reflected in the higher NDCG score.

Figure 7: Average number of results with label "excellent" or "good" match at the top ten ranks.

To see whether the multiple nested ranker is sensitive to the initial parameters of the neural net training, we ran this experiment over 11 random reshufflings of the training set with the linear net. Table 3 shows the NDCG scores on the test set averaged over the 11 runs. It can be seen that the performance is very stable.

Table 3: NDCG scores averaged over 11 random reshufflings of the training set with the linear net.

Stage  av. NDCG          Stage  av. NDCG
1      0.444 (±0.002)    3      0.470 (±0.001)
2      0.459 (±0.001)    4      0.473 (±0.002)

5.4 Importance of Individual Stages

Since the number of telescoping stages that we used in the previous set of experiments was rather arbitrary, we investigated the contribution of each stage individually. As shown in Figure 7, the distribution of the relevant documents at the top of the ranked list after the second stage improves relative to the first stage for most of the ranks. This means that the first net separates the relevant and irrelevant documents quite well. It was, however, interesting to see whether the first net can already produce a good enough separation of relevant documents so that only the top 100 need to be re-ranked. We therefore used telescoping with fewer stages, omitting the second stage with the 1000 top ranked results as the training set.

Table 4: First data set. Average NDCG score at the top 10 results for a linear net (nH = 0) and a two-layer net with different numbers of hidden nodes nH. Telescoping with 3 stages.

St  nH  av. NDCG          St  nH  av. NDCG
1   0   0.445 (±0.023)    1   8   0.436 (±0.022)
2   0   0.469 (±0.022)    2   8   0.468 (±0.022)
3   0   0.473 (±0.022)    3   8   0.466 (±0.022)
1   2   0.450 (±0.022)    1   16  0.436 (±0.023)
2   2   0.468 (±0.022)    2   16  0.453 (±0.022)
3   2   0.469 (±0.022)    3   16  0.464 (±0.022)
1   4   0.451 (±0.022)    1   32  0.436 (±0.022)
2   4   0.463 (±0.022)    2   32  0.470 (±0.022)
3   4   0.458 (±0.022)    3   32  0.471 (±0.022)
Table 4 shows the results. It appears that the first net is sufficient to place the relevant documents that can be learned efficiently at the top 100 positions of the ranked list. For all net parameters, the NDCG improvement from the 2500 stage directly to the 100 stage, as shown in Table 4, is comparable to the improvement between these two stages when the 1000 stage is used in between.

6. EXPERIMENTS ON THE FULL DATA SET

For our second data set we used the whole training set, a subset of the validation set and the full test set, see Table 5. Again, at each telescoping stage, we sampled the training set by keeping all the labeled documents and adding some unlabeled documents at random. We did not change the test set.

Table 5: Second data set. Number of queries and number of unlabeled results per query used for training, validation and testing. We used n = {1, 2, 3}.

              #Queries    #Unlabeled per query
Training      23,407      #Labeled * n
Validation    1,132       300
Test          2,205       2,500

Table 6 shows the NDCG scores at each stage of telescoping for this data set. Similar to our first experiments, the multiple nested ranker with a linear net achieves a significant improvement in the NDCG score, from 0.462 to 0.483. For this data set, the linear net outperformed all the two-layer nets that we tried. On this data set, the two-layer net improves the NDCG score between each stage of telescoping.

Table 6: Second data set. Average NDCG score at the top 10 results for a linear net (nH = 0) and a two-layer net with different numbers of hidden nodes. Telescoping with 4 stages.

St  nH  av. NDCG            St  nH  av. NDCG
1   0   0.462 (± 0.013)     1   4   0.455 (± 0.013)
2   0   0.467 (± 0.013)     2   4   0.470 (± 0.013)
3   0   0.479 (± 0.013)     3   4   0.479 (± 0.013)
4   0   0.483 (± 0.013)     4   4   0.481 (± 0.013)
1   2   0.454 (± 0.013)     1   8   0.444 (± 0.013)
2   2   0.465 (± 0.013)     2   8   0.457 (± 0.013)
3   2   0.477 (± 0.013)     3   8   0.480 (± 0.013)
4   2   0.479 (± 0.013)     4   8   0.483 (± 0.013)

We repeated this experiment over 5 random reshufflings of the training set with the linear net. As in the previous case, the multiple nested ranker appears very robust to the initial setting, see Table 7.

Table 7: Second data set. NDCG scores on the test set, averaged over 5 random reshufflings of the training set for the linear net.

Stage  av. NDCG          Stage  av. NDCG
1      0.459 (±0.001)    3      0.479 (±0.001)
2      0.468 (±0.000)    4      0.481 (±0.001)

When we used fewer telescoping stages with a linear net and omitted the second stage with the top 1000 results, the improvement was very similar to the improvement achieved with four telescoping stages, from 0.462 to 0.480. The NDCG score after the first stage with the top 2500 results was 0.462; the NDCG score after the second stage with the top 100 results was 0.467, and the NDCG score after the last stage with the top 10 results was 0.480.

7. ANALYSIS

The experimental results presented here showed that telescoping improves the precision at the top ranks robustly over a number of settings. The improvement on the large data set was similar to the improvement on the small data set. The exact number of telescoping stages also did not appear to be crucial for the performance of our approach: the multiple nested ranker with three telescoping stages gave the same improvement as with four stages. The number of hidden nodes in the two-layer neural net that we used in our experiments is much more important for the performance. However, the relative improvement due to telescoping was over two percentage points for most parameters that we tried. The two exceptions were the net with two hidden nodes for the small data set and the net with four hidden nodes for the small data set with three telescoping stages. In our preliminary experiments, the net with two hidden nodes also showed a two point improvement in the NDCG score on the small data set when we included more unlabeled examples for each training example.

The role of the unlabeled examples needs further investigation. They are treated as labeled not relevant during the training phase.
However, the fact that they are placed among the top 100 and top 10 at the last stages of telescoping suggests that they may also be relevant.

Since the training set is pruned after each stage, it is possible that some of the relevant documents are excluded from the following re-ranking. This is not a problem when the major focus is high accuracy: the ranks of those documents remain fixed at each of the following stages, and since all of them were below rank 10, they do not contribute to the NDCG score at any of the stages. In our experiments, the fraction of the relevant documents that were placed at ranks below 1000 after the first stage and below 100 after the second stage was very small. This supports the claim that RankNet produces a good ranking at every stage by placing relevant documents near the top of the results list. The multiple nested ranker approach refines their ranking by improving the ranks of the highly relevant documents.

8. RELATED APPROACHES

The high scoring documents from the top positions of the results list have been used extensively in variants of relevance feedback to expand the query and compute new weights for the query terms [15, 14]. Although these approaches also perform ranking in stages, at each stage the retrieval algorithm is often applied to the whole document collection. He et al. [5] used user-specific information for re-ranking the top n documents: a document's score was changed by some predefined factor on the basis of the genre or domain preference provided by the user. Xiao et al. [19] re-rank the top n results using an additional similarity score computed from the query and the document title.

Boosting [3] performs learning in stages. A new distribution of the data points is generated at each stage depending on the performance of the weak learner at the previous stage, and boosting puts more weight on learning the examples which were difficult for the previous learner. The aim of telescoping is, on the contrary, to exclude such difficult data points, i.e. low scoring high relevance documents, and to concentrate the learning on the top scoring results. In addition, boosting does not control the size of the training set for the subsequent learner. As we reduce the training set size roughly exponentially, more sophisticated rankers can be used at later stages.

The work of Shen et al. [17] is one of the approaches most closely related to our re-ranking. They use a variant of the perceptron learning algorithm to learn new ranks for the top scoring results. One variant of their algorithm learns to separate the top r scoring results from the rest. This algorithm could be extended to work in stages similar to telescoping by separating the top r = {2500, 1000, 100, 10} results. However, this approach would not exclude low ranking high relevance documents from the training set at later stages.

9. CONCLUSION AND FUTURE WORK

We presented the multiple nested ranker algorithm for efficient re-ranking of the high scoring results at the top of the ranked list, and applied this approach to real world data. Our experiments showed that at each telescoping stage the RankNet ranker learns the ranking for a new distribution of documents. The ranker concentrates on the relevant documents which are placed near the top of the ranked list. The improvement in the averaged NDCG score confirms that the new sub-problems are easier to learn, so that in the end a better ranking of the top few documents is computed.
The fact that the low scoring documents are removed from the training set at later stages of training can be viewed as an attempt to introduce information about the rank of the documents into the training procedure of the RankNet algorithm. The next step in the development of this approach will be to modify the RankNet algorithm to use this information directly during training.

10. REFERENCES

[1] E. B. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In Neural Information Processing Systems, pages 52-61, 1987.
[2] C. J. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of ICML, 2005.
[3] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[4] Y. Freund, R. Iyer, and R. Schapire. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
[5] D. He and D. Demner-Fushman. HARD experiment at Maryland: From need negotiation to automated HARD process. In Proceedings of TREC, 2003.
[6] A. James. HARD track overview in TREC 2003. In Proceedings of TREC, 2003.
[7] A. James. HARD track overview in TREC 2004. In Proceedings of TREC, 2004.
[8] B. J. Jansen and A. Spink. An analysis of web documents retrieved and viewed. In Proceedings of the International Conference on Internet Computing, pages 65-69, 2003.
[9] B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: A study of user queries on the web. In Proceedings of SIGIR, pages 5-17, 1998.
[10] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of SIGIR, pages 41-48, 2000.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of SIGIR, pages 154-161, 2005.
[12] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Mueller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-50. Springer, 1998.
[13] W. Lin, A. Hauptmann, and R. Jin. Web image retrieval reranking with a relevance model. In Proceedings of IEEE/WIC, 2003.
[14] J. Ponte. Language models for relevance feedback. In W. Croft, editor, Advances in Information Retrieval, pages 73-96, 2000.
[15] J. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice Hall, 1971.
[16] C. Shah and W. B. Croft. Evaluating high accuracy retrieval techniques. In Proceedings of SIGIR, pages 2-9, 2004.
[17] L. Shen and A. K. Joshi. Ranking and reranking with perceptron. Machine Learning, 60(1-3):73-96, 2005.
[18] E. M. Voorhees. Evaluation by highly relevant documents. In Proceedings of SIGIR, pages 74-82, 2001.
[19] Y. Xiao, R. Luk, K. Wong, and K. Kwok. Some experiments with blind feedback and re-ranking for Chinese information retrieval. In Proceedings of NTCIR, 2005.