Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem

Yisong Yue (yyue@cs.cornell.edu)
Thorsten Joachims (tj@cs.cornell.edu)
Department of Computer Science, Cornell University, Ithaca, NY 14853 USA

Abstract

We present an on-line learning framework tailored towards real-time learning from observed user behavior in search engines and other information retrieval systems. In particular, we only require pairwise comparisons, which were shown to be reliably inferred from implicit feedback (Joachims et al., 2007; Radlinski et al., 2008b). We present an algorithm with theoretical guarantees as well as simulation results.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

When responding to queries, the goal of an information retrieval system (ranging from web search, to desktop search, to call center support) is to return the results that maximize user utility. So, how can a retrieval system learn to provide results that maximize utility? The conventional approach is to optimize a proxy measure that is hoped to correlate with utility. A wide range of measures has been proposed to this effect (e.g., average precision, precision at k, NDCG), but all have similar problems. Most obviously, they require expensive manual relevance judgments that ignore the identity of the user and the user's context. This makes it unclear whether maximizing a proxy measure truly optimizes the search experience for the user.

We therefore take a different approach based on implicit feedback gathered directly from users. But how can a learning algorithm access the utility a user sees in a set of results? While it is unclear how to reliably derive cardinal utility values for a set of results (e.g., $U(r) = 5.6$), it has been shown that interactive experiments can reliably provide ordinal judgments between two sets of results (i.e., $U(r_1) > U(r_2)$) (Joachims et al., 2007; Radlinski et al., 2008b). For example, to elicit whether a user prefers ranking $r_1$ over $r_2$, Radlinski et al. (2008b) showed how to present an interleaved ranking of $r_1$ and $r_2$ so that clicks indicate which of the two has higher utility.

This leads to the following on-line learning problem addressed in this paper. Given a space of retrieval functions and a (noisy) pairwise test for comparing any two retrieval functions, we wish to find a sequence of comparisons that has low regret (i.e., we eventually find a close-to-optimal retrieval function and never show clearly bad results in the process). We call this the Dueling Bandits Problem, since only ordinal feedback is observable, not cardinal feedback as required by conventional bandit algorithms (e.g., for optimizing web advertising revenue).

In this paper, we formalize the Dueling Bandits Problem and an appropriate notion of regret. Furthermore, we propose a gradient-descent method which builds on methods for on-line convex optimization (Zinkevich, 2003; Kleinberg, 2004; Flaxman et al., 2005). The method is compatible with many existing classes of retrieval functions, and we provide theoretical regret bounds and an experimental evaluation.
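To make the click-based comparison concrete, here is a minimal Python sketch in the spirit of the interleaving test of Radlinski et al. (2008b). The round-robin interleaving scheme and the click-credit rule below are simplified illustrations under our own assumptions, not the exact method from that paper.

```python
import random

def interleave(ranking_a, ranking_b):
    """Merge two rankings, remembering which ranking ('team')
    contributed each displayed result."""
    docs = set(ranking_a) | set(ranking_b)
    shown, teams, seen = [], [], set()
    order = [(0, ranking_a), (1, ranking_b)]
    while len(seen) < len(docs):
        # Randomize pick order each round to neutralize position bias.
        random.shuffle(order)
        for team, src in order:
            doc = next((x for x in src if x not in seen), None)
            if doc is not None:
                shown.append(doc); teams.append(team); seen.add(doc)
    return shown, teams

def infer_preference(teams, clicked_positions):
    """Credit each click to the team that contributed the clicked result."""
    credit = [0, 0]
    for pos in clicked_positions:
        credit[teams[pos]] += 1
    if credit[0] == credit[1]:
        return None                    # tie: no preference inferred
    return 0 if credit[0] > credit[1] else 1

shown, teams = interleave(["d1", "d2", "d3"], ["d3", "d4", "d1"])
print(infer_preference(teams, clicked_positions=[0]))  # 0 or 1, run-dependent
```

Aggregated over many users and queries, such inferred preferences provide exactly the noisy pairwise test that the Dueling Bandits Problem assumes.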
2. Related Work

Most prior work on learning from implicit feedback takes an off-line approach. Usage logs (containing data such as clicks) are typically transformed into relevance judgments or integrated into the input features (e.g., Agichtein et al., 2006; Carterette & Jones, 2007; Dupret & Piwowarski, 2008). Such approaches are limited to passive learning from implicit feedback, since they cannot control the initial results presented to users, and thus must use biased training data.

Related on-line methods use absolute measures of individual retrieved results (Pandey et al., 2007; Langford & Zhang, 2007; Radlinski et al., 2008a). While theoretical analyses show good regret (as formulated using absolute measures), in many settings such regret formulations might not reflect real user satisfaction. For example, clicks are affected by presentation bias: users tend to click on higher-ranked results regardless of relevance (Joachims et al., 2007). Any objective based on absolute measures must therefore be carefully calibrated. In contrast, the interleaving method proposed by Radlinski et al. (2008b) offers a reliable mechanism for deriving relative preferences between retrieval functions.

3. The Dueling Bandits Problem

We define a new on-line optimization problem, called the Dueling Bandits Problem, where the only actions are comparisons (or duels) between two points within a space $W$ (e.g., a parameterized space of retrieval functions in a search engine). We consider the case where $W$ contains the origin, is compact, convex, and contained in a $d$-dimensional ball of radius $R$ (an alternative setting is the $K$-armed bandit case where $|W| = K$; see Yue et al., 2009). Any single comparison between two points $w$ and $w'$ (e.g., individual retrieval functions) is determined independently of all other comparisons, with

    P(w \succ w') = 1/2 + \epsilon(w, w'),    (1)

where $\epsilon(w, w') \in [-1/2, 1/2]$. In the search example, $P(w \succ w')$ refers to the fraction of users who prefer the results produced by $w$ over those of $w'$. One can regard $\epsilon(w, w')$ as the distinguishability between $w$ and $w'$. Algorithms learn only by observing comparison results (e.g., from interleaving (Radlinski et al., 2008b)).

We quantify the performance of an on-line algorithm using the following regret formulation:

    \Delta_T = \sum_{t=1}^{T} \left[ \epsilon(w^*, w_t) + \epsilon(w^*, w_t') \right],    (2)

where $w_t$ and $w_t'$ are the two points selected at time $t$, and $w^*$ is the best point, known only in hindsight. Note that the algorithm is allowed to select two identical points, so selecting $w_t = w_t' = w^*$ accumulates no additional regret. In the search example, regret corresponds to the fraction of users who would prefer the best retrieval function $w^*$ over the selected ones $w_t$ and $w_t'$. A good algorithm should achieve regret sublinear in $T$, which implies decreasing average regret.
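As a small illustration of the interaction model (1) and the bookkeeping in (2), the sketch below simulates duels against a fixed distinguishability function. The particular $\epsilon$, built from a quadratic value function on $W = [-1, 1]$, is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def duel(eps, w, w_prime):
    """One comparison: True iff w beats w_prime, which happens
    with probability P(w > w') = 1/2 + eps(w, w'), as in (1)."""
    return rng.random() < 0.5 + eps(w, w_prime)

def regret_increment(eps, w_star, w, w_prime):
    """Per-step regret from (2): eps(w*, w_t) + eps(w*, w_t')."""
    return eps(w_star, w) + eps(w_star, w_prime)

# Hypothetical example: eps derived from v(w) = -w^2 through a
# scaled linear link, so values stay inside [-1/2, 1/2] on [-1, 1].
v = lambda w: -w**2
eps = lambda w1, w2: 0.25 * (v(w1) - v(w2))
w_star = 0.0                          # the maximizer of v

print(duel(eps, 0.5, 0.6))                            # noisy outcome
print(regret_increment(eps, w_star, 0.5, 0.6))        # 0.1525
```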
3.1. Modeling Assumptions

We further assume the existence of a differentiable, strictly concave value (or utility) function $v : W \to \mathbb{R}$. This function reflects the intrinsic quality of each point in $W$ and is never directly observed. Since $v$ is strictly concave, there exists a unique maximum $v(w^*)$.

Probabilistic comparisons are made using a link function $\sigma : \mathbb{R} \to [0, 1]$ and are defined as

    P(w \succ w') = \sigma(v(w) - v(w')).

Thus $\epsilon(w, w') = \sigma(v(w) - v(w')) - 1/2$. Link functions behave like cumulative distribution functions (monotonically increasing, with $\sigma(-\infty) = 0$ and $\sigma(\infty) = 1$). We consider only link functions that are rotation-symmetric ($\sigma(x) = 1 - \sigma(-x)$) and have a single inflection point at $\sigma(0) = 1/2$. This implies that $\sigma(x)$ is convex for $x \leq 0$ and concave for $x \geq 0$. One common link function is the logistic function $\sigma_L(x) = 1/(1 + \exp(-x))$.

We finally make two smoothness assumptions. First, $\sigma$ is $L_\sigma$-Lipschitz and $v$ is $L_v$-Lipschitz; that is, $|\sigma(a) - \sigma(b)| \leq L_\sigma |a - b|$, and analogously for $v$. Thus $\epsilon(\cdot, \cdot)$ is $L$-Lipschitz in both arguments, where $L = L_\sigma L_v$; we further assume that $L_\sigma$ and $L_v$ are the smallest such constants. Second, $\sigma$ is second-order $L_2$-Lipschitz, that is, $|\sigma'(a) - \sigma'(b)| \leq L_2 |a - b|$. These relatively mild assumptions provide sufficient structure for showing sublinear regret.

4. Algorithm & Analysis

Our algorithm, Dueling Bandit Gradient Descent (DBGD), is described in Algorithm 1. DBGD maintains a candidate $w_t$ and compares it with a neighboring point $w_t'$ along a random direction $u_t$. If $w_t'$ wins the comparison, then an update is taken along $u_t$ and projected back into $W$ (the projection is denoted by $P_W$).

Algorithm 1 Dueling Bandit Gradient Descent
 1: Input: $\delta$, $\gamma$, $w_1$
 2: for query $q_t$ ($t = 1..T$) do
 3:   Sample unit vector $u_t$ uniformly.
 4:   $w_t' \leftarrow P_W(w_t + \delta u_t)$   // projected back into $W$
 5:   Compare $w_t$ and $w_t'$
 6:   if $w_t'$ wins then
 7:     $w_{t+1} \leftarrow P_W(w_t + \gamma u_t)$   // also projected
 8:   else
 9:     $w_{t+1} \leftarrow w_t$
10:   end if
11: end for

DBGD requires two parameters, which can be interpreted as the exploration ($\delta$) and exploitation ($\gamma$) step sizes. The latter is required by all gradient descent algorithms. Since DBGD probes for descent directions randomly, this introduces a gradient estimation error that depends on $\delta$ (discussed in Section 4.2). We will show in Theorem 2 that, for suitable $\delta$ and $\gamma$, DBGD achieves regret sublinear in $T$:

    E[\Delta_T] \leq 2\lambda_T T^{3/4} \sqrt{26RdL},

where $\lambda_T$ approaches 1 from above as $T$ increases. For example, when $T > 64R^2 d^2 L_v^4 L_2^4 / (13^2 L^2 L_\sigma^4)$, then $\lambda_T < 2$. Making an additional convexity assumption, described in Corollary 2 (the assumption currently lacks theoretical justification, but is observed empirically in many settings), yields a much simpler result:

    E[\Delta_T] \leq 2T^{3/4} \sqrt{10RdL}.
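A minimal sketch of Algorithm 1 follows, assuming $W$ is a Euclidean ball of radius $R$. Here `compare` is an application-supplied comparison oracle (e.g., an interleaving experiment) and is the only way the algorithm touches the objective; the helper names are our own.

```python
import numpy as np

def project_ball(w, radius):
    """P_W for W = {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def dbgd(compare, d, T, delta, gamma, radius, w1=None, seed=0):
    """Dueling Bandit Gradient Descent (Algorithm 1).

    compare(w, w_prime) -> True iff the probe w_prime wins the duel.
    delta, gamma: exploration and exploitation step sizes.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d) if w1 is None else np.asarray(w1, float)
    history = [w.copy()]
    for _ in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                    # uniform unit vector
        w_prime = project_ball(w + delta * u, radius)
        if compare(w, w_prime):                   # w_t' wins the duel
            w = project_ball(w + gamma * u, radius)
        history.append(w.copy())
    return w, history
```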
To analyze DBGD, we first define the relative loss

    \epsilon_t(w) \equiv \epsilon(w_t, w),    (3)

which is the distinguishability between $w_t$ and any other point. We will also define $\epsilon^*(w)$ as

    \epsilon^*(w) \equiv \epsilon(w^*, w).    (4)

The relative loss function is depicted pictorially in Figure 1 for the logistic link function and $v(w) = -w^2$.

[Figure 1. Example relative loss functions $\epsilon_t(w) \equiv \epsilon(w_t, w)$ using the logistic link function, $W \subset \mathbb{R}$, and value function $v(w) = -w^2$, for $w_t = -3, -2, -1$. Note that the functions are convex in the area around $w^* = 0$.]

Analysis Approach. Our analysis follows two conceptual phases. We first present basic results demonstrating the feasibility of performing gradient descent on the relative loss functions $\epsilon_t$ (3). These results include proving that $\epsilon_t$ is partially convex (a function $f : W \to \mathbb{R}$ is partially convex if there is a convex region with a non-empty interior, containing $w^*$, on which $f$ is convex), and showing how pairwise comparisons can yield good gradient estimates. We then build on existing results (Zinkevich, 2003; Flaxman et al., 2005) to show that DBGD minimizes our regret formulation (2). We begin by observing that $\epsilon_t$ is partially convex.

Observation 1. For link functions $\sigma(x)$ and value functions $v(w)$ satisfying the assumptions of Section 3.1, $\epsilon_t(w)$ is partially convex for $w_t \neq w^*$.

Proof. Define $W_t = \{w : v(w) \geq v(w_t)\}$, which has a non-empty interior for $w_t \neq w^*$. For $a, b \in W_t$ and $\alpha \in [0, 1]$ we know that $v(\alpha a + (1-\alpha)b) \geq \alpha v(a) + (1-\alpha)v(b)$, since $v$ is concave. We can then write $\epsilon_t(\alpha a + (1-\alpha)b)$ as

    = \sigma(v(w_t) - v(\alpha a + (1-\alpha)b)) - 1/2
    \leq \sigma(v(w_t) - \alpha v(a) - (1-\alpha)v(b)) - 1/2
    \leq \alpha\sigma(v(w_t) - v(a)) + (1-\alpha)\sigma(v(w_t) - v(b)) - 1/2
    = \alpha\epsilon_t(a) + (1-\alpha)\epsilon_t(b).

The first inequality follows from the monotonicity of $\sigma(x)$. The second inequality holds since $\sigma(x)$ is convex for $x \leq 0$ (which holds here because $a, b \in W_t$). Since $W_t$ is convex (due to the concavity of $v$), we conclude that $\epsilon_t$ is partially convex.

4.1. Estimating Gradients

We now elaborate on the update procedure used by DBGD. Flaxman et al. (2005) observed that

    \frac{\delta}{d}\nabla c_t(w_t) \approx E_u[c_t(w_t + \delta u)u],    (5)

where $\delta > 0$, $d$ denotes the dimensionality, and $u$ is a uniformly random unit vector. Let $X_t(w)$ denote the event of $w$ winning a comparison with $w_t$:

    X_t(w) = \begin{cases} 1 & \text{w.p. } 1 - P(w_t \succ w) \\ 0 & \text{w.p. } P(w_t \succ w) \end{cases}    (6)

We can model the update in DBGD (ignoring $\gamma$) as $X_t(P_W(w_t + \delta u_t))u_t$, which we now show, in expectation, matches the RHS of (5) (ignoring $d/\delta$) with an additional projection.

Lemma 1. Let $c_t(w) = P(w_t \succ w) = \epsilon_t(w) + 1/2$. Then for $\delta > 0$ and a uniformly random unit vector $u$,

    E_{X_t, u}[X_t(P_W(w_t + \delta u))u] = -E_u[c_t(P_W(w_t + \delta u))u].

Proof. Let $S$ denote the unit sphere. Then $E_{X_t,u}[X_t(P_W(w_t + \delta u))u]$ can be written as

    = E_u\big[ E_{X_t}[X_t(P_W(w_t + \delta u)) \mid u]\, u \big]
    = \int_S E_{X_t}[X_t(P_W(w_t + \delta u)) \mid u]\, u \, du
    = \int_S (1 - c_t(P_W(w_t + \delta u)))\, u \, du
    = 0 - \int_S c_t(P_W(w_t + \delta u))\, u \, du
    = -E_u[c_t(P_W(w_t + \delta u))u].

4.2. Gradient Quality & Function Smoothing

We now characterize the quality of the proposed gradient approximation (5). Let $\hat{c}_t$ denote a smoothed version of some function $c_t$,

    \hat{c}_t(w) = E_{x \in B}[c_t(P_W(w + \delta x))],

where $x$ is selected uniformly within the unit ball $B$. We can show using Stokes' Theorem that our sampled gradient direction is an unbiased estimate of $\nabla\hat{c}_t$.

Lemma 2. Fix $\delta > 0$. Over random unit vectors $u$,

    E_u[c_t(P_W(w + \delta u))u] = \frac{\delta}{d}\nabla\hat{c}_t(w),

where $d$ is the dimensionality of $x$. (Proof analogous to Lemma 2.1 of Flaxman et al., 2005.)

Combining Lemma 1 and Lemma 2 implies that DBGD is implicitly performing gradient descent over

    \hat{\epsilon}_t(w) = E_{x \in B}[\epsilon_t(P_W(w + \delta x))].    (7)

Note that $|\hat{\epsilon}_t(w) - \epsilon_t(w)| \leq \delta L$, and that $\hat{\epsilon}_t$ is parameterized by $\delta$ (suppressed for brevity). Hence, good regret bounds defined on $\hat{\epsilon}_t$ imply good bounds defined on $\epsilon_t$, with $\delta$ controlling the difference. One concern is that $\hat{\epsilon}_t$ might not be convex at $w_t$. Observation 1 showed that $\epsilon_t$ is convex at $w_t$, and thus satisfies $\epsilon_t(w_t) - \epsilon_t(w^*) \leq \nabla\epsilon_t(w_t) \cdot (w_t - w^*)$. We now show that $\hat{\epsilon}_t(w_t)$ is "almost convex" in a specific way.

Theorem 1. For $\lambda$ defined as

    \lambda = \frac{L_\sigma}{L_\sigma - \delta L_v L_2}    (8)

and $\delta \in (0, L_\sigma/(L_v L_2))$,

    \hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*) \leq \lambda \nabla\hat{\epsilon}_t(w_t) \cdot (w_t - w^*) + (3 + \lambda)\delta L.

Proof. First define $w_{t,x} \equiv P_W(w_t + \delta x)$ and $\epsilon_{t,x}(w) \equiv \epsilon(w_{t,x}, w)$. We rewrite $\hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*)$ as

    = E_{x \in B}[\epsilon_t(P_W(w_t + \delta x)) - \epsilon_t(P_W(w^* + \delta x))]
    \leq E_{x \in B}[\epsilon_{t,x}(w_{t,x}) - \epsilon_{t,x}(w^*)] + 3\delta L    (9)
    \leq E_{x \in B}[\nabla\epsilon_{t,x}(w_{t,x}) \cdot (w_{t,x} - w^*)] + 3\delta L,    (10)

where (9) follows from $\epsilon$ being $L$-Lipschitz, and (10) follows from $w_{t,x}$ and $w^*$ both lying in the convex region of $\epsilon_{t,x}$. Now define $\sigma_t(y) \equiv \sigma(v(w_t) - y)$ and $\sigma_{t,x}(y) \equiv \sigma(v(w_{t,x}) - y)$. We can see that $\nabla\epsilon_t(w_{t,x}) = \sigma_t'(v(w_{t,x}))\nabla v(w_{t,x})$, and similarly $\nabla\epsilon_{t,x}(w_{t,x}) = \sigma_{t,x}'(v(w_{t,x}))\nabla v(w_{t,x})$. We can then write (10) as

    = E_x[\sigma_{t,x}'(v(w_{t,x}))\nabla v(w_{t,x}) \cdot (w_{t,x} - w^*)] + 3\delta L.    (11)

We know that both $\sigma_{t,x}'(y) \leq 0$ and $\sigma_t'(y) \leq 0$, and that $\sigma_{t,x}'(v(w_{t,x})) = -L_\sigma$, since that is the inflection point. Thus

    -L_\sigma \leq \sigma_t'(v(w_{t,x})) \leq -L_\sigma + \delta L_v L_2,

which follows from $\sigma$ being second-order $L_2$-Lipschitz and $|v(w_t) - v(w_{t,x})| \leq \delta L_v$. Since $\epsilon_{t,x}(w_{t,x}) - \epsilon_{t,x}(w^*) \geq 0$, the term inside the expectation in (11) is also non-negative. Using our definition of $\lambda$ (8), we can write (11) as

    \leq \lambda E_x[\sigma_t'(v(w_{t,x}))\nabla v(w_{t,x}) \cdot (w_{t,x} - w^*)] + 3\delta L
    = \lambda E_x[\nabla\epsilon_t(w_{t,x}) \cdot (w_{t,x} - w^*)] + 3\delta L
    = \lambda E_x[\nabla\epsilon_t(w_{t,x}) \cdot (w_{t,x} - w_t + w_t - w^*)] + 3\delta L
    \leq \lambda E_x[\nabla\epsilon_t(w_{t,x}) \cdot (w_t - w^*)] + (3 + \lambda)\delta L    (12)
    = \lambda \nabla\hat{\epsilon}_t(w_t) \cdot (w_t - w^*) + (3 + \lambda)\delta L,

where (12) follows from observing that $E_x[\nabla\epsilon_t(w_{t,x}) \cdot (w_{t,x} - w_t)] \leq \delta E_x[\|\nabla\epsilon_t(w_{t,x})\|] \leq \delta L$.
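The following sketch checks Lemmas 1 and 2 numerically: it averages the one-point estimates $-(d/\delta)X_t(P_W(w_t + \delta u))u$ over many simulated duels and compares the result to the analytic gradient of $c_t$. The quadratic value function and the large ball $W$ are hypothetical choices for illustration; for small $\delta$, $\nabla\hat{c}_t \approx \nabla c_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 2, 0.05
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))      # logistic link
w_t = np.array([1.0, -2.0])

def c_t(W):
    """P(w_t beats w) = sigma(v(w_t) - v(w)) for v(w) = -||w||^2 (hypothetical)."""
    v = lambda X: -np.sum(np.atleast_2d(X) ** 2, axis=1)
    return sigma(v(w_t) - v(W))

# Monte Carlo average of the one-point estimates g = -(d/delta) * X_t * u.
# (The projection P_W is omitted: all probes stay well inside a large ball W.)
n = 500_000
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)             # uniform unit vectors
X = (rng.random(n) < 1.0 - c_t(w_t + delta * U)).astype(float)  # duel outcomes
g_hat = np.mean(-(d / delta) * X[:, None] * U, axis=0)

# Analytic gradient of c_t at w_t: sigma'(0) * (-grad v(w_t)) = 0.25 * 2 * w_t.
g_true = 0.25 * 2.0 * w_t
print(g_hat, g_true)    # the estimate should roughly match [0.5, -1.0]
```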
4.3. Regret Bound for DBGD

Thus far, we have focused on proving properties of the relative loss functions $\epsilon_t$ and $\hat{\epsilon}_t$. We can easily bound our regret formulation (2) using $\epsilon_t$.

Lemma 3. Fix $\delta > 0$. Expected regret is bounded by

    E[\Delta_T] \leq -2E\left[\sum_{t=1}^{T} \epsilon_t(w^*)\right] + \delta LT.

Proof. We can write expected regret as

    E[\Delta_T] \leq 2E\left[\sum_{t=1}^{T} \epsilon^*(w_t)\right] + \delta LT = -2E\left[\sum_{t=1}^{T} \epsilon_t(w^*)\right] + \delta LT,

by noting that $|\epsilon^*(w_t) - \epsilon^*(w_t')| \leq \delta L$, and also that $\epsilon^*(w_t) = -\epsilon_t(w^*)$.

We now analyze the regret behavior of the smoothed loss functions $\hat{\epsilon}_t$. Lemma 4 provides a useful intermediate result; note that the regret formulation analyzed in Lemma 4 is different from (2).

Lemma 4. Fix $\delta \in (0, L_\sigma/(L_v L_2))$, and define $\lambda$ as in (8). Assume a sequence of smoothed relative loss functions $\hat{\epsilon}_1, \ldots, \hat{\epsilon}_T$ ($\hat{\epsilon}_{t+1}$ depending on $w_t$) and $w_1, \ldots, w_T \in W$ defined by $w_1 = 0$ and $w_{t+1} = P_W(w_t - \eta g_t)$, where $\eta > 0$ and $g_1, \ldots, g_T$ are vector-valued random variables with (a) $E[g_t \mid w_t] = \nabla\hat{\epsilon}_t(w_t)$, (b) $\|g_t\| \leq G$, and (c) $W \subseteq RB$. Then for $\eta = R/(G\sqrt{T})$,

    E\left[\sum_{t=1}^{T} \hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*)\right] \leq \lambda RG\sqrt{T} + (3 + \lambda)\delta LT.    (13)

(Adapted from Lemma 3.1 in Flaxman et al., 2005.)

Proof. Theorem 1 implies the LHS of (13) to be

    \leq \sum_{t=1}^{T} E\left[ \lambda \nabla\hat{\epsilon}_t(w_t) \cdot (w_t - w^*) + (3 + \lambda)\delta L \right]
    = \lambda \sum_{t=1}^{T} E\left[ E[g_t \mid w_t] \cdot (w_t - w^*) \right] + (3 + \lambda)\delta LT
    = \lambda \sum_{t=1}^{T} E[g_t \cdot (w_t - w^*)] + (3 + \lambda)\delta LT.    (14)

Following the analysis of Zinkevich (2003), we use the potential function $\|w_t - w^*\|^2$. In particular, we can rewrite $\|w_{t+1} - w^*\|^2$ as

    = \|P_W(w_t - \eta g_t) - w^*\|^2
    \leq \|w_t - \eta g_t - w^*\|^2    (15)
    = \|w_t - w^*\|^2 - 2\eta (w_t - w^*) \cdot g_t + \eta^2 \|g_t\|^2
    \leq \|w_t - w^*\|^2 + \eta^2 G^2 - 2\eta (w_t - w^*) \cdot g_t,

where (15) follows from the convexity of $W$. Rearranging terms allows us to bound $g_t \cdot (w_t - w^*)$ as

    g_t \cdot (w_t - w^*) \leq \frac{\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2}{2\eta} + \frac{\eta G^2}{2}.    (16)

We can thus bound $\sum_{t=1}^{T} E[g_t \cdot (w_t - w^*)]$ by

    \sum_{t=1}^{T} \frac{E\left[\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2\right]}{2\eta} + \frac{T\eta G^2}{2} \leq \frac{E[\|w_1 - w^*\|^2]}{2\eta} + \frac{T\eta G^2}{2} \leq \frac{R^2}{2\eta} + \frac{T\eta G^2}{2},    (17)

which follows from choosing $w_1 = 0$ and $W \subseteq RB$. Choosing $\eta = R/(G\sqrt{T})$ bounds (17) by $RG\sqrt{T}$. Combining (14) and (17) bounds the LHS of (13) by $\lambda RG\sqrt{T} + (3 + \lambda)\delta LT$, which finishes the proof.

We finally present our main result.

Theorem 2. By setting $w_1 = 0$,

    \delta = \frac{\sqrt{2Rd}}{\sqrt{13L}\, T^{1/4}}, \quad \gamma = \frac{R}{\sqrt{T}}, \quad T > \left(\frac{L_v L_2 \sqrt{2Rd}}{L_\sigma \sqrt{13L}}\right)^4,    (18)

DBGD achieves expected regret (2) bounded by

    E[\Delta_T] \leq 2\lambda_T T^{3/4} \sqrt{26RdL},

where

    \lambda_T = \frac{L_\sigma \sqrt{13L}\, T^{1/4}}{L_\sigma \sqrt{13L}\, T^{1/4} - L_v L_2 \sqrt{2Rd}}.

Proof. Adapting from Flaxman et al. (2005), if we let

    g_t = -\frac{d}{\delta} X_t(P_W(w_t + \delta u_t)) u_t,    (19)

using $X_t$ as described in (6), then by Lemma 1 and Lemma 2 we have $E[g_t \mid w_t] = \nabla\hat{\epsilon}_t(w_t)$. Restricting $T$ as in (18) guarantees $\delta \in (0, L_\sigma/(L_v L_2))$. We can then apply Lemma 4 using the update rule

    w_{t+1} = P_W(w_t - \eta g_t) = P_W\!\left(w_t + \frac{\eta d}{\delta} X_t(P_W(w_t + \delta u_t)) u_t\right),

which is exactly the update rule of DBGD if we set $\gamma = \eta d/\delta$. Note that $\|g_t\| \leq d/\delta$. Setting $G = d/\delta$ and recalling our choice of $\eta = R/(G\sqrt{T})$, we have $\gamma = R/\sqrt{T}$. Applying Lemma 4 yields

    E\left[\sum_{t=1}^{T} \hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*)\right] \leq \frac{\lambda_T Rd\sqrt{T}}{\delta} + (3 + \lambda_T)\delta LT.    (20)

Combining Lemma 3 and (20) yields

    E[\Delta_T] \leq -2E\left[\sum_{t=1}^{T} \epsilon_t(w^*)\right] + \delta LT
    = 2E\left[\sum_{t=1}^{T} \epsilon_t(w_t) - \epsilon_t(w^*)\right] + \delta LT
    \leq 2E\left[\sum_{t=1}^{T} \hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*)\right] + 5\delta LT
    \leq \frac{2\lambda_T Rd\sqrt{T}}{\delta} + (11 + 2\lambda_T)\delta LT
    \leq \lambda_T \left(\frac{2Rd\sqrt{T}}{\delta} + 13\delta LT\right),

where the second line uses $\epsilon_t(w_t) = 0$, the third uses $|\hat{\epsilon}_t(w) - \epsilon_t(w)| \leq \delta L$, and the last uses $\lambda_T \geq 1$. Choosing $\delta = \sqrt{2Rd}/(\sqrt{13L}\, T^{1/4})$ balances the two terms and completes the proof.

Corollary 1. Using the choices of $w_1$, $\gamma$, and $\delta$ stated in Theorem 2, if

    T > \left(\left(1 + \frac{1}{\alpha}\right) \frac{L_v L_2 \sqrt{2Rd}}{L_\sigma \sqrt{13L}}\right)^4

for $\alpha > 0$, then

    E[\Delta_T] \leq 2(1 + \alpha) T^{3/4} \sqrt{26RdL}.

The potential non-convexity of $\hat{\epsilon}_t$ significantly complicates the regret bound. By additionally assuming that $\hat{\epsilon}_t$ is convex at $w_t$ (which we have observed empirically in many settings), we arrive at a much simpler result.

Corollary 2. Assume for all possible $w_t$ that $\hat{\epsilon}_t$ is convex at $w_t$, which implies $\hat{\epsilon}_t(w_t) - \hat{\epsilon}_t(w^*) \leq \nabla\hat{\epsilon}_t(w_t) \cdot (w_t - w^*)$. Then for $w_1 = 0$, $\gamma = R/\sqrt{T}$, and $\delta = \sqrt{2Rd}/(\sqrt{5L}\, T^{1/4})$, we have

    E[\Delta_T] \leq 2T^{3/4} \sqrt{10RdL}.

(Proof very similar to that of Theorem 2, and omitted.)
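To get a feel for the parameter choices in Theorem 2, the snippet below evaluates $\delta$, $\gamma$, and the resulting bound for some made-up constants ($R$, $d$, $L$ below are assumptions, not values from the paper). The factor $\lambda_T$ is omitted since it tends to 1; note that the constant-heavy bound only becomes non-vacuous (per-duel regret below 1) for fairly large $T$.

```python
import math

def theorem2_params(R, d, L, T):
    """delta, gamma, and the Theorem 2 regret bound (lambda_T omitted)."""
    delta = math.sqrt(2 * R * d / (13 * L)) / T ** 0.25
    gamma = R / math.sqrt(T)
    bound = 2 * T ** 0.75 * math.sqrt(26 * R * d * L)
    return delta, gamma, bound

R, d, L = 10.0, 50, 1.0                     # hypothetical problem constants
for T in (10**6, 10**9, 10**12):
    delta, gamma, bound = theorem2_params(R, d, L, T)
    # bound / T is the average (per-duel) regret bound, shrinking as T^{-1/4}
    print(f"T=1e{round(math.log10(T))}: delta={delta:.3f} "
          f"gamma={gamma:.2e} avg-regret<={bound / T:.2f}")
```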
4.4. Practical Considerations

Choosing $\delta$ to achieve the regret bound stated in Theorem 2 requires knowledge of $\epsilon_t$ (i.e., of $L$), which is typically unavailable in practical settings. The regret bound is, however, robust to the choice of $\delta$: sublinear regret is achievable using many choices of $\delta$, as we will verify empirically. In the analysis, $w_1 = 0$ was chosen to minimize its distance to any other point in $W$. In certain settings we might choose $w_1 \neq 0$, in which case our analysis still follows, with slightly worse constants.

5. Experiments

5.1. Synthetic Value Functions

We first experimented using synthetic value functions, which allows us to test the robustness of DBGD to different choices of $\delta$. Since $L$ is unknown, we introduced a free parameter $\lambda_L$ and used

    \delta = T^{-1/4} \sqrt{\lambda_L \cdot 0.4 R d}.

We tested five settings, P1 to P5. Each setting optimizes over a 50-dimensional ball of radius 10 and uses the logistic link function with a different value function; the five value functions explore a range of curvatures (which affect the Lipschitz constants) and symmetries:

    v_1(w) = -w^T w
    v_2(w) = -\|w\|
    v_3(w) = -\sum_{i\,\mathrm{odd}} 2|w(i)| - \sum_{i\,\mathrm{even}} |w(i)|
    v_4(w) = -\sum_i \left( e^{w(i)} + e^{-w(i)} \right)
    v_5(w) = v_3(w) - \sum_{i:\, i\%3=1} e^{[w(i)]_+} - \sum_{i:\, i\%3=2} e^{[-w(i)]_+}

The initial point is $w_1 = \sqrt{5/d}\cdot\mathbf{1}$. Table 1 shows the regret over the interesting range of $\lambda_L$ values; performance degrades gracefully beyond this range. Note that the regret of a random point is about 1, since most points in $W$ have much lower value than $v(w^*)$.

Table 1. Average regret of DBGD with synthetic value functions.

    lambda_L   P1     P2     P3     P4     P5
    0.6        0.465  0.803  0.687  0.500  0.710
    0.8        0.398  0.767  0.628  0.378  0.663
    1          0.334  0.760  0.604  0.325  0.674
    2          0.303  0.780  0.637  0.304  0.798
    3          0.415  0.807  0.663  0.418  0.887
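Putting the pieces together, here is a self-contained sketch of a P1-style run (quadratic value $v_1$, logistic link, $d = 50$, $R = 10$), with $\delta$ set by the $\lambda_L$ rule above. The duel simulation and regret bookkeeping follow Section 3; details such as the horizon $T$ and the random seed are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, T = 50, 10.0, 100_000
lam_L = 1.0                                    # free surrogate for the unknown L
delta = T ** -0.25 * np.sqrt(lam_L * 0.4 * R * d)
gamma = R / np.sqrt(T)

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))     # logistic link
v = lambda w: -(w @ w)                         # value function v1 (setting P1)

def project(w):
    n = np.linalg.norm(w)
    return w if n <= R else w * (R / n)

w = np.full(d, np.sqrt(5.0 / d))               # initial point w1
total = 0.0
for t in range(T):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    w_probe = project(w + delta * u)
    # regret increment (2): eps(w*, w_t) + eps(w*, w_t'), with v(w*) = 0
    total += (sigma(-v(w)) - 0.5) + (sigma(-v(w_probe)) - 0.5)
    if rng.random() < sigma(v(w_probe) - v(w)):    # the probe wins the duel
        w = project(w + gamma * u)
print("average regret:", total / T)
```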
We also compared against Bandit Gradient Descent (BGD) (Flaxman et al., 2005). Like DBGD, BGD explores in random directions at each iteration. However, BGD assumes access to the values $P(w_t \succ w)$, whereas DBGD only observes random comparison outcomes; BGD thus assumes strictly more information. (Our analysis yields matching upper bounds on expected regret for all three methods, though it can be shown that the BGD gradient estimates have lower variance.) We evaluated two versions: BGD1, using $P(w_t \succ w)$, and BGD2, using $\epsilon_t(w) = P(w_t \succ w) - 1/2$. We expect BGD2 to perform best, since the sign of $\epsilon_t(w)$ reveals significant information regarding the true gradient.

[Figure 2. Average regret on P1 and P5 for $\lambda_L = 1$, comparing DBGD, BGD1, and BGD2 over $10^5$ iterations.]

Figure 2 shows the average regret for problems P1 and P5 with $\lambda_L = 1$. We observe that the behaviors of DBGD and BGD are very similar on both. Interestingly, DBGD outperforms BGD1 on P5 despite having less information. We also observe this trend for P2 and P3; note that all three problems have significant linear components.

5.2. Web Search Dataset

For a more realistic simulation environment, we leveraged a real web search dataset (courtesy of Chris Burges at Microsoft Research). The idea is to simulate users issuing queries by sampling from queries in the dataset. For each query, the competing retrieval functions produce rankings, after which the "user" randomly prefers one ranking over the other; we used a value function based on NDCG@10 (defined below) to determine the comparison outcome probabilities. We stress that our usage of the dataset is very different from supervised learning settings. In particular, (extensions of) our algorithm might be applied to experiments involving real users, where very little is known about each user's internal value function. We leverage this dataset as a reasonable first step for simulating user behavior in an on-line learning setting.

The training, validation and test sets each consist of 1000 queries. We simulated only on the training set, although we measured performance on the other sets to check for, e.g., generalization power. There are about 50 documents per query, and documents are labeled by five levels of relevance, from 0 (Bad) to 4 (Perfect). The compatibility between a document/query pair is represented using 367 features. A standard retrieval function computes a score for each document based on these features, with the final ranking resulting from sorting by the scores. For simplicity, we considered only linear functions $w$, so that the score for a document with feature vector $x$ is $w^T x$. Since only the direction of $w$ matters, we are thus optimizing over a 367-dimensional unit sphere.

Our value function is based on Normalized Discounted Cumulative Gain (NDCG), a common measure for evaluating rankings (Donmez et al., 2009). For a query $q$, NDCG@K of a ranking of the documents of $q$ is

    \frac{1}{N_K^{(q)}} \sum_{k=1}^{K} \frac{2^{r_k} - 1}{\log(k+1)},

where $r_k$ is the relevance level of the $k$-th ranked document, and $N_K^{(q)}$ is a normalization factor (which differs from query to query) such that the best ranking achieves NDCG@K = 1. For our experiments, we used the logistic link function with $10 \times$ NDCG@10 to make probabilistic comparisons.
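For concreteness, here is a direct implementation of NDCG@K as defined above, with the per-query normalizer computed from the ideal reordering; the base-2 logarithm in the discount is the common convention.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@K = sum over the top K of (2^r - 1) / log2(rank + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@K: DCG of the given ranking over the best achievable DCG N_K."""
    ideal = sorted(ranked_relevances, reverse=True)
    denom = dcg_at_k(ideal, k)                  # the per-query normalizer
    return dcg_at_k(ranked_relevances, k) / denom if denom > 0 else 0.0

# e.g. a ranking that places a "Perfect" (grade 4) document third:
print(ndcg_at_k([2, 0, 4, 1, 0], k=10))
```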
We note a few properties of this setup, some going beyond the assumptions in Section 3.1; this allows us to further examine the generality of DBGD. First, the value function is now random (it depends on the query). Second, our feasible space $W$ is the unit sphere and is not convex, although it is a well-behaved manifold. Third, we assume a homogeneous user group (i.e., all users have the same value function, NDCG@10). Fourth, rankings vary discontinuously w.r.t. document scores, so NDCG@10 is a discontinuous value function. We addressed this issue by comparing multiple queries (i.e., delaying multiple iterations) before an update decision, and also by using larger choices of $\delta$ and $\gamma$. Lastly, even smoothed versions of NDCG have local optima (Donmez et al., 2009), making it difficult to find $w^*$ (which is required for computing regret); we therefore used NDCG@10 to measure performance.

We tested DBGD for $T = 10^7$ and a range of $\delta$ and $\gamma$ values. Table 2 shows the average (across all iterations) and final training NDCG@10 when comparing 100 queries per update. Performance peaks at $(\delta, \gamma) = (1, 0.01)$ and degrades smoothly. We found similar results when varying the number of queries compared per update.

Table 2. Average (upper block) and final (lower block) NDCG@10 on the Web Search training set (sampling 100 queries per iteration).

    Average NDCG@10
    gamma \ delta   0.5    0.8    1      3
    0.001           0.524  0.533  0.537  0.529
    0.005           0.570  0.575  0.575  0.565
    0.01            0.580  0.582  0.584  0.573
    0.05            0.569  0.576  0.577  0.575
    0.1             0.557  0.566  0.568  0.571

    Final NDCG@10
    gamma \ delta   0.5    0.8    1      3
    0.001           0.559  0.564  0.568  0.557
    0.005           0.591  0.593  0.592  0.581
    0.01            0.592  0.593  0.595  0.582
    0.05            0.569  0.574  0.582  0.577
    0.1             0.565  0.559  0.570  0.576

[Figure 3. Training NDCG@10 on the Web Search training set vs. number of comparisons (up to $10^7$), for the best models when sampling 1, 10, and 100 queries per update.]

Figure 3 depicts per-iteration NDCG@10 for the best models when sampling 1, 10 and 100 queries per update. Making multiple comparisons per update has no impact on performance (the best parameters are typically smaller when sampling fewer queries). Sampling multiple queries is very realistic, since a search system might be constrained to, e.g., making daily updates to its ranking function.
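The delayed-update variant just described can be sketched as follows. The aggregation rule is an assumption on our part: the paper states that multiple queries are compared before an update decision but does not spell out how the per-query outcomes are combined, so majority voting is used here as one plausible choice.

```python
def delayed_duel(compare_on_query, queries, w, w_probe):
    """One update decision aggregated over several sampled queries.

    compare_on_query(q, w, w_probe) -> True iff w_probe wins the duel on q.
    Returns True iff w_probe wins a strict majority of the per-query duels.
    (Majority voting is an assumed aggregation; the paper leaves this open.)
    """
    wins = sum(bool(compare_on_query(q, w, w_probe)) for q in queries)
    return 2 * wins > len(queries)
```

In the Table 2 setting, `queries` would hold 100 queries sampled from the training set for each iteration of DBGD.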
Performance on the validation and test sets closely follows training set performance (so we omit those results), which implies that our method is not overfitting. For completeness, we compared our best DBGD models with a ranking SVM, which optimizes over pairwise document preferences and is a standard baseline in supervised learning-to-rank settings. More sophisticated methods (e.g., Chakrabarti et al., 2008; Donmez et al., 2009) can further improve performance. Table 3 shows that DBGD approaches ranking SVM performance despite making fundamentally different assumptions (e.g., ranking SVMs have access to very specific document-level information). We caution against over-optimizing here, and instead advocate developing more realistic experimental settings.

Table 3. Ranking SVM vs. final DBGD models: average NDCG@10 and per-query win/tie/loss counts (DBGD vs. SVM).

    Model        NDCG@10   W/T/L
    SVM          0.612     -
    Sample 1     0.596     490/121/389
    Sample 5     0.593     489/121/390
    Sample 10    0.589     504/118/378
    Sample 25    0.593     489/118/393
    Sample 50    0.596     472/119/409
    Sample 100   0.595     490/116/394

6. Conclusion

We have presented an on-line learning framework based on pairwise comparisons, which fits naturally with recent work on deriving reliable pairwise judgments from implicit feedback. Our proposed algorithm, DBGD, achieves sublinear regret. As evidenced by our simulations on web search data, DBGD can be applied much more generally than our theoretical analysis suggests. This calls for more sophisticated formulations that account for properties such as heterogeneous user behavior, query-dependent value functions, and the discontinuity of rankings. Another interesting direction is adaptively choosing $\delta$ and $\gamma$ for any-time regret analyses. Our framework is extendable in many ways, such as integrating pairwise document preferences (Joachims et al., 2007; Carterette et al., 2008) and diversity (Yue & Joachims, 2008; Radlinski et al., 2008a). Progress in this area can lead to cost-effective systems for a variety of application domains such as personalized search, enterprise search, and small interest groups.

Acknowledgements

The work was funded under NSF Award IIS-0713483, NSF CAREER Award 0237381, and a gift from Yahoo! Research. The first author is also partly funded by a Microsoft Research Graduate Fellowship and a Yahoo! Key Technical Challenges Grant. The authors also thank Robert Kleinberg, Josef Broder, and the anonymous reviewers for their helpful comments.

References

Agichtein, E., Brill, E., & Dumais, S. (2006). Improving Web Search Ranking by Incorporating User Behavior Information. ACM Conference on Information Retrieval (SIGIR) (pp. 19-26).

Carterette, B., Bennett, P., Chickering, D. M., & Dumais, S. (2008). Here or There: Preference Judgments for Relevance. European Conference on Information Retrieval (ECIR) (pp. 16-27).

Carterette, B., & Jones, R. (2007). Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks. Neural Information Processing Systems (NIPS) (pp. 217-224).

Chakrabarti, S., Khanna, R., Sawant, U., & Bhattacharyya, C. (2008). Structured Learning for Non-Smooth Ranking Losses. ACM Conference on Knowledge Discovery and Data Mining (KDD) (pp. 88-96).

Donmez, P., Svore, K., & Burges, C. (2009). On the Local Optimality of LambdaRank. ACM Conference on Information Retrieval (SIGIR).

Dupret, G., & Piwowarski, B. (2008). A User Browsing Model to Predict Search Engine Click Data from Past Observations. ACM Conference on Information Retrieval (SIGIR) (pp. 331-338).

Flaxman, A., Kalai, A., & McMahan, H. B. (2005). Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient. ACM-SIAM Symposium on Discrete Algorithms (SODA) (pp. 385-394).

Joachims, T., Granka, L., Pan, B., Hembrooke, H., Radlinski, F., & Gay, G. (2007). Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search. ACM Transactions on Information Systems (TOIS), 25, 7:1-26.

Kleinberg, R. (2004). Nearly Tight Bounds for the Continuum-Armed Bandit Problem. Neural Information Processing Systems (NIPS) (pp. 697-704).

Langford, J., & Zhang, T. (2007). The Epoch-Greedy Algorithm for Contextual Multi-Armed Bandits. Neural Information Processing Systems (NIPS) (pp. 817-824).

Pandey, S., Agarwal, D., Chakrabarti, D., & Josifovski, V. (2007). Bandits for Taxonomies: A Model-Based Approach. SIAM Conference on Data Mining (SDM) (pp. 216-227).

Radlinski, F., Kleinberg, R., & Joachims, T. (2008a). Learning Diverse Rankings with Multi-Armed Bandits. International Conference on Machine Learning (ICML) (pp. 784-791).

Radlinski, F., Kurup, M., & Joachims, T. (2008b). How Does Clickthrough Data Reflect Retrieval Quality? ACM Conference on Information and Knowledge Management (CIKM) (pp. 43-52).

Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2009). The K-Armed Dueling Bandits Problem. Conference on Learning Theory (COLT).

Yue, Y., & Joachims, T. (2008). Predicting Diverse Subsets Using Structural SVMs. International Conference on Machine Learning (ICML) (pp. 1224-1231).

Zinkevich, M. (2003). Online Convex Programming and Generalized Infinitesimal Gradient Ascent. International Conference on Machine Learning (ICML) (pp. 928-936).