Efficient Selection of Multiple Bandit Arms: Theory and Practice

Shivaram Kalyanakrishnan  shivaram@cs.utexas.edu
Peter Stone  pstone@cs.utexas.edu
Department of Computer Science, The University of Texas at Austin, Austin, Texas 78712-0233, USA

Abstract

We consider the general, widely applicable problem of selecting from n real-valued random variables a subset of size m of those with the highest means, based on as few samples as possible. This problem, which we denote Explore-m, is a core aspect in several stochastic optimization algorithms, and applications of simulation and industrial engineering. The theoretical basis for our work is an extension of a previous formulation using multi-armed bandits that is devoted to identifying just the one best of n random variables (Explore-1). In addition to providing PAC bounds for the general case, we tailor our theoretically grounded approach to work efficiently in practice. Empirical comparisons of the resulting sampling algorithm against state-of-the-art subset selection strategies demonstrate significant gains in sample efficiency.

1. Introduction

We consider the problem of efficient subset selection: given n real-valued random variables, our task is to reliably identify m among them with the highest means, while keeping the total number of samples minimal. This problem merits attention from several fields, including industrial engineering (Koenig & Law, 1985), simulation (Kim & Nelson, 2001), and evolutionary computation (Schmidt et al., 2006).
The theoretical basis for our work is an extension of recent research from Even-Dar et al. (2006) (and Mannor & Tsitsiklis (2004)), who consider a problem of "pure exploration" in a multi-armed bandit (Berry & Fristedt, 1985). Unlike traditional bandit problems, in which the rewards accrued during the exploratory phase must be maximized (and thus the regret minimized), in their problem the aim is to minimize the total number of samples needed to reliably identify an arm with an expected reward that is within ε of the maximum achievable (an "ε-optimal" arm). The parameter ε serves to specify tolerance. Casting the problem into a PAC framework, Even-Dar et al. (2006) provide an algorithm that identifies an ε-optimal arm in an n-armed bandit with probability at least 1 − δ, incurring a sample complexity that is O((n/ε²) log(1/δ)). This sample complexity matches, up to constants, a lower bound derived by Mannor & Tsitsiklis (2004).

In this paper we generalize the problem formulated by Even-Dar et al. (2006): our aim is to identify m arms of an n-armed bandit such that the expected reward of each chosen arm is at least p_m − ε, where p_m is the mth highest expected reward among the bandit's arms. We denote this problem "Explore-m" (thus, the special case studied by Even-Dar et al. (2006) is Explore-1). We extend the "median elimination" algorithm proposed by Even-Dar et al. (2006) to provide an algorithm for Explore-m that satisfies a PAC constraint similar to the one for Explore-1, while achieving a sample complexity that is O((n/ε²) log(m/δ)). We argue that in practice, sample efficiency can be further improved by adapting to data through a sequential procedure, while preserving the PAC guarantee. Empirical results affirm significant gains for the resulting adaptive method when compared with state-of-the-art subset selection methods (Chen et al., 2008; Heidrich-Meisner & Igel, 2009).

This paper is organized as follows. We define the Explore-m problem in Section 2. In Section 3 we provide three PAC algorithms for Explore-m and analyze their sample complexity. In Sections 2 and 3 we closely follow the presentation style of Even-Dar et al. (2006). In Section 4 we present an adaptive method for solving Explore-m in practice. Empirical evaluation and comparisons follow in Section 5. We discuss related work and conclude in Section 6.

2. Problem Statement

We consider an n-armed bandit with its arms numbered 1, 2, . . . , n. Each sample (or "pull") of arm a yields a reward of either 0 or 1, generated randomly from a Bernoulli distribution with mean p_a. Without loss of generality we assume that

  p_1 > p_2 > . . . > p_n.   (1)

The arm distributions are independent, and do not depend on the history of pulls.[1] Arm a is defined to be (ε, m)-optimal, ε > 0, m ∈ {1, 2, . . . , n}, if

  p_a ≥ p_m − ε.   (2)

To solve the Explore-m problem, an algorithm may sample arms of the n-armed bandit and record the results of its pulls; the algorithm is required to terminate and return a set of m arms. Such an algorithm L is defined to be (ε, m, δ)-optimal, δ ∈ (0, 1), if with probability at least 1 − δ, each of the arms it returns is (ε, m)-optimal. Note that we do not require the m arms returned by L to be in any particular order. The sample complexity of L is the total number of pulls it performs before termination.

Let us denote by T_i the set {1, 2, . . . , i}. From (1) and (2) we see that every arm in T_m is (ε, m)-optimal. Hence, there are at least m (ε, m)-optimal arms, and so the Explore-m problem is well-defined. Let us denote by B the set of arms that are not (ε, m)-optimal. In general 0 ≤ |B| ≤ (n − m). Note that Explore-m is a trivial problem when m = n.

[1] The analysis and algorithms in this paper easily extend to the case where arm distributions have known, bounded ranges, and also when some have equal means.
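For concreteness, the bandit model and the (ε, m)-optimality criterion of this section can be rendered as a small Python simulator; the class and helper below are illustrative sketches (the names are ours, and the learner of course never sees the true means).

    import random

    class BernoulliBandit:
        """n-armed bandit: pulling arm a returns a 0/1 reward with hidden mean means[a]."""
        def __init__(self, means):
            self.means = list(means)
            self.total_pulls = 0          # sample-complexity counter
        def pull(self, a):
            self.total_pulls += 1
            return 1 if random.random() < self.means[a] else 0

    def is_eps_m_optimal(means, chosen, m, eps):
        """Criterion (2): every chosen arm has true mean at least p_m - eps,
        where p_m is the mth highest true mean."""
        p_m = sorted(means, reverse=True)[m - 1]
        return all(means[a] >= p_m - eps for a in chosen)

An (ε, m, δ)-optimal algorithm must, with probability at least 1 − δ over its own pulls, return a set of m arm indices for which is_eps_m_optimal holds.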
3. Algorithms and PAC Analysis

Even-Dar et al. (2006) present two (ε, 1, δ)-optimal algorithms for Explore-1: a naïve algorithm that achieves a sample complexity O((n/ε²) log(n/δ)), and a "median elimination" algorithm that improves the sample complexity to O((n/ε²) log(1/δ)). We generalize these existing algorithms to construct three (ε, m, δ)-optimal algorithms for Explore-m. Our first algorithm, "Direct," essentially implements the naïve strategy of pulling each arm a fixed number of times. The second algorithm, "Incremental," uses the median elimination algorithm as a subroutine. The sample complexities of both methods are improved by our third algorithm, "Halving," which modifies the median elimination algorithm to identify m arms instead of 1. In the following text, ln denotes the natural logarithm and log denotes the logarithm to the base 2.

DIRECT: Under Direct (Algorithm 1), arms are sampled a fixed number of times (line 2) such that with high probability the m arms with the highest empirical averages (denoted p̂) are all (ε, m)-optimal.

Algorithm 1 Direct(n, m, ε, δ)
1: for all a ∈ T_n do
2:   Sample arm a (2/ε²) ln(2n/δ) times; let p̂_a be its average reward.
3: end for
4: Find S ⊆ T_n such that |S| = m, and ∀i ∈ S, ∀j ∈ (T_n − S): (p̂_i ≥ p̂_j).
5: Return S.

Theorem 1. Direct(n, m, ε, δ) is (ε, m, δ)-optimal with sample complexity O((n/ε²) log(n/δ)).

Proof. Recall that B is the set of arms in T_n that are not (ε, m)-optimal. From (1) and (2) we can relate T_m and B as follows:

  ∀i ∈ T_m, ∀j ∈ B : (p_i − p_j > ε).   (3)

Since |S| = m, an arm j in B can occur in S only if there is some arm i in T_m such that p̂_i ≤ p̂_j (line 4). In turn (3) implies that the latter event can only occur if p̂_i ≤ p_i − ε/2 or p̂_j ≥ p_j + ε/2. Switching to probabilities, applying the union bound and Hoeffding's inequality, we get:

  P(∃j ∈ B : j ∈ S)
    ≤ P(∃i ∈ T_m : p̂_i ≤ p_i − ε/2) + P(∃j ∈ B : p̂_j ≥ p_j + ε/2)
    ≤ Σ_{i ∈ T_m} P(p̂_i ≤ p_i − ε/2) + Σ_{j ∈ B} P(p̂_j ≥ p_j + ε/2)
    ≤ |T_m| e^{−2(ε/2)²(2/ε²) ln(2n/δ)} + |B| e^{−2(ε/2)²(2/ε²) ln(2n/δ)}
    ≤ (|T_m| + |B|)(δ/(2n)) ≤ δ.

Since each arm is pulled exactly (2/ε²) ln(2n/δ) times, the sample complexity of Direct is O((n/ε²) log(n/δ)).

INCREMENTAL: Unlike Direct, Incremental (Algorithm 2) proceeds through m rounds. At the beginning of round l, S_l is the set of arms that have been selected, and R_l the set of arms remaining (line 1). During round l, an (ε, 1)-optimal arm in R_l is selected with high probability by invoking the median elimination algorithm (Even-Dar et al., 2006) (line 3). We show that an (ε, 1)-optimal arm in R_l is necessarily (ε, m)-optimal in T_n.

Algorithm 2 Incremental(n, m, ε, δ)
1: S_1 ← {}; R_1 ← T_n.
2: for l = 1 to m do
3:   a* ← Median-Elimination(R_l, ε, δ/m).
4:   S_{l+1} ← S_l ∪ {a*}; R_{l+1} ← R_l − {a*}.
5: end for
6: Return S_{m+1}.

Theorem 2. Incremental(n, m, ε, δ) is (ε, m, δ)-optimal with sample complexity O((mn/ε²) log(m/δ)).

Proof. Since |R_l| = n − l + 1, and l ≤ m, R_l necessarily contains an arm a in T_m. Among the true means of the arms in R_l, let p* be the highest. It follows from (1) and (2) that for any arm a' that is (ε, 1)-optimal with respect to R_l, p_{a'} ≥ p* − ε ≥ p_a − ε ≥ p_m − ε: i.e., a' is (ε, m)-optimal with respect to T_n. On round l, since Median-Elimination (line 3) returns an arm that is not (ε, 1)-optimal in R_l with probability at most δ/m, the probability that Incremental selects an arm that is not (ε, m)-optimal is at most δ. The sample complexity of Incremental is derived from its m calls to Median-Elimination. Since |R_l| ≤ n, each call performs at most O((n/ε²) log(m/δ)) pulls (see Even-Dar et al., 2006, Lemma 12), giving a total sample complexity that is O((mn/ε²) log(m/δ)).
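Before moving on to Halving, the fixed sampling budget behind Direct and Theorem 1 can be written out in a few lines of Python; this is a sketch under the Bernoulli interface above, not an implementation used in the experiments.

    import math

    def direct(bandit, n, m, eps, delta):
        """Direct (Algorithm 1): pull every arm the same fixed number of times,
        then return the m arms with the highest empirical means."""
        t = int(math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta)))  # pulls per arm
        p_hat = [sum(bandit.pull(a) for _ in range(t)) / float(t) for a in range(n)]
        ranked = sorted(range(n), key=lambda a: p_hat[a], reverse=True)
        return ranked[:m]   # ties broken arbitrarily, as in line 4 of Algorithm 1

With rewards in [0, 1] and n = 50, m = 15, ε = 0.1, δ = 0.15, this prescribes roughly 1,300 pulls per arm no matter how well separated the means are; for reward distributions with a wider range, such as those used in Section 5, the per-arm count grows with the squared range, which is why a straightforward implementation needs vastly more pulls there.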
HALVING: While Incremental selects an arm every round, Halving (Algorithm 3) eliminates multiple arms every round based on their inferior empirical averages. From R_l, the set of arms remaining at the beginning of round l, half proceed to round l + 1 (except that m proceed to the last round). Arms are sampled enough times each round (line 5) such that at least m (ε, m)-optimal arms are likely to survive elimination. Specifically, round l is associated with parameters ε_l and δ_l, and we ensure that with probability at least 1 − δ_l the mth highest true mean in R_{l+1} is not lower than the mth highest true mean in R_l by more than ε_l. The sequences (ε_l) and (δ_l) (lines 2 and 9) are designed such that Halving is (ε, m, δ)-optimal with sample complexity O((n/ε²) log(m/δ)).

Algorithm 3 Halving(n, m, ε, δ)
1: R_1 ← T_n.
2: ε_1 ← ε/4; δ_1 ← δ/2.
3: for l = 1 to ⌈log(n/m)⌉ do
4:   for all a ∈ R_l do
5:     Sample arm a (2/ε_l²) ln(3m/δ_l) times; let p̂_a be its average reward.
6:   end for
7:   Find R_l' ⊆ R_l such that |R_l'| = max(⌈|R_l|/2⌉, m), and ∀i ∈ R_l', ∀j ∈ (R_l − R_l'): (p̂_i ≥ p̂_j).
8:   R_{l+1} ← R_l'.
9:   ε_{l+1} ← (3/4)ε_l; δ_{l+1} ← (1/2)δ_l.
10: end for
11: Return R_{⌈log(n/m)⌉+1}.

Theorem 3. Halving(n, m, ε, δ) is (ε, m, δ)-optimal with sample complexity O((n/ε²) log(m/δ)).

Proof. Let us sort the arms in R_l in decreasing order of their true means. Let a^l_i be the ith arm in the sorted list and let p^l_i be its true mean. We say a "mistake" is made in round l if p^l_m − p^{l+1}_m > ε_l. Note that (1) p^1_m = p_m, (2) Σ_{l=1}^{⌈log(n/m)⌉} ε_l < ε, and (3) Σ_{l=1}^{⌈log(n/m)⌉} δ_l < δ. In effect, it suffices to show that the probability of making a mistake in round l is at most δ_l: this would establish that P(p_m − p^{⌈log(n/m)⌉+1}_m > ε) < δ, or in other words, that Halving is (ε, m, δ)-optimal.

We show that for a mistake to occur on round l, at least one of two events, E1 and E2, must occur; however, P(E1) + P(E2 | ¬E1) ≤ δ_l. Let A_l = {a^l_i, i ∈ {1, 2, . . . , m}}: A_l contains the m arms from R_l with the highest true means. E1 denotes the event ∃a ∈ A_l : (p̂_a ≤ p_a − ε_l/2). By applying Hoeffding's inequality and the union bound we obtain:

  P(E1) ≤ m e^{−2(ε_l/2)²(2/ε_l²) ln(3m/δ_l)} ≤ δ_l/3.   (4)

Let B_l = {j ∈ R_l : p_j < p^l_m − ε_l}: B_l is the set of arms that are not (ε_l, m)-optimal in R_l. We call an arm b in B_l "bad" if its empirical average equals or exceeds that of some arm in A_l. If E1 does not occur, note that b can be bad only if p̂_b ≥ p_b + ε_l/2; a similar application of Hoeffding's inequality shows that the probability of the latter event is at most δ_l/(3m). Let the number of bad arms in B_l be #bad; counting bad arms as Boolean results of |B_l| coin tosses, we obtain that E[#bad | ¬E1] is at most |B_l| δ_l/(3m). E2 denotes the event that #bad ≥ |R_{l+1}| − m + 1. A moment's reflection informs us that if E1 does not occur, a mistake can be made on round l only if E2 occurs. Markov's inequality establishes that

  P(E2 | ¬E1) = P(#bad ≥ (|R_{l+1}| − m + 1) | ¬E1)
    ≤ E[#bad | ¬E1] / (|R_{l+1}| − m + 1)
    ≤ (|B_l| / (|R_{l+1}| − m + 1)) (δ_l/(3m))
    ≤ ((|R_l| − m) / (|R_{l+1}| − m + 1)) (δ_l/(3m)) ≤ (2/3) δ_l.   (5)

Arithmetic for the last step follows from the observation that |R_l| ≤ 2|R_{l+1}| (lines 7 and 8). Together, (4) and (5) complete our proof.

In Algorithm 3 the total number of pulls across all rounds (line 5) is Σ_{l=1}^{⌈log(n/m)⌉} |R_l| (2/ε_l²) ln(3m/δ_l). Slight modifications to the steps for bounding a similar sum when m = 1 (see Even-Dar et al., 2006, Lemma 12) establish that it is O((n/ε²) log(m/δ)).

Indeed Halving achieves the lowest sample complexity bound (O((n/ε²) log(m/δ))) among the algorithms presented in this section. At present we are unaware of a tighter lower bound for Explore-m than the one automatically carrying over from Explore-1 (Ω((n/ε²) log(1/δ))) (Mannor & Tsitsiklis, 2004). Thus, proving matching upper and lower bounds for Explore-m is an open problem.
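The round structure of Halving also translates directly into code. The sketch below follows Algorithm 3 as reconstructed above (the ε_l, δ_l schedules and per-round pull counts), again assuming the simple bandit interface from Section 2.

    import math

    def halving(bandit, n, m, eps, delta):
        """Halving (Algorithm 3): repeatedly keep the better half of the surviving arms,
        with per-round tolerances eps_l and failure probabilities delta_l."""
        remaining = list(range(n))
        eps_l, delta_l = eps / 4.0, delta / 2.0
        rounds = int(math.ceil(math.log(float(n) / m, 2)))
        for _ in range(rounds):
            t = int(math.ceil((2.0 / eps_l ** 2) * math.log(3.0 * m / delta_l)))
            p_hat = {a: sum(bandit.pull(a) for _ in range(t)) / float(t) for a in remaining}
            keep = max(int(math.ceil(len(remaining) / 2.0)), m)   # line 7: half survive, at least m
            remaining = sorted(remaining, key=lambda a: p_hat[a], reverse=True)[:keep]
            eps_l, delta_l = 0.75 * eps_l, 0.5 * delta_l          # line 9
        return remaining

Each round spends more pulls per surviving arm (ε_l shrinks) but on roughly half as many arms, and the product shrinks geometrically across rounds, which is what keeps the total at O((n/ε²) log(m/δ)).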
4. Adaptive Bounds in Practice

Which instances of Explore-m are easy and which ones are difficult? It seems intuitive that when the top m and bottom n − m arms are separated by a relatively large margin, or when arm distributions have low variances, fewer samples would be needed to reliably identify m (ε, m)-optimal arms. However, for given n, m, ε, and δ, note that any of the algorithms presented in Section 3 performs the same number of pulls, regardless of the arm distributions. These algorithms are designed to be sufficient for achieving a PAC guarantee in the worst case, when differences between the arms' true means are small. In this section we focus on improving sample efficiency in practice by adapting to the spacing between arm means and their variances, guided by their empirical returns. Starting with a conceptually simple "Uniform" algorithm that samples arms equally often, we consider progressive improvements to conserve samples in practice while retaining the PAC guarantee. Experiments in Section 5 show that the sequential procedure thus developed achieves significant gains in sample complexity over existing sampling methods.

Recall that numbers 1, 2, . . . , n index the true means of the arms in decreasing order. In the absence of knowledge about these indices, let us index the arms (1), (2), . . . , (n) after each pull such that their empirical means are in non-increasing order, i.e., p̂_(1) ≥ p̂_(2) ≥ . . . ≥ p̂_(n). Let us separate the highest m arms into a set High = {(1), (2), . . . , (m)} and leave the rest in the set Low = {(m+1), (m+2), . . . , (n)}. With each arm h in High let us associate numbers δ_h and LB_h such that by Hoeffding's inequality, with probability at least 1 − δ_h, p_h ≥ LB_h. Likewise for each arm l in Low we ensure that with probability at least 1 − δ_l, UB_l ≥ p_l. Here LB_(i) and UB_(i) are respectively lower and upper bounds on the true mean of (i). If we return High as our answer, we can meet the PAC requirements by ensuring that (a) Σ_{i=1}^{n} δ_(i) ≤ δ, and (b) ∀h ∈ High, ∀l ∈ Low : (LB_h + ε ≥ UB_l). If we adopt the convention of adding ε to the sample means (and lower bounds) of arms in High, criterion (b) essentially states that lower bounds of arms in High and upper bounds of arms in Low must not collide.[2] The Direct algorithm in Section 3 precisely achieves criteria (a) and (b) by setting δ_i = δ/n and ensuring that the widths of its bounds are smaller than ε/2. For concreteness, we consider a randomized implementation of Direct, which we denote Uniform (and equivalently, as policy π1).

π1: Among arms with bound widths greater than ε/2 when δ_(i) = δ/n, pull an arm uniformly randomly.

[2] We say an arm in High collides if its lower bound is lower than the upper bound of some arm in Low; collision is defined similarly for arms in Low.

Figure 1 illustrates the evolution of our sampling policies on an instance of Explore-m with n = 5, m = 2, ε = 0.1, δ = 0.05. Figure 1(a) shows a schematic description of the various arms and their bounds when π1 terminates. By being no larger than ε/2, the widths of the bounds are sufficient for a PAC guarantee. The question that arises in immediate response is: what is necessary of the bounds in order to uphold a PAC guarantee? Indeed we observe that as long as arms from High and Low do not collide, the PAC constraint would still be met. Correspondingly we update our policy to only select from among arms that are currently colliding; we denote the resulting policy π2.

π2: Among arms that collide when δ_(i) = δ/n, pull an arm uniformly randomly.

By only picking arms that collide, π2 conserves samples in comparison to π1.
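To make the bound bookkeeping concrete, here is an illustrative Python helper for the Hoeffding-based bounds and the collision test that π1 and π2 rely on; the function names and the per-arm statistics layout are ours, not part of any library.

    import math

    def hoeffding_width(t, d):
        """Half-width w such that, by the one-sided Hoeffding inequality,
        the true mean stays within w of the empirical mean (on the relevant side)
        with probability at least 1 - d after t pulls."""
        return math.sqrt(math.log(1.0 / d) / (2.0 * t))

    def colliding(p_hat, counts, high, low, eps, delta, n):
        """Return the arms whose bounds still collide when every arm gets d_i = delta / n.
        high: indices of the m arms with the largest empirical means; low: the rest."""
        d = delta / float(n)
        lb = {h: p_hat[h] + eps - hoeffding_width(counts[h], d) for h in high}  # +eps convention for High
        ub = {l: p_hat[l] + hoeffding_width(counts[l], d) for l in low}
        bad_high = {h for h in high if any(lb[h] < ub[l] for l in low)}
        bad_low = {l for l in low if any(ub[l] > lb[h] for h in high)}
        return bad_high | bad_low

π2 pulls uniformly at random from the set returned by colliding and terminates once it is empty; π1 ignores the collision information and keeps pulling any arm whose bound width still exceeds ε/2.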
Specifically we pick a "cutoff" c such that for every arm h in High, LBh = c, and for every arm l in Low, U Bl = c (Figure 1(c)). For c each arm (i) we set (i) = (i) such that its relevant upper or lower bound coincides with c. The quantity c (i) is obtained by inverting Hoeffding's inequality, and bounds the probability that the mean of arm (i) vioc lates the cutoff c. (i) effectively signifies how hard it is to make arm (i) collision free eventually: arms c with high (i) are likely to need more samples. We translate this intuition into policy 3 , which samples c arm (i) with a probability proportional to (i) . Empirical results (Section 5) confirm that 3 achieves a Efficient Selection of Multiple Bandit Arms: Theory and Practice p High p p + 0.1 (1) + 0.1 (2) (3) (4) 0.01, 0.05 0.01, 0.15 0.01, 0.05 0.01, 0.08 c 0.01, 0.05 0.01, 0.05 0.01, 0.05 0.01, 0.1 0.001, 0.1 0.001, 0.1 0.01, 0.06 0.01, 0.09 0.021, 0.06 0.005, 0.09 0.001, 0.15 0.022, 0.08 c 0.021, 0.06 0.005, 0.09 0.001, 0.15 0.022, 0.08 Low p p (5) R Hoeffding's bound R R R tight bound (a) 1 (Uniform) (b) 2 (c) 3 (d) 4 (Adapt) Figure 1. Illustrative example of Explore-m (n = 5, m = 2, = 0.1, and = 0.05) showing progressively efficient sampling policies. The figure shows examples of terminal conditions when policies 1 through 4 are employed. Beside each bound is shown "(i) , width(i) ": if arm (i) is in High, width(i) = (^(i) + ) - (LB(i) + ), and if it is in Low, p width(i) = U B(i) - p(i) . The depicted values of (i) and width(i) are not precise or drawn to scale; they are illustrative. ^ significant reduction in sample complexity over 2 . 3 : Pull arm (i) with a probability proportional to c (i) . larly effective when some arms have relatively small variances. 4 : Implement 3 using the tightest applicable bounds to compute LB(i) and U B(i) . c Just as (i) indicates arm (i)'s relative need for samn c ples, the aggregate bound c = 1 - i=1 (1 - (i) ) summarizes the progress made by our sampling policy. For a given value of , after every pull we can compute c as a bound on the probability that our current answer is incorrect; we can stop if c . Notice that inc deed we can compute (i) and c at every stage for any sampling policy, including 1 and 2 . Naturally the choice of the cutoff c affects c . Since it functions as a lower bound for arms in High and as an upper bound for arms in Low, c must lie in (m) the interval I(m+1) = (^(m+1) , p(m) + ), as in Figp ^ c ure 1(c). Indeed if each (i) is derived using Hoeffding's inequality, we find that it is possible to efficiently compute c = argmincI (m) c . This computation relies (m+1) We refer to our final policy, 4 , as "Adapt". Although Adapt is extremely sample efficient in practice (as we see in the following section), and is indeed a PAC algorithm, it does not have a provably bounded worst case sample complexity. However, a worst case bound can be easily enforced by overlaying Adapt with a rule not to sample any arm more times than the Direct algorithm would. With more careful monitoring through rounds in our Halving algorithm, we could restrict Adapt to a worst case sample complexity of O( n log( m )). In our experiments we do not find such a 2 need, as the PAC guarantee is typically realized early. 5. Experimental Results In this section we compare various policies for Explore-m on a test instance with m = 50, n = 15, = 0.1, and = 0.15. 
5. Experimental Results

In this section we compare various policies for Explore-m on a test instance with n = 50, m = 15, ε = 0.1, and δ = 0.15. The arms all have their true means drawn uniformly from the interval [0, 1]; each arm implements a uniform distribution with a standard deviation of 1 (a width of √12), which is relatively large compared to the spacing between the means. We report statistics on this instance of Explore-m based on averages of at least 1000 independent trials. In our comparisons the Adapt algorithm (π4) predominantly achieves the best results with the fewest samples.

ADAPT: Figure 2 summarizes statistics comparing policies π1 through π4. Figure 2(a) plots the value of δ^ĉ against the number of pulls. The order among the policies is consistent with our expectation: notice that only π4 is able to achieve the desired threshold of δ^ĉ = 0.15 within the 30,000 samples plotted. While π4 takes roughly 22,000 pulls to deliver the PAC guarantee, straightforward implementations of Direct (Algorithm 1) and Halving (Algorithm 3) would take millions of pulls at the same ε and δ values.

[Figure 2 appears here: panels (a) through (c) with curves for Uniform/π1, π2, π3, and Adapt/π4.]
Figure 2. Comparison of policies π1 through π4 on the test instance (n = 50, m = 15, ε = 0.1, δ = 0.15). For each policy, (a) shows δ^ĉ and (b) shows NSS as a function of the number of pulls. In (c) arms 1, 2, . . . , 50 are in decreasing order of their true means, and a histogram shows the number of times each arm has been sampled. Notice peaks around 15 (m) and 16 (m + 1) under policies π3 and π4.

While δ^ĉ determines the PAC guarantee for choosing m (ε, m)-optimal arms, we evaluate the actual set of arms that would be returned at any stage based on the sum of their true means, i.e., Σ_{i=1}^{m} p_(i). We normalize this "sum of selected means" such that it lies in [0, 1]. We denote the normalized sum NSS: note that the set T_m will have NSS = 1. Figure 2(b), which plots NSS against samples, shows a clear gulf between π3 and π4 on the one hand, and π1 and π2 on the other. Even after 30,000 pulls, this gap is significant (p < 0.0001).

How do policies π1 through π4 ration samples among the arms? Figure 2(c) shows a histogram plotting the number of pulls as arms are sorted in decreasing order of their true means. As we proceed from π1 to π4 a growing bias towards "contentious" arms (on either side of m and m + 1) is apparent.
Recall that the arms have equal variances; this leads to a fairly symmetric pattern about m and m + 1 for policies π3 and π4.

OCBA-m: Chen et al. (2008) derive a sampling method similar to Adapt by adopting a Bayesian perspective. Under the "Optimal Computing Budget Allocation" (OCBA-m) framework, they model the true means of the arms with non-informative prior distributions, which get refined as samples are collected. First, each arm is sampled a fixed number (n0) of times to obtain an estimate of its variance (see Chen et al., 2008, pp. 584–585). Subsequently arm (i) is allotted samples in proportion to s²_(i)/(p̂_(i) − ĉ)², where s_(i) is the sample standard deviation of arm (i), and ĉ is the cutoff described in Section 4. Sampling is performed in batches of a fixed size: only arms that have not already been sampled their allotted number are sampled. Consistent with our fully sequential approach in Adapt, we set the batch size to 1 in our implementation of OCBA-m, picking arms probabilistically in proportion to their allotments.

Clearly Adapt and OCBA-m are alike in spirit, focusing their attention on arms with higher variances and those close to the cutoff. However, while Adapt provides PAC bounds for subset selection, the Bayesian formulation under OCBA-m forces the use of approximations that are only valid in the asymptotic case. Indeed we find in our experiments that the performance of OCBA-m depends crucially on n0, the number of rounds of uniform sampling. Chen et al. (2008) recommend setting n0 ≥ 5, and use n0 = 20 in their experiments; however, for such low values we notice premature plateauing on our test instance (Figure 3(a)). For higher values of n0 (100, 500), OCBA-m coincides with Uniform until a large number of pulls, and then outperforms it for a period when it switches to intelligent sampling. We postulate that ideally the parameter n0 must adapt to the relative spacing between arms and their variances, which it is not always possible to perceive a priori.

In informal experiments, noting that OCBA-m does not have a tolerance parameter like ε, we compare the methods by setting ε = 0 (PAC bounds are still possible under Adapt as long as the true means are separated). Additionally we set Gaussian distributions for the bandit arms with a standard deviation of 1 (for its bounds Adapt assumes that the support of this distribution is limited to six standard deviations). Although we do not report extensive results from these variations here, in these experiments Adapt still outperforms OCBA-m, for all the n0 values reported in Figure 3(a), after 30,000 samples (p < 0.0001).
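For comparison with Adapt's δ^c-proportional rule, the following sketch shows the sequential arm-selection step of our OCBA-m implementation as reconstructed above (allotments proportional to s²_(i)/(p̂_(i) − ĉ)²); the helper name and normalization are ours.

    import random

    def ocba_m_pick(p_hat, s, counts, c_hat, n0):
        """Pick the next arm under a fully sequential OCBA-m-style rule: arms still in their
        n0 uniform-sampling phase are served first; afterwards an arm is chosen with
        probability proportional to s_i^2 / (p_hat_i - c_hat)^2."""
        cold = [a for a in range(len(p_hat)) if counts[a] < n0]
        if cold:
            return random.choice(cold)            # initial pulls to estimate variances
        weights = [s[a] ** 2 / max((p_hat[a] - c_hat) ** 2, 1e-12) for a in range(len(p_hat))]
        r, acc = random.uniform(0.0, sum(weights)), 0.0
        for a, w in enumerate(weights):
            acc += w
            if r <= acc:
                return a
        return len(p_hat) - 1                      # numerical fallback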
RACING: In recent work Heidrich-Meisner & Igel (2009) extend the idea of racing algorithms (Maron & Moore, 1997) to the subset selection problem. Their algorithm ("Racing") proceeds for a predetermined number of rounds R. In each round arms are selected if with sufficient confidence they have higher means than n − m others; they are discarded if they have lower means than m others. Such elimination is meant to progressively focus sampling on contentious arms.

[Figure 3 appears here: three panels of NSS curves and a histogram of pulls per arm.]
Figure 3. (a) Comparisons with OCBA-m on our test instance. The plot shows NSS against the number of samples. Under OCBA-m, curves are shown for multiple values of n0. (b) Similar curves for Racing, with multiple values of R (only R = 1 is marked). (c) A histogram showing the number of samples allocated to the bandit arms by the algorithms.

Like Adapt, Racing can also provide a PAC guarantee of choosing the best subset (there is no tolerance parameter ε). However, the guarantee can only be provided if m arms have indeed been selected at the end of R rounds. Herein arises a dilemma: setting R to a high value could provide more rounds for sampling; however, the permissible mistake probabilities while eliminating arms are (roughly) inversely proportional to R. Racing has to enforce conservative mistake probabilities because elimination is an irrevocable decision, and further, bounds on the true means of arms are not allowed to expand from round to round.

On our test instance, we are unable to select a subset of size m within the specified number of rounds R despite trying multiple values (between 1 and 500). In response, we continue to run the algorithm (see Heidrich-Meisner & Igel, 2009, p. 404) beyond R rounds, using the confidence probabilities prescribed for round R. Not surprisingly, the best NSS results are achieved when we set R = 1 (Figure 3(b)), which provides the largest scope for eliminating arms. As R is increased (shown, but not marked in the figure), Racing progressively tends towards the Uniform algorithm, since fewer and fewer arms get eliminated. We posit that Racing is likely to perform better on smaller problem instances, a hypothesis that is confirmed when we set n = 10, m = 3, ε = 0, and δ = 0.15. Indeed Racing (R = 1) achieves a higher NSS value after 30,000 pulls compared to Uniform, and is not distinguishable from Adapt (p < 0.05). By and large, we observe that the superior sample efficiency of Adapt is most pronounced when the number of arms n is large, and the arms have relatively high variances.

Clear differences in the sampling strategies of Adapt, OCBA-m, and Racing are visible in Figure 3(c), which plots histograms of their pulls. OCBA-m and Adapt are both more peaked than Racing. However, we notice that Adapt has a gradual slope on either side of its peak, while OCBA-m rises more dramatically near the separating boundary.

6. Related Work and Discussion

In the previous sections we have already encountered work that is very closely related to our contribution; we conclude by highlighting some further connections. Our exploration strategy in the multi-armed bandit attempts to minimize the number of samples to achieve a prescribed level of confidence about the selected subset. By contrast, alternative formulations for exploration in bandits consider strategies to maximize the chances of identifying the best arm when provided a fixed budget of samples (Madani & Lizotte, 2004). Even-Dar et al. (2006) extend their analysis of Explore-1 to develop exploration strategies for MDPs, resulting in reinforcement learning algorithms with similar PAC guarantees. In general an optimal policy for an MDP only relies on a single optimal action for every state, and so it appears unlikely that our generalization to Explore-m will have a direct bearing on their problem. However, any practical implementation of their algorithm could benefit from using Adapt, which applies equally to the case when m = 1. Subset selection has been widely studied under varying assumptions.
In early work Koenig & Law (1985) provide a two-stage sampling procedure for selecting a subset of size m containing the l best of k independent normal populations. Kim & Nelson (2001) formalize the notion of an "indifference zone" in subset selection, which corresponds closely to the ε parameter in Explore-m. In their survey of empirical methods for choosing the single best candidate from a population, Inoue et al. (1999) suggest that Bayesian approaches, which address the average case, could be more beneficial in practice than worst-case formulations. Through Adapt we contribute an algorithm that effectively manipulates bounds to achieve efficiency in practice while preserving a PAC guarantee.

Efficient subset selection is a primary subroutine in several evolutionary algorithms (Schmidt et al., 2006) and stochastic optimization methods (de Boer et al., 2005). We expect that Adapt, which is straightforward to implement, can easily be integrated into existing code bases to improve sample efficiency. However, it must be noted that while Explore-m specifically implements subset selection, in general evolutionary algorithms could employ other selection schemes, such as tournament and proportionate selection (Miller & Goldberg, 1996). Also, some evolutionary algorithms seek to maintain good on-line performance (Whiteson & Stone, 2006), which the "pure exploration" nature of Explore-m does not match.

The tolerance parameter ε in Explore-m can naturally control the tradeoff between the quality of selections on each iteration of an evolutionary algorithm and the algorithm's overall sample efficiency. Indeed ε is the only input parameter of Adapt, which seeks to reduce its mistake probability δ^ĉ with each sample. Note that OCBA-m also provides a probabilistic measure similar to δ^ĉ to gauge the algorithm's progress; in contrast, Racing only preserves its PAC guarantee for a fixed number of rounds. As our experiments show, the effects of the input parameters to Racing (R) and OCBA-m (n0) are not easy to intuit. We believe that the superior empirical performance of Adapt, supported by its ease of use and principled theoretical grounding, make it a promising method for the fundamental problem of subset selection.
Acknowledgments

We thank Todd Hester, Tobias Jung, Juhyun Lee, Shimon Whiteson, and the anonymous reviewers of this paper for their useful suggestions. This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (CNS-0615104 and IIS-0917122), ONR (N00014-09-10658), DARPA (FA8650-08-C-7812), and the Federal Highway Administration (DTFH61-07-H-00030).

References

Berry, Donald A. and Fristedt, Bert. Bandit problems. Chapman and Hall Ltd., 1985.

Chen, Chun-Hung, He, Donghai, Fu, Michael, and Lee, Loo Hay. Efficient simulation budget allocation for selecting an optimal subset. INFORMS Journal on Computing, 20(4):579–595, 2008.

de Boer, Pieter-Tjerk, Kroese, Dirk P., Mannor, Shie, and Rubinstein, Reuven Y. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.

Even-Dar, Eyal, Mannor, Shie, and Mansour, Yishay. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Heidrich-Meisner, Verena and Igel, Christian. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proc. ICML 2009, pp. 401–408. ACM, 2009.

Inoue, Koichiro, Chick, Stephen E., and Chen, Chun-Hung. An empirical evaluation of several methods to select the best system. ACM Transactions on Modeling and Computer Simulation, 9(4):381–407, 1999.

Kim, Seong-Hee and Nelson, Barry L. A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11(3):251–273, 2001.

Koenig, Lloyd W. and Law, Averill M. A procedure for selecting a subset of size m containing the l best of k independent normal populations, with applications to simulation. Communications in Statistics - Simulation and Computation, 14(3):719–734, 1985.

Madani, Omid, Lizotte, Daniel J., and Greiner, Russell. Active model selection. In Proc. UAI 2004, pp. 357–365. AUAI Press, 2004.

Mannor, Shie and Tsitsiklis, John N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

Maron, Oded and Moore, Andrew W. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1-5):193–225, 1997.

Miller, Brad L. and Goldberg, David E. Genetic algorithms, selection schemes, and the varying effects of noise. Evolutionary Computation, 4(2):113–131, 1996.

Mnih, Volodymyr, Szepesvári, Csaba, and Audibert, Jean-Yves. Empirical Bernstein stopping. In Proc. ICML 2008, pp. 672–679. ACM, 2008.

Schmidt, Christian, Branke, Jürgen, and Chick, Stephen E. Integrating techniques from statistical ranking into evolutionary algorithms. In Applications of Evolutionary Computing, volume 3907 of LNCS, pp. 752–763. Springer, 2006.

Whiteson, Shimon and Stone, Peter. On-line evolutionary computation for reinforcement learning in stochastic domains. In Proc. GECCO 2006, pp. 1577–1584. ACM, 2006.