Hoeffding and Bernstein Races for Selecting Policies in Evolutionary Direct Policy Search

Verena Heidrich-Meisner  verena.heidrich-meisner@neuroinformatik.rub.de
Christian Igel  christian.igel@neuroinformatik.rub.de
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Abstract

Uncertainty arises in reinforcement learning from various sources, and therefore it is necessary to consider statistics based on several roll-outs for evaluating behavioral policies. We add an adaptive uncertainty handling based on Hoeffding and empirical Bernstein races to the CMA-ES, a variable metric evolution strategy proposed for direct policy search. The uncertainty handling adjusts individually the number of episodes considered for the evaluation of a policy. The performance estimation is kept just accurate enough for a sufficiently good ranking of candidate policies, which is in turn sufficient for the CMA-ES to find better solutions. This increases the learning speed as well as the robustness of the algorithm.

1. Introduction

Dealing with uncertainty is one of the major issues in reinforcement learning (RL). When solving (partially observable) Markov decision processes solely based on observations and interactions with the environment, uncertainty and randomness arise from several sources. The initial state usually varies, state transitions and reward signals can be stochastic, and the state observations may be noisy.

We consider RL methods that search in a parametrized policy space, in which the search direction can be determined using estimates of the performance of behavioral policies or estimates of performance gradients. Uncertainty and randomness require that these estimates are based on a sample of several episodes (roll-outs). The sample size is a crucial parameter. If too few episodes are considered, the estimates are not reliable enough to allow for learning. If too many episodes are considered, the learning process gets too slow. Unfortunately, it is usually not possible to determine an appropriate sample size for a given problem a priori (in practice we just make it "large enough"). The optimal number may vary in the course of learning and between candidate policies (e.g., really bad behaviors can be detected after just a few roll-outs).

We employ the covariance matrix evolution strategy (CMA-ES, Hansen et al., 2003; Suttorp et al., 2009) for direct policy search, which gives striking results on RL benchmark problems (Gomez et al., 2008; Heidrich-Meisner & Igel, 2008). The CMA-ES adapts the policy as well as parameters of its own search strategy (such as a variable metric) based on ranking policies. This is already much less susceptible to noise than estimating absolute performances or performance gradients (Heidrich-Meisner & Igel, 2008). Still, the ranking must be sufficiently accurate to evolve better policies, and the accuracy of the ranking depends on the degree of uncertainty as well as on the number of roll-outs considered per performance estimation of each candidate solution.

We propose to augment the CMA-ES for RL with an adaptive uncertainty handling scheme based on Hoeffding or empirical Bernstein races (Maron & Moore, 1994; Maron & Moore, 1997; Audibert et al., 2007; Mnih et al., 2008), which dynamically adapts the number of episodes for evaluating a policy such that the ranking of new candidate solutions is just reliable enough to drive the learning process. All individuals participate in a selection race, in which their performances are sampled. A policy remains in the race as long as one cannot be sure with an a priori fixed probability whether the policy is promising or not, i.e., as long as the confidence interval for the estimated performance based on Hoeffding or empirical Bernstein bounds does not clearly distinguish it from the other candidates. The selection race is finished when a complete and accurate ranking is obtained.
In the next section, we describe the CMA-ES for direct policy search. In section 3 we introduce our new uncertainty handling scheme based on racing algorithms. Then we present some empirical results before we end with our conclusions.

2. Evolutionary direct policy search

Evolution strategies (ESs) are random search methods, which iteratively sample a set of candidate solutions from a probability distribution over the search space (here a parametrized space of policies), evaluate these potential solutions, and construct a new probability distribution over the search space based on the gathered information (Beyer, 2007). In ESs, this search distribution is parametrized by a set of candidate solutions, the parent population with size µ, and by parameters of the variation operators that are used to create new candidate solutions (the offspring population with size λ) from the parent population.

Arguably the most elaborate ES for real-valued optimization is the covariance matrix adaptation ES (CMA-ES, Hansen et al., 2003; Suttorp et al., 2009), in which the main variation operator is additive Gaussian perturbation. The CMA-ES is a variable metric algorithm adapting shape and strength of its search distribution, and it is regarded as one of the most efficient evolutionary algorithms for real-valued optimization (Beyer, 2007). The CMA-ES is robust in the sense that it does not rely on tweaking of hyperparameters, see below. In a recent study, the CMA-ES (without rank-µ update, i.e., an outdated version) was compared to 8–12 (depending on the task) other RL algorithms including value-function and policy gradient approaches (Gomez et al., 2008). On the four test problems where the CMA-ES was considered, it ranked first, second (twice), and third. For further examples of successful applications of the CMA-ES for RL (Pellecchia et al., 2005; Siebel & Sommer, 2007) and additional comparisons on RL benchmarks (Heidrich-Meisner & Igel, 2008; 2009) we refer to the literature.

Algorithm 1: rank-µ CMA-ES

    initialize m^(0) = x_init, σ^(0), evolution paths p_σ^(0) = p_c^(0) = 0, covariance matrix C^(0) = I (unity matrix), and t_limit^(0) = 3
    // k counts the number of generations
    for k = 0, 1, 2, ... do
        // create new offspring
        for l = 1, ..., λ do
            x_l^(k+1) ~ N(m^(k), σ^(k)² C^(k))
        // evaluate new offspring, see section 3
        // X̂_l^(k+1) reflects the quality of x_l^(k+1), l = 1, ..., λ
        ({X̂_1^(k+1), ..., X̂_λ^(k+1)}, t_limit^(k+1)) ← selectRace({x_1^(k+1), ..., x_λ^(k+1)}, µ, a, b, t_limit^(k), δ)
        // recombination and selection
        m^(k+1) ← Σ_{i=1}^{µ} w_i x_{i:λ}^(k+1)
        // step size control
        p_σ^(k+1) ← (1 − c_σ) p_σ^(k) + √(c_σ(2 − c_σ)µ_eff) C^(k)^(−1/2) (m^(k+1) − m^(k)) / σ^(k)
        σ^(k+1) ← σ^(k) exp( (c_σ/d_σ) ( ‖p_σ^(k+1)‖ / E[‖N(0, I)‖] − 1 ) )
        // covariance matrix update
        p_c^(k+1) ← (1 − c_c) p_c^(k) + √(c_c(2 − c_c)µ_eff) (m^(k+1) − m^(k)) / σ^(k)
        C^(k+1) ← (1 − c_cov) C^(k) + (c_cov/µ_cov) p_c^(k+1) p_c^(k+1)^T + c_cov (1 − 1/µ_cov) Σ_{i=1}^{µ} w_i z_{i:λ}^(k) z_{i:λ}^(k)^T
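To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of one generation with race-based selection. The function and variable names (cma_generation, select_race, evaluate) are illustrative assumptions, the signature of select_race is simplified, and the strategy-parameter updates are left as a stub because they are detailed in the following paragraphs.

```python
import numpy as np

def cma_generation(mean, sigma, C, evaluate, lam, mu, select_race, t_limit):
    """One CMA-ES generation with race-based selection (simplified sketch)."""
    n = mean.shape[0]
    # recombination weights w_l proportional to ln(mu+1) - ln(l), normalized to sum 1
    weights = np.log(mu + 1) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()

    # sample lambda offspring from N(mean, sigma^2 C)
    z = np.random.multivariate_normal(np.zeros(n), C, size=lam)
    offspring = mean + sigma * z

    # evaluate just accurately enough to rank the offspring (selectRace, section 3);
    # the signature used here is a simplification of the procedure in the paper
    mean_returns, t_limit = select_race(offspring, mu, evaluate, t_limit)

    # non-elitist, rank-based selection: recombine the mu best offspring
    order = np.argsort(mean_returns)[::-1]          # returns are maximized
    new_mean = weights @ offspring[order[:mu]]

    # the step-size and covariance matrix updates described below would follow here
    return new_mean, sigma, C, t_limit
```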
In each generation k of the CMA-ES, which is shown in Algorithm 1, the l-th offspring x_l^(k+1) ∈ R^n, l ∈ {1, ..., λ}, is generated by additive multi-variate Gaussian mutation and weighted global intermediate recombination, i.e., x_l^(k+1) = m^(k) + σ^(k) z_l^(k) with mutation σ^(k) z_l^(k), where z_l^(k) ~ N(0, C^(k)), and recombination m^(k) = Σ_{l=1}^{µ} w_l x_{l:λ}^(k). Here x_{l:λ} denotes the l-th best individual of the offspring. Ranking the candidates (policies) requires their evaluation, which is discussed in detail in section 3. Considering only the best µ of the λ offspring in the recombination implements non-elitist, rank-based selection. A common choice for the recombination weights, also used here, is w_l ∝ ln(µ + 1) − ln(l) with ‖w‖_1 = 1, w ∈ R^µ.

The CMA-ES adapts both the n-dimensional covariance matrix C^(k) of the normal mutation distribution and the global step size σ^(k) ∈ R^+. The covariance matrix update has two parts, the rank-1 update considering the change of the population mean over time and the rank-µ update considering the successful variations in the last generation. The rank-1 update is based on a low-pass filtered evolution path p_c^(k) of successful (i.e., selected) steps,

    p_c^(k+1) ← (1 − c_c) p_c^(k) + √(c_c(2 − c_c)µ_eff) (m^(k+1) − m^(k)) / σ^(k),

and aims at changing C^(k) to make steps in the promising direction p_c^(k+1) more likely by morphing the covariance towards p_c^(k+1) p_c^(k+1)^T. The backward time horizon of the cumulation process is approximately c_c^(−1), where c_c = 4/(n + 4) is roughly inversely linear in the dimension of the path vector. The variance effective selection mass µ_eff = (Σ_{l=1}^{µ} w_l²)^(−1) is a normalization constant. The rank-µ update aims at making the single steps that were selected in the last iteration more likely by morphing C^(k) towards Σ_{i=1}^{µ} w_i z_{i:λ}^(k) z_{i:λ}^(k)^T. Putting both updates together, we have

    C^(k+1) ← (1 − c_cov) C^(k) + (c_cov/µ_cov) p_c^(k+1) p_c^(k+1)^T + c_cov (1 − 1/µ_cov) Σ_{i=1}^{µ} w_i z_{i:λ}^(k) z_{i:λ}^(k)^T.
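As a hedged illustration of the combined update just stated, the following sketch computes the rank-1 and rank-µ terms explicitly. The function and argument names are mine; the learning rates c_c, c_cov, and µ_cov are the constants discussed in the next paragraph.

```python
import numpy as np

def update_covariance(C, p_c, mean_new, mean_old, sigma, z_sel, weights,
                      c_c, c_cov, mu_cov, mu_eff):
    """Rank-1 plus rank-mu covariance update; z_sel holds the selected steps z_{i:lambda}."""
    # low-pass filtered evolution path of selected steps (rank-1 part)
    p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * (mean_new - mean_old) / sigma

    rank_one = np.outer(p_c, p_c)
    rank_mu = sum(w * np.outer(z, z) for w, z in zip(weights, z_sel))

    C_new = ((1 - c_cov) * C
             + (c_cov / mu_cov) * rank_one
             + c_cov * (1 - 1 / mu_cov) * rank_mu)
    return C_new, p_c
```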
The constants c_cov and µ_cov are fixed learning rates. The learning rate of the covariance matrix update, c_cov = 2/(n + √2)², is roughly inversely proportional to the degrees of freedom of the covariance matrix. The parameter µ_cov mediates between the rank-µ update (µ_cov → ∞) and the rank-one update (µ_cov = 1). The default value is µ_cov = µ_eff.

The global step size σ^(k) is adapted on a faster timescale. It is increased if the selected steps are larger and/or more correlated than expected and decreased if they are smaller and/or more anticorrelated than expected:

    σ^(k+1) ← σ^(k) exp( (c_σ/d_σ) ( ‖p_σ^(k+1)‖ / E[‖N(0, I)‖] − 1 ) ),

where the (conjugate) evolution path is

    p_σ^(k+1) ← (1 − c_σ) p_σ^(k) + √(c_σ(2 − c_σ)µ_eff) C^(k)^(−1/2) (m^(k+1) − m^(k)) / σ^(k).

Again, c_σ = (µ_eff + 2)/(n + µ_eff + 3) is a fixed learning rate and d_σ = 1 + 2 max(0, √((µ_eff − 1)/(n + 1)) − 1) + c_σ is a damping factor. The matrix C^(−1/2) is defined as B D^(−1) B^T, where B D² B^T is an eigendecomposition of C (B is an orthogonal matrix of eigenvectors of C and D a diagonal matrix containing the square roots of the corresponding eigenvalues); sampling N(0, C) is done by sampling B D N(0, I).

The values of the learning rates and the damping factor are well considered and have been validated by experiments on many basic test functions (Hansen et al., 2003). They need not be adjusted dependent on the problem and are therefore no hyperparameters of the algorithm. Also the population sizes can be set to default values, which are λ = max(4 + ⌊3 ln n⌋, 5) and µ = ⌊λ/2⌋ for offspring and parent population, respectively. If we fix C^(0) = I, the only hyperparameter to be chosen problem dependent is the initial global step size σ^(0) (apart from the parameters controlling sample size and uncertainty handling, see section 3).

3. Selection races

Due to uncertainty and randomness, the performance estimates, which are required for selecting policies, should be based on samples of several episodes (roll-outs). Our goal is to ensure with a given confidence that the µ selected policies have indeed the best mean performances while using as few roll-outs as possible. To this end, we want to systematically control in each iteration of our direct policy search (i) the overall number of roll-outs, and (ii) the distribution of roll-outs among the candidate policies. This can be achieved by adopting racing algorithms (Maron & Moore, 1994; Maron & Moore, 1997) for selection.

We view the performance of the policy encoded by x_i, i ∈ {1, ..., λ}, as a real-valued random variable X_i. The return (accumulated reward) X_{i,t} of the t-th episode corresponds to a realization of X_i. We assume that the returns are almost surely bounded with known bounds a and b, Pr(X_{i,t} ∈ [a, b]) = 1, and define R = |a − b|. Further, we assume an a priori fixed confidence level 1 − δ and a maximum number of evaluations per individual t_limit (alternatively, we may fix a maximum number of evaluations per iteration). If we consider the empirical mean X̂_{i,t} = (1/t) Σ_{t'=1}^{t} X_{i,t'} of t realizations (i.e., a performance estimate based on t episodes), Hoeffding's inequality bounds the deviation of the empirical mean from the true expectation E[X_i]. With probability of at least 1 − δ we have

    |X̂_{i,t} − E[X_i]| ≤ R √( log(2/δ) / (2t) ).

This rather loose bound may be improved by applying Bernstein's inequality instead, which requires the usually unknown standard deviation of X_i. However, the recently derived empirical Bernstein bound (Audibert et al., 2007; Mnih et al., 2008) depends only on the empirical standard deviation σ̂_{i,t} = √( (1/t) Σ_{t'=1}^{t} (X_{i,t'} − X̂_{i,t})² ). With probability of at least 1 − δ it holds

    |X̂_{i,t} − E[X_i]| ≤ σ̂_{i,t} √( 2 log(3/δ) / t ) + 3R log(3/δ) / t.

The dependency on R decreases linearly with the sample size t and not just with its square root, but now the bound depends also on σ̂_{i,t}/√t. If the standard deviation is small compared to R, this bound is tighter than the Hoeffding bound.

Racing algorithms are iterative methods trying to identify with a high probability the best among several options. In each iteration, a couple of options are sampled. In their standard formulation, which we adopt here, there is an upper bound on the number of iterations.
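A minimal sketch of the two confidence radii stated above, assuming returns that are almost surely in [a, b] with R = |a − b|; replacing δ by δ/n_b, as done in the selection race described next, turns them into the race bounds.

```python
import numpy as np

def hoeffding_radius(t, R, delta):
    """Half-width of the Hoeffding confidence interval after t roll-outs."""
    return R * np.sqrt(np.log(2.0 / delta) / (2.0 * t))

def bernstein_radius(returns, R, delta):
    """Half-width of the empirical Bernstein confidence interval."""
    t = len(returns)
    sigma_hat = np.std(returns)   # empirical standard deviation with 1/t normalization
    return (sigma_hat * np.sqrt(2.0 * np.log(3.0 / delta) / t)
            + 3.0 * R * np.log(3.0 / delta) / t)
```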
Procedure selectRace({x_1, ..., x_λ}, µ, a, b, t_limit^(k), δ):

     1  S ← ∅                                 // set of selected individuals
     2  D ← ∅                                 // set of discarded individuals
     3  U ← {x_i | i = 1, ..., λ}             // set of undecided individuals
     4  t ← 1
     5  forall x_i ∈ U do
     6      X_{i,t} ← performance(x_i)        // evaluate individual
     7      LB_i ← a, UB_i ← b                // init lower and upper bounds of averaged return
     8  u_t ← |U|
     9  while t < t_limit^(k) ∧ |S| < µ do
    10      t ← t + 1
    11      u_t ← |U|                         // needed for confidence intervals
            // reevaluate undecided individuals
    12      forall x_i ∈ U do
    13          X_{i,t} ← performance(x_i)
    14          X̂_i ← (1/t) Σ_{t'=1}^{t} X_{i,t'}          // recompute mean
    15          compute new confidence interval
    16              [X̂_i − c_{i,t}, X̂_i + c_{i,t}] using the Hoeffding or empirical Bernstein bound
                // update lower and upper bounds
    17          LB_i ← max{LB_i, X̂_i − c_{i,t}}
    18          UB_i ← min{UB_i, X̂_i + c_{i,t}}
    19      forall x_i ∈ U do
    20          if |{x_j ∈ U | LB_i > UB_j}| ≥ λ − µ − |D| then
                    // probably among the best µ
    21              S ← S ∪ {x_i}             // select
    22              U ← U \ {x_i}
    23          if |{x_j ∈ U | UB_i < LB_j}| ≥ µ − |S| then
                    // probably not among the best µ
    24              D ← D ∪ {x_i}             // discard
    25              U ← U \ {x_i}
        // adapt t_limit depending on whether t_limit was large enough for selection with confidence 1 − δ or not
    26  if |S| = µ then t_limit^(k+1) ← max{3, α^(−1) t_limit^(k)}
    27  else t_limit^(k+1) ← min{α t_limit^(k), t_max}
    28  return ({X̂_1, ..., X̂_λ}, t_limit^(k+1))

Figure 1. Example of a selection race for µ = 2 and λ = 6 (candidate policies are either selected, still undecided and racing, or unselected).

Our algorithm is described in the procedure selectRace. For the rank-based selection procedure used in the ES, we need to know the best µ from λ policies. Initially, all policies are evaluated (line 6) and then labelled undecided (line 3). In every following iteration or racing step, all policies tagged undecided are reevaluated (line 11). The estimated mean performance (line 14) and confidence interval (lines 17 and 18) are updated for each resampled policy (see below). If the lower bound of a policy is better than the upper bounds of at least λ − µ other candidates, it is selected (lines 20–22). Figure 1 illustrates this idea. If the upper bound of a policy is worse than the lower bounds of at least µ other candidates, it is discarded (lines 23–25). If µ individuals are selected or the number of iterations exceeds t_limit, the race is finished.

We compute the confidence intervals using the Hoeffding or empirical Bernstein bound. If a confidence interval is recomputed, the highest lower and lowest upper bounds determined so far are stored. If we want our final decision to be correct with probability of at least 1 − δ, all computed bounds must be valid with probability of at least 1 − δ. By the union bound, this holds if each single bound holds with probability of at least 1 − δ/n_b, where n_b is the maximum number of considered bounds. In our setting, n_b can be upper bounded by λ t_limit, because we have at most t_limit iterations and at most λ evaluations per iteration. Thus, after t evaluations of policy x_i we get a confidence interval [X̂_{i,t} − c_{i,t}, X̂_{i,t} + c_{i,t}] with

    c_{i,t}^Hoeffding = R √( (log(2 n_b) − log δ) / (2t) )

using the Hoeffding bound and

    c_{i,t}^Bernstein = σ̂_{i,t} √( 2 (log(3 n_b) − log δ) / t ) + 3R (log(3 n_b) − log δ) / t

using the empirical Bernstein bound. The approximation n_b = λ t_limit is much too loose because it does not consider that already selected or discarded individuals are not reevaluated. Therefore we use in iteration t

    n_{b,t} = Σ_{k=1}^{t−1} u_k + (t_limit − t + 1) u_t,

where u_t is the number of individuals labeled undecided in iteration t.

We combine this selection algorithm based on racing with the CMA-ES and call the resulting algorithms Hoeffding- and Bernstein-Race-CMA-ES depending on the way we compute the confidence intervals. By construction, our selection algorithm has the following property:

Proposition 1. Let {x_1, ..., x_λ} be a set of individuals with fitness values almost surely between a and b, µ < λ, t_limit ≥ 3, and δ ∈ ]0, 1]. If the above procedure selectRace selects a set S of policies, then with probability of at least 1 − δ these elements belong to the best µ out of {x_1, ..., x_λ} in terms of mean performance.
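Before turning to how t_limit is adapted across generations, here is a schematic Python version of selectRace under simplifying assumptions: a single radius(returns) callback stands in for lines 15–18 (including the union-bound correction via n_b), and the race-length adaptation of lines 26–27 is included with factor alpha. All names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def select_race(policies, mu, evaluate, radius, a, b, t_limit, t_max, alpha=1.5):
    """Race-based selection of the mu best policies (simplified sketch of selectRace)."""
    lam = len(policies)
    returns = [[evaluate(p)] for p in policies]        # one initial roll-out per policy
    lb, ub = np.full(lam, a, float), np.full(lam, b, float)
    selected, discarded, undecided = set(), set(), set(range(lam))
    t = 1
    while t < t_limit and len(selected) < mu:
        t += 1
        for i in undecided:                            # reevaluate undecided policies
            returns[i].append(evaluate(policies[i]))
            m, c = np.mean(returns[i]), radius(returns[i])
            lb[i], ub[i] = max(lb[i], m - c), min(ub[i], m + c)
        for i in list(undecided):
            others = [j for j in undecided if j != i]
            if sum(lb[i] > ub[j] for j in others) >= lam - mu - len(discarded):
                selected.add(i); undecided.discard(i)   # probably among the best mu
            elif sum(ub[i] < lb[j] for j in others) >= mu - len(selected):
                discarded.add(i); undecided.discard(i)  # probably not among the best mu
    # adapt the race length for the next generation (lines 26-27)
    t_limit = max(3, t_limit / alpha) if len(selected) == mu else min(alpha * t_limit, t_max)
    return np.array([np.mean(r) for r in returns]), t_limit
```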
If after t_limit iterations fewer than µ individuals were selected, the best (according to their estimated mean) of the not yet discarded policies are considered in the CMA-ES. If this happens, t_limit was too small, and therefore the maximum number of iterations is increased by a factor of α > 1 in the next generation, upper bounded by a threshold t_max (line 27). If t_limit iterations were sufficient, t_limit is reduced by a factor of α^(−1) (line 26), which has a positive effect on the bounds and therefore can speed up future races. Thus, the selection procedure adapts the overall budget of policy evaluations and distributes the available evaluations among the candidate policies.

Related work. Stagge (1998) and Schmidt et al. (2006) propose methods for uncertainty handling in rank-based evolution strategies for noisy function optimization. However, both heuristics assume a particular distribution of the performance samples. Heidrich-Meisner and Igel (2009) use the uncertainty handling CMA-ES (UH-CMA-ES) developed by Hansen et al. (2009) for reliably ranking policies in RL. However, this heuristic only adjusts globally in each generation the number of roll-outs considered for evaluation. Racing algorithms have been proposed in the domain of evolutionary computation for the empirical evaluation of different evolutionary algorithms (e.g., different external strategy parameter configurations) (Birattari et al., 2002; Yuan & Gallagher, 2004), but to the best of our knowledge not yet for selection.

4. Experiments

Benchmark problems. We consider RL benchmarks taken from the literature. The mountain car task serves as a minimal example. Here an underpowered car has to be driven out of a valley to the goal state at the hilltop (Sutton & Barto, 1998). The state s = [x, ẋ]^T of the system is given by the position x ∈ [−1.2, 0.6] of the car and by its current velocity ẋ ∈ [−0.07, 0.07]. Actions are discrete forces applied to the car, a ∈ {−a_max, 0, a_max}. In every time step a reward of −1 is given to the agent. The initial state is uniformly drawn from [−1.2, 0.6] × [−0.07, 0.07]. In a second scenario we add Gaussian noise N(0, 0.01) to the state observation (Heidrich-Meisner & Igel, 2008), making the underlying Markov decision process partially observable. A trial is successful if the final policy allows the car to reach the hilltop in less than 100 time steps on average for the fully observable task and in less than 120 time steps on average for the partially observable task.

As a higher dimensional and more challenging task we consider the swimmer problem (Coulom, 2002). The swimmer consists of n_c compartments floating on a liquid surface. The swimmer is supposed to move its center of gravity as fast as possible in a predefined direction. The state description s = [A_0, θ_1, ..., θ_{n_c}, Ȧ_0, θ̇_1, ..., θ̇_{n_c}]^T includes the position of the end point of the first compartment A_0 (marking the "head" of the swimmer), the angle θ_i of the i-th part (i = 1, ..., n_c) with respect to the x-axis, the corresponding velocity Ȧ_0 of the "head", and the angular velocities θ̇_1, ..., θ̇_{n_c} for each part. Actions a = [a_1, ..., a_{n_c−1}]^T are torques applied between body parts. The reward given to the swimmer is the velocity component of the swimmer's center of gravity parallel to the x-axis. We considered two swimmers with n_c ∈ {3, 4}. The swimmer's initial position is set to 0 and the initial angular velocities are each drawn uniformly from [0, π]. A swimmer trial is called successful if the average velocity of the swimmer's center of gravity is larger than 3n_c/40 m/s, i.e., the swimmer covers a distance of at least one and a half of its length in the desired direction in the simulated time span of 20 s.

Baseline algorithms. In order to judge the performance of the CMA-ES and the two Race-CMA-ESs for RL, we consider two alternative methods for comparison. First, we use simple random search (random weight guessing, RWG) as a baseline for evaluation. In every iteration new policy parameters are drawn uniformly from an interval [−x_max, x_max]^n, where n is the number of policy parameters, and the best solution is maintained (see the sketch below). Second, we apply the efficient episodic natural actor-critic algorithm (NAC) according to Peters and Schaal (2008), a state-of-the-art policy gradient method. The NAC is, like the CMA-ES, a variable metric algorithm operating on a predefined policy class.
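For completeness, a minimal sketch of the RWG baseline described above; the evaluate function, x_max, and the iteration budget are assumptions.

```python
import numpy as np

def random_weight_guessing(evaluate, n, x_max, iterations):
    """Keep the best of `iterations` uniformly drawn linear policy parameter vectors."""
    best_x, best_return = None, -np.inf
    for _ in range(iterations):
        x = np.random.uniform(-x_max, x_max, size=n)   # draw new policy parameters
        ret = evaluate(x)                              # estimated mean return
        if ret > best_return:
            best_x, best_return = x, ret
    return best_x, best_return
```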
Experimental setup. In all four benchmark problems uncertainty arises from random start states. In the second of the two mountain car tasks we additionally have to cope with noisy observations. We always consider the same type of linear policies in order to allow for a fair comparison. The linear policies examined here are typically used with the NAC (Peters & Schaal, 2008). More sophisticated choices of policy classes certainly improve the performance of the CMA-ES, which for instance works fine with nonlinear neural networks (Pellecchia et al., 2005; Siebel & Sommer, 2007; Gomez et al., 2008). Thus all methods operate on the same policy class π_x^deter(s) = x^T s with s, x ∈ R^n. For learning, the NAC uses the stochastic policy π_x^stoch(s, a) = N(π_x^deter(s), σ_NAC), where the standard deviation σ_NAC is considered as an additional adaptive parameter of the policy gradient method. The NAC is evaluated on the corresponding deterministic policy. The policy parameters (except the exploration parameter σ_NAC for the NAC) are always initialized with zero. For the swimmer task the action consists of n_c − 1 continuous values and we apply an independent linear policy for each action component, thus the search space is 2(n_c² − 1)-dimensional.

For the independent evaluation of the algorithms (e.g., in the following plots), we determine the median of 50 roll-outs for the mountain car tasks and of 10 roll-outs for the swimmer tasks. For the mountain car problems we test for the CMA-ES all combinations of initial global step size σ^(0) ∈ {0.1, 1, 5, 10, 15, 25, 50, 100} and sample size n_eval ∈ {1, 10, 20, 30}, and for the NAC all combinations of initial exploration σ_NAC ∈ {0.1, 1, 5, 10, 15, 25, 50, 100}, learning rate ∈ {0.1, 0.01, 0.001, 0.0001}, and sample size n_eval ∈ {n + 2, 2(n + 2), 3(n + 2)} (the NAC needs a minimum of n + 2 roll-outs per policy update). Since the swimmer problem is the more computationally demanding one, we restricted σ^(0) and σ_NAC to {1, 10, 50} for this task. For the Hoeffding- and Bernstein-Race-CMA-ES we always test all combinations of confidence δ ∈ {0.01, 0.05, 0.1} and σ^(0) ∈ {1, 10, 50}. In each trial we initialize t_limit = 3 (at least three roll-outs are necessary to compute the empirical Bernstein bounds), and we use α = 1.5 and t_max = 50 as upper bound on t_limit.
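As an illustration of the shared policy class described above, a small sketch of the deterministic linear policy and of sampling an action from the NAC's Gaussian exploration policy; the names are mine and σ_NAC is interpreted here as a standard deviation.

```python
import numpy as np

def policy_deterministic(x, s):
    """Deterministic linear policy: pi_x(s) = x^T s."""
    return float(np.dot(x, s))

def policy_stochastic(x, s, sigma_nac, rng=np.random):
    """Gaussian exploration around x^T s; sigma_nac is the NAC's adaptive exploration parameter."""
    return rng.normal(policy_deterministic(x, s), sigma_nac)
```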
Results. In Figs. 2 and 3 the performances of RWG, CMA-ES, NAC, and Bernstein- and Hoeffding-Race-CMA-ES are shown for the four test scenarios. For all methods the performance for the best respective parameter configuration is plotted. The race-based selections clearly increased the learning speed.

RWG, NAC, and CMA-ES performed best for sample sizes n_eval that were large, but not too large. Among the methods without uncertainty handling, the CMA-ES was more robust against uncertainty than NAC and RWG when looking at the best respective parameter configurations. In the fully observable mountain car problem the NAC performed exceptionally well because it benefits from a policy initialization close to an optimum (Heidrich-Meisner & Igel, 2008). However, in the partially observable mountain car problem the Race-CMA-ESs outperformed all other methods.

Figure 2. Median of performance over 500 trials for RWG, NAC, CMA-ES, Bernstein-Race-CMA-ES, and Hoeffding-Race-CMA-ES in the a) fully observable and b) partially observable mountain car task. The respective best parameter configuration is plotted.

Figure 3. Median of performance over 20 trials for RWG, NAC, CMA-ES, Bernstein-Race-CMA-ES, and Hoeffding-Race-CMA-ES for swimmers with a) 3 and b) 4 segments. The respective best parameter configuration is plotted.

For both mountain car tasks the Hoeffding-Race-CMA-ES performed better than the Bernstein-Race-CMA-ES. Initializing the start state of the mountain car randomly led to a large standard deviation σ̂_{i,t} compared to the range R, and therefore the Hoeffding bound was tighter than the empirical Bernstein bound. In the swimmer tasks the empirical Bernstein bound was tighter than the Hoeffding bound, and the Bernstein-Race-CMA-ES outperformed the Hoeffding-Race-CMA-ES.
Figure 4. Median of the performance over 20 trials for the Bernstein-Race-CMA-ES with δ = 0.05 and σ^(0) = 1 for different maximal race lengths t_max ∈ {25, 50, 100} in the swimmer task with n_c = 4.

Figure 5. Median of performance over 500 trials for the Hoeffding-Race-CMA-ES with σ^(0) = 50 in the partially observable mountain car task for different confidence levels δ ∈ {0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. The race length t_limit selected by the Hoeffding-Race-CMA-ES (t_max = 50, δ = 0.05) for four selected trials is plotted in the inset.

The performance of both Race-CMA-ESs depended on the upper bound t_max of the race length t_limit, as shown in Fig. 4 for the Bernstein-Race-CMA-ES. Choosing t_max too large results in very costly policy updates and unnecessarily slow learning. This can be seen in the swimmer task with n_c = 4, where both Race-CMA-ESs performed poorly even though they initially improved the learning speed. The inset in Fig. 5 shows the values of t_limit chosen by the Bernstein-Race-CMA-ES for single trials. At the beginning small sample sizes were used, and when the accuracy was no longer sufficient the sample size was increased. Finally, for fine tuning, the maximal allowed sample size of t_max = 50 was used most of the time. As can be seen in Fig. 5, a lower confidence improved the performance in our experiments, because in the initial learning phase, which is crucial for the overall performance in the simple mountain car task, a high confidence is not needed.

But the Race-CMA-ESs not only improved the learning speed. They also proved to be very robust against changes in their hyperparameter values. For the mountain car task both the Bernstein- and the Hoeffding-Race-CMA-ES were successful in more than 80% of the 500 trials for 9 out of 9 tested parameter configurations, while the standard CMA-ES was successful in more than 80% of the trials in only 18 of 32 cases, the NAC in 60 of 96 cases, and RWG in 7 of 28 cases. In the partially observable mountain car task, the Bernstein- and the Hoeffding-Race-CMA-ES were again successful in more than 80% of the trials for 9 out of 9 tested parameter configurations. CMA-ES, NAC, and RWG achieved a success level of 80% in 3 of 3, 6 of 12, and 0 of 28 cases, respectively. (In this case only the most effective fixed sample size was tested for the CMA-ES and the NAC.) For swimmers with 3 compartments the Race-CMA-ESs were also robust: the Bernstein-Race-CMA-ES was successful in more than 50% of all trials for 6 of 9 and the Hoeffding-Race-CMA-ES for 5 of 9 parameter configurations. For swimmers with n_c = 4 both Race-CMA-ESs were not successful, presumably because the value of t_max was too large. The CMA-ES achieved successes in more than 50% of the trials in 11 of 12 and 5 of 12 cases for n_c ∈ {3, 4}, while the NAC and RWG reached this success level in none of the configurations. Thus, uncertainty handling through selection races remarkably sped up learning and at the same time improved the robustness compared to the CMA-ES without uncertainty handling, which is already much more robust than the policy gradient method.

5. Discussion and Conclusion

Evolution strategies (ESs) are powerful direct policy search methods. One of their main advantages is their ability to cope with uncertainty and noise. Still, random elements in the environment require gathering statistics over several episodes for the evaluation of candidate policies. We added a new adaptive uncertainty handling to evolutionary reinforcement learning. It adjusts the number of roll-outs per evaluation of a policy such that the signal-to-noise ratio is just high enough for a sufficiently good ranking of candidate policies, which in turn suffices for the ES to find better solutions. The uncertainty handling exploits the advantage of small sample sizes in the beginning and increases the sample size when a higher accuracy is necessary. It balances the trade-off between fast learning and sufficient accuracy.
This significantly increases both learning speed and robustness. The statistically sound uncertainty handling scheme is independent of the CMA-ES and could be combined with other RL approaches (e.g., for evolutionary online RL, Whiteson & Stone, 2006). Moreover, it is not limited to RL and may also be adopted for general noisy function approximation tasks.

References

Audibert, J.-Y., Munos, R., & Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. 18th International Conference on Algorithmic Learning Theory (ALT) (pp. 150–165). Springer-Verlag.

Beyer, H.-G. (2007). Evolution strategies. Scholarpedia, 2, 1965.

Birattari, M., Stützle, T., Paquete, L., & Varrentrapp, K. (2002). A racing algorithm for configuring metaheuristics. Genetic and Evolutionary Computation Conference (GECCO 2002) (pp. 11–18). Morgan Kaufmann Publishers.

Coulom, R. (2002). Apprentissage par renforcement utilisant des réseaux de neurones, avec des applications au contrôle moteur [Reinforcement learning using neural networks, with applications to motor control]. Doctoral dissertation, Institut National Polytechnique de Grenoble.

Gomez, F., Schmidhuber, J., & Miikkulainen, R. (2008). Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9, 937–965.

Hansen, N., Müller, S. D., & Koumoutsakos, P. (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11, 1–18.
Hansen, N., Niederberger, A. S. P., Guzzella, L., & Koumoutsakos, P. (2009). A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation, 13, 180–197.

Heidrich-Meisner, V., & Igel, C. (2008). Variable metric reinforcement learning methods applied to the noisy mountain car problem. European Workshop on Reinforcement Learning (EWRL 2008) (pp. 136–150). Springer-Verlag.

Heidrich-Meisner, V., & Igel, C. (2009). Uncertainty handling CMA-ES for reinforcement learning. Genetic and Evolutionary Computation Conference (GECCO 2009). ACM Press.

Maron, O., & Moore, A. W. (1994). Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems (pp. 59–66). Morgan Kaufmann Publishers.

Maron, O., & Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11, 193–225.

Mnih, V., Szepesvári, C., & Audibert, J.-Y. (2008). Empirical Bernstein stopping. Proceedings of the 25th International Conference on Machine Learning (ICML 2008) (pp. 672–679).

Pellecchia, A., Igel, C., Edelbrunner, J., & Schöner, G. (2005). Making driver modeling attractive. IEEE Intelligent Systems, 20, 8–12.

Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71, 1180–1190.

Schmidt, C., Branke, J., & Chick, S. (2006). Integrating techniques from statistical ranking into evolutionary algorithms. Applications of Evolutionary Computing (pp. 752–763). Springer-Verlag.

Siebel, N. T., & Sommer, G. (2007). Evolutionary reinforcement learning of artificial neural networks. International Journal of Hybrid Intelligent Systems, 4, 171–183.

Stagge, P. (1998). Averaging efficiently in the presence of noise. Parallel Problem Solving from Nature (PPSN V) (pp. 188–197). Springer-Verlag.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. MIT Press.

Suttorp, T., Hansen, N., & Igel, C. (2009). Efficient covariance matrix update for variable metric evolution strategies. Machine Learning, 75, 167–197.

Whiteson, S., & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7, 877–917.

Yuan, B., & Gallagher, M. (2004). Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. Parallel Problem Solving from Nature (PPSN VIII) (pp. 172–181). Springer-Verlag.