Stochastic Search using the Natural Gradient

Sun Yi Daan Wierstra Tom Schaul J¨ rgen Schmidhuber u IDSIA, Galleria 2, Manno 6928, Switzerland

yi@idsia.ch daan@idsia.ch tom@idsia.ch juergen@idsia.ch

Abstract
To optimize unknown `fitness' functions, we present Natural Evolution Strategies, a novel algorithm that constitutes a principled alternative to standard stochastic search methods. It maintains a multinormal distribution on the set of solution candidates. The Natural Gradient is used to update the distribution's parameters in the direction of higher expected fitness, by efficiently calculating the inverse of the exact Fisher information matrix whereas previous methods had to use approximations. Other novel aspects of our method include optimal fitness baselines and importance mixing, a procedure adjusting batches with minimal numbers of fitness evaluations. The algorithm yields competitive results on a number of benchmarks.

chastic search in continuous search spaces, which is less ad-hoc than traditional stochastic search methods. Our algorithm maintains and iteratively updates a multinormal distribution on the search space of solution candidates. Its parameters are updated by estimating and following a natural search gradient, (i.e., the natural gradient on the parameters of the search distribution), towards better expected fitness. Numerous advantages of the natural gradient have been demonstrated, including its ability of providing isotropic convergence on ill-shaped function landscapes, avoiding slow or premature convergence to which `vanilla' (regular) gradients are especially prone (Amari & Douglas, 1998; Peters & Schaal, 2008). In our algorithm, the natural search gradient is calculated using the exact Fisher information matrix (FIM) and the Monte Carlo-estimated gradient, yielding robust performance (objective 1). To reduce the number of potentially costly evaluations (objective 2), we introduce importance mixing, which entails the reuse of samples from previous batches while keeping the sample distribution conformed to the current search distribution. To keep the computational cost manageable in higher problem dimensions (objective 3), we derive a novel, efficient algorithm for computing the inverse of the exact Fisher information matrix (previous methods were either inefficient or approximate). The resulting algorithm, Natural Evolution Strategies (NES), is elegant, requires no additional heuristics and has few parameters that need tuning.

1. Introduction
Stochastic search methods aim to optimize a `fitness' function that is either unknown or too complex to model directly. This general framework is known as `black box' optimization (Hansen & Ostermeier, 2001). It allows domain experts to search for good or near-optimal solutions to numerous difficult real-world problems in areas ranging from medicine and finance to control and robotics. Typically, three objectives have to be kept in mind when developing stochastic search algorithms: (1) robust performance in terms of fitness; (2) limiting the number of (potentially costly) fitness evaluations; (3) scalability with problem dimensionality. We address these issues through a new, principled method for stoAppearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

2. Search Gradients
First let us introduce the algorithm framework and the concept of search gradients. The objective is to maximize a d-dimensional unknown fitness function f : Rd  R, while keeping the number of function evaluations ­ which are considered costly ­ as low as

Stochastic Search using the Natural Gradient

possible. The algorithm iteratively evaluates a batch n samples z1 . . . zn generated from the search distribution p (z|). It then uses the fitness evaluations f (z1 ) . . . f (zn ) to adjust parameters  of the search distribution. Let J () = E [f (z) |] be the expected fitness under search distribution p (z|), namely, J () = f (z) p (z|) dz.

Now we compute g (z|) ln p (z|) 1 d 2 ln |A| =  { ln 2 - 2 2 1 - A- (z - x) A- (z - x) }, 2


=

where g (z|) is assumed to be a ds -dimensional column vector. The gradient w.r.t. x is simply
x

ln p (z|) = C- (z - x) .

The core idea of our approach is to find, at each iteration, a small adjustment , such that the expected fitness J ( + ) is increased. The most straightforward approach is to set    J (), where  J () is the gradient on J (). Using the `log likelihood trick', the gradient can be written as
J

The gradient w.r.t. ai,j (i  j) can be derived as  ln p (z|) = ri,j -  (i, j) a-1 , i,i ai,j where ri,j is the (i, j)-th element of R = A- (z - x) (z - x) C- and  (i, j) is the Kronecker Delta function.
s From g (z|), the search gradient  J () can 1 s be computed as  J () = n Gf , where G = [g (z1 |) . . . g (zn |)], and f = [f (z1 ) . . . f (zn )] . We update  by  =  s J (), where  is an empirically  tuned step size.

() = = = =


f (z) p (z|) dz f (z)
 p (z|) dz  p (z|) dz

p (z|) f (z) p (z|) p (z|) · (f (z)


ln p (z|)) dz,.

The last term can be approximated using Monte Carlo:
s J

() =

1 n

n i=1

3. Using the Natural Gradient
f (zi )


ln p (zi |) ,

where

s J

() denotes the estimated search gradient.

In our algorithm, we assume that p (z|) is a Gaussian distribution with parameters  = x, A , where x represents the mean, and A represents the Cholesky decomposition of the covariance matrix C, such that A is upper triangular matrix and1 C = A A. The reason why we choose A instead of C as primary parameter is twofold. First, A makes explicit the d (d + 1) /2 independent parameters determining the covariance matrix C. Second, the diagonal elements of A are the square roots of the eigenvalues of C, so A A is always positive semidefinite. In the rest of the text, we assume  is column vector of dimension ds = d + d (d + 1) /2 with elements in x, A arranged as 0 , 1 . . . d .

Vanilla gradient methods have been shown to converge slowly in fitness landscapes with ridges and plateaus. Natural gradients (Amari et al., 1995; Kakade, 2001) constitute a principled approach for dealing with such problems. The natural gradient, unlike the vanilla gradient, has the advantage of always pointing in the direction of the steepest ascent. Furthermore, since the natural gradient is invariant w.r.t. the particular parameterization of the search distribution, it can cope with ill-shaped fitness landscapes and provides isotropic convergence properties, which prevents premature convergence on plateaus and avoids overaggressive steps on ridges (Amari, 1998). In this paper, we consider a special case of the natural gradient ~  J, with the metric in parameter space defined by the KL divergence (Peters, 2007). The natural gradient is thus defined by the necessary condition F ~ J =
 J,

Here 0 = x and k = [ak,k . . . ak,d ] for 1  k  d, where ai,j (i  j) denotes the (i, j)-th element of A.
For any matrix Q, Q- denotes its inverse and Q denotes its transpose.
1

with F being the Fisher information matrix: F=E


ln p (z|)


ln p (z|)

.

Stochastic Search using the Natural Gradient

If F is invertible, which may not always be the case, the natural gradient can be uniquely identified by ~  J = F-  J, or estimated from data using F- s J. The  adjustment  can then be computed by  = F-
s  J.

Using the relation C  A A = A A= A+A , ai,j ai,j ai,j ai,j and the properties of the trace, we get (FA )m,n = tr A- + tr A A A- aim ,jm ain ,jn

As we show below, the FIM can indeed be computed exactly and is invertible. The original NES (Wierstra et al., 2008b) algorithm computes the natural search gradient using the empirical Fisher information matrix, which is estimated from the current batch. This approach has three important disadvantages. First, the empirical FIM is not guaranteed to be invertible, which could result in unstable estimations. Second, a large batch size would be required to approximate the exact FIM up to a reasonable precision. Third, it is highly inefficient to invert the empirical FIM, a matrix with O d4 elements. We circumvent these problems by computing the exact FIM directly from search parameters , avoiding the potentially unstable and computationally costly method of estimating the empirical FIM from a batch which in turn was generated from . In NES, the search distribution is the Gaussian defined by  = x, A , the precise FIM F can be computed analytically. Namely, the (m, n)-th element in F is given by (F)m,n = 1 C - C x x C- + tr C- C m n 2 m n ,

A A C- . aim ,jm ain ,jn

Computing the first term gives us tr A- A A A- aim ,jm ain ,jn = A-
jn ,im

A-

jm ,in

.

Note that since A is upper triangular, A- is also upper triangular, so the first summand is non-zero iff in = im = jn = jm . In this case, (A- )jn ,im = (A- )jm ,in = a-1 m , so jn ,i tr A- A A A- aim ,jm ain ,jn = a-2,in  (im , in , jm , jn ) . im

Here  (·) is the generalized Kronecker Delta function, i.e.  (im , in , jm , jn ) = 1 iff all four indices are the same. The second term is computed as tr A A C- aim ,jm ain ,jn = C-  (in , im ) .

jn ,jm

where m , n denotes the m-th and n-th element in . Let im , jm be the aim ,jm such that it appears at the (d + m)-th position in . First, notice that x x C- = C- xi xj and x x x x C- = C- = 0. ai1 ,j1 ai2 ,j2 xi aj,k So the upper left corner of the FIM is C- , and F has the following shape F= C- 0 0 FA . , i,j

Therefore, we have (FA )m,n = C-
jn ,jm

 (in , im )+a-2,in  (im , in , jm , jn ) . im

It can easily be proven that FA itself is a block diagonal matrix with d blocks along the diagonal, with sizes ranging from d to 1. Therefore, the precise FIM is given by   F0   F1   F = , ..   . Fd with F0 = C- and block Fk (d  k  1) given by Fk = a-2 k,k 0 0 0 + Dk .

The next step is to compute FA . Note that (FA )m,n = 1 C C tr C- C- . 2 aim ,jm ain ,jn

Here Dk is the lower-right square submatrix of C- with dimension d + 1 - k, e.g. D1 = C- , and Dd = (C- )d,d .

Stochastic Search using the Natural Gradient

We prove that the FIM given above is invertible if C is invertible. Fk (1  k  d) being invertible follows from the fact that the submatrix Dk on the main diagonal of a positive definite matrix C- must also be positive definite, and adding a-2 > 0 to the diagonal k,k would not decrease any of its eigenvalues. Also note that F0 = C- is invertible, so F is invertible. It is worth pointing out that the block diagonal structure of F partitions parameters  into d + 1 orthogonal groups 0 . . . k , which suggests that we could modify each group of parameters without affecting other groups. Normally, computing the inverse of F, of size d2 by d2 , would be of complexity O d6 , which is intractable for most practical problems. Realizing that we can invert each block Fk separately, the complexity can be reduced to O d4 . In a companion paper (Sun et al., 2009), we present an iterative method for computing F- which further reduces the time complexity from O d4 to O d3 . Additionally it shows that the space complexity can be reduced to O d2 which is linear in the number of parameters.

However, the variance depends on the value of b, i.e. Var ()  b2 E -2bE F- G1 F- Gf F- G1 F- G1 + const.

Here 1 denotes a n-by-1 vector filled with 1s. The optimal value of the baseline is E (F- Gf ) (F- G1) E (F- G1) (F- G1) Assuming samples are i.i.d., b can be approximated from data by b
n - - i=1 f (zi ) (F g (zi )) (F g (zi )) . n - - i=1 (F g (zi )) (F g (zi ))

b=

.

In order to further reduce the estimation covariance, we can utilize a parameter-specific baseline for each parameter j individually, which is given by bj = E [(hj Gf ) (hj G1)] E [(hj G1) (hj G1)]
n 2 i=1 f (zi ) (hj g (zi )) . n 2 i=1 (hj g (zi ))

4. Optimal Fitness Baselines
The concept of fitness baselines, first introduced in (Wierstra et al., 2008b), constitutes an efficient variance reduction method for estimating . However, baselines as found in (Peters, 2007) are suboptimal w.r.t. the variance of , since this FIM may not be invertible. It is difficult to formulate the variance of  directly. However, since the exact FIM is invertible and can be computed efficiently, we can in fact compute an optimal baseline for minimizing the variance of , given by Var () =  2 E[ F- · F-
s J s J

Here hj is the j-th row vector of F- . However, the main disadvantage of parameter-specific baselines is that different baselines values are used for closely correlated parameters, which renders gradient estimations less reliable. In order to address such problems, we follow the intuition that if the (m, n)-th element in the FIM is 0, then parameters m and n are orthogonal and only weakly correlated. Therefore, we propose using the block fitness baseline, i.e. a single baseline bk for each group of parameters k , 0  k  d. Its formulation is given by bk = E F- Gk f k E F- Gk 1 k F- Gk 1 k F- Gk 1 k F- gk (zi ) k F- gk (zi ) k .

- E F-
s J

s J

- E F-

],

where s J is the estimated search gradient, which is  given by
s J

=

1 n

n i=1

[f (zi ) - b]


ln p (zi |) .

n - k i=1 f (zi ) Fk g (zi ) n - k i=1 Fk g (zi )

The scalar b is called the fitness baseline. Adding b does not affect the expectation of s J, since  E[
s  J]

Here F- denotes the inverse of the k-th diagonal block k of F- , while Gk and gk denote the submatrices corresponding to differentiation w.r.t. k .

= =


(f (z) - b) p (z|) dz f (z) p (z|) dz.

5. Importance Mixing
In each batch, we evaluate n new samples generated from search distribution p (z|). However, since small updates ensure that the KL divergence between consecutive search distributions is generally small, most


Stochastic Search using the Natural Gradient

new samples will fall in the high density area of the previous search distribution p (z| ). This leads to redundant fitness evaluations in that same area. Our solution to this problem is a new procedure called importance mixing, which aims to reuse fitness evaluations from the previous batch, while ensuring the updated batch conforms to the new search distribution. Importance mixing works in two steps: In the first step, rejection sampling is performed on the previous batch, such that sample z is accepted with probability min 1, (1 - ) p (z|) p (z| ) .

So the probability of a sample entering the batch is just p (z|) in that region. The same result holds also for the region where (1 - ) p (z|) /p (z| ) > 1.

6. The Algorithm
Integrating all the algorithm elements introduced above, Natural Evolution Strategies (with block fitness baselines) can be summarized as 1 2 3 4 5 6 7 8 9 10 11 12 13 initialize A  I repeat compute A- , and C- = A- A- update batch using importance mixing evaluate f (zi ) for new zi compute the gradient G for k = d to 0 compute the exact FIM inverse F- k compute the baseline bk k  F- Gk f - bk k end    +  until stopping criteria is met

Here   [0, 1] is the minimal refresh rate. Let na be the number of samples accepted in the first step. In the second step, reverse rejection sampling is performed as follows: Generate samples from p (z|) and accept z with probability max , 1 - p (z| ) p (z|)

until n-na new samples are accepted. The na samples from the old batch and n - na newly accepted samples together constitute the new batch. Note that only the fitnesses of the newly accepted samples need to be evaluated. The advantage of using importance mixing is twofold: On the one hand, we reduce the number of fitness evaluations required in each batch, on the other hand, if we fix the number of newly evaluated fitnesses, then many more fitness evaluations can potentially be used to yield more reliable and accurate gradients. The minimal refresh rate  lower bounds the expected proportion of newly evaluated samples  = E n-na , n namely   , with the equality holding iff  =  . In particular, if  = 1, all samples from the previous batch will be discarded, and if  = 0,  depends only on the distance between p (z|) and p (z| ). Normally we set  to be a small positive number, e.g. 0.01, to avoid too low an acceptance probability at the second step when p (z| ) /p (z|) 1. It can be proven that the updated batch conforms to the search distribution p (z|). In the region where (1 - ) p (z|) /p (z| )  1, the probability that a sample from previous batches appears in the new batch is p (z| ) · (1 - ) p (z|) /p (z| ) = (1 - ) p (z|) . The probability that a sample generated from the second step entering the batch is p (z|), since max {, 1 - p (z| ) /p (z|)} = .

Assuming that n scales linearly with d, the complexity of our algorithm is O d3 (Sun et al., 2009). This is a significant improvement over the original NES, whose complexity is O d6 . Implementations of NES are available in both Python and Matlab2 .

7. Experiments
The tunable parameters of Natural Evolution Strategies are comprised of the batch size n, the learning rate , the refresh rate  and the fitness shaping function. In addition, three kinds of fitness baselines can be used. We empirically find a good and robust choice for the learning rate  to be 1.0. On some (but not all) benchmarks the performance can be further improved by more aggressive updates. Therefore, the only parameter that needs tuning in practice is the batch size, which is dependent on both the expected ruggedness of the fitness landscape and the problem dimensionality. For problems with wildly fluctuating fitnesses, the gradient is disproportionately distorted by extreme fitness values, which can lead to premature convergence or numerical instability. To overcome this problem, we
The Python code is part of the PyBrain machine learning library (www.pybrain.org) and the Matlab code is available at www.idsia.ch/~sun/enes.html
2

Stochastic Search using the Natural Gradient

use fitness shaping, an order-preserving nonlinear fitness transformation function (Wierstra et al., 2008b). The choice of (monotonically increasing) fitness shaping function is arbitrary, and should therefore be considered to be one of the tuning parameters of the algorithm. We have empirically found that ranking-based shaping functions work best for various problems. The shaping function used for all experiments in this paper ^ ^ was fixed to f (z) = 2i - 1 for i > 0.5 and f (z) = 0 for i < 0.5, where i denotes the relative rank of f (z) in the batch, scaled between 0 . . . 1. 7.1. Benchmark Functions We empirically validate our algorithm on 9 unimodal functions out of the set of standard benchmark functions from (Suganthan et al., 2005) and (Hansen & Ostermeier, 2001), that are typically used in the literature, for comparison purposes and for competitions. We randomly choose the inital guess at average distance 1 from the optimum. In order to prevent potentially biased results, we follow (Suganthan et al., 2005) and consistently transform (by a combined rotation and translation) the functions' inputs, making the variables non-separable and avoiding trivial optima (e.g. at the origin). This immediately renders many other methods virtually useless, since they cannot cope with correlated search directions. NES, however, is invariant under translation and rotation. In addition, the rank-based fitness shaping makes it invariant under order-preserving transformations of the fitness function. 7.2. Fitness Baselines and Importance Mixing We introduced optimal fitness baselines to increase the algorithm's robustness, we can thus determine their effectiveness by comparing the probability of premature convergence, for each type of baseline, on a diverse set of benchmarks. Importance mixing, on the other hand, was designed to reduce the required number of fitness evaluations. In order to measure the benefits of both enhancements, as well as their interplay, we conducted a set of experiments for 16 different settings. Each set consisted in 10 independent runs on each of the 8 unimodal benchmark functions (on dimension 5), using n = 50 and  = 1.0. We varied the value of  between 0.0 and 1.0, the latter corresponding to not using importance mixing at all. We compare the three types of optimal fitness baselines introduced in section 4 to using no baseline. Table 1 summarizes the results. Not using any baseline shows equivalent behavior to using the uniform fitness baseline. In both cases, the

Table 1. Average number of evaluations and percentage of runs that prematurely converged, while varying  and the type of fitness baseline used.

Baseline None None None None Uniform Uniform Uniform Uniform Specific Specific Specific Specific Block Block Block Block

 0.0 0.1 0.2 1.0 0.0 0.1 0.2 1.0 0.0 0.1 0.2 1.0 0.0 0.1 0.2 1.0

evaluations 1285 ± 739 1456 ± 533 2011 ± 650 8306 ± 2756 1251 ± 646 1488 ± 650 2060 ± 775 8510 ± 3158 1405 ± 866 2181 ± 2801 2430 ± 1769 7973 ± 2407 1329 ± 662 1813 ± 725 2481 ± 919 8199 ± 2321

premature convergence 52% 42% 40% 35% 50% 37% 42% 33% 33% 29% 27% 25% 0% 0% 0% 0%

algorithm tends to prematurely converge on ParabR and SharpR, to a lesser degree also on Cigar. In contrast, when using the parameter-specific baseline, we find that the algorithm consistently fails on Ellipsoid and Tablet, while working well on ParabR, SharpR and Cigar. Finally, block fitness baselines are very robust and have not been found to prematurely converge in any of our experiments. Importance mixing is clearly beneficial to performance in all cases, but slightly decreases the robustness. The latter is not an issue when using block fitness baselines, which frees us from the requirement of tuning , as a value of 0.0 consistently gives the best performance. However, taking into consideration computation time, it can be prudent to use a slightly larger , for the reasons given in section 5. For the following experiments, we consistently use block fitness baselines and set  = 0.01. 7.3. Performance on Benchmark Functions We ran NES on the set of unimodal benchmark functions on dimension 50 with batch size 1000, using  = 1.0 and a target precision of 10-10 . Figure 1 shows the average performance over 5 runs for each benchmark function. We left out the Rosenbrock function on which NES is one order of magnitude slower than on the other functions (approximately 2,000,000 evaluations). Presumably this is due to the fact that

Stochastic Search using the Natural Gradient

8. Discussion
Cigar DiffPow Ellipsoid ParabR Schwefel SharpR Sphere Tablet
-fitness

Unlike most stochastic search algorithms, NES boasts a relatively clean derivation from first principles. The relationship of NES to methods from other fields, notably evolution strategies (Hansen & Ostermeier, 2001) and policy gradients (Peters & Schaal, 2008; Kakade, 2001), should be evident to readers familiar with both of these domains, as it marries the concept of fitness-based black box optimization from evolutionary methods with the concept of Monte Carlo-based gradient estimation from the policy gradient framework. Using both a full multinormal search distribution and fitness shaping, the NES algorithm is invariant to translation and rotation and to order-preserving transformations of the fitness function. We empirically showed that fitness baselines significantly improve the algorithm's robustness. We also measured the usefulness of importance mixing, which reduces the number of required fitness evaluations by a factor 5, and renders the algorithm's performance less sensitive to the batch size hyperparameter, because the number of effectively evaluated fitness values in each batch is adjusted dynamically. Comparing our empirical results to CMA-ES (Hansen & Ostermeier, 2001), considered by many to be the `industry standard' of stochastic search, we find that NES is competitive but slower on most but not all standard benchmark functions, especially on higher dimensions. On the difficult double-pole balancing benchmark, however, NES yields faster and more robust results. Furthermore, the results in a companion paper (Sun et al., 2009) show that NES is also competitive with CMA-ES on multimodal benchmarks. Our results collectively show that NES can compete with state-of-the-art stochastic search algorithms on standard benchmarks. It holds a lot of promise for ongoing real-world experiments. Future work will address the problems of automatically determining good batch sizes and dynamically adapting the learning rate. We plan to investigate the possibility of combining our algorithm with other methods (e.g. Estimation of Distribution Algorithms) to accelerate the adaptation of covariance matrices, improving performance on fitness landscapes where directions of ridges and valleys change abruptly (e.g. the Rosenbrock benchmark). Moreover, realizing that stochastic search based on the natural gradient is not limited to any particular distribution, we can derive the FIM inverse for other (e.g. heavy-tailed) distributions using the same methodology.

number of evaluations

Figure 1. Results for 8 unimodal benchmark functions on dimension 50, averaged over 5 runs.

the principal search direction is updated too slowly on complex curvatures. Note that SharpR and ParabR are unbounded functions, which explains the abrupt drop-off. 7.4. Non-Markovian Double Pole Balancing Non-Markovian double pole balancing is a challenging task which involves balancing two differently sized poles hinged on a cart that moves on a finite track. The single control consists of the force F applied to the cart and observations include the cart's position and the poles' angles, but no velocity information, which makes this task partially observable. It provides a perfect testbed for algorithms focusing on learning fine control with memory in continuous state and action spaces (Wieland, 1991). We used the implementation as found in (Gomez & Miikkulainen, 1997). We employ NES to optimize the parameters of the controller of the cart, which is implemented as a simple recurrent neural network, with three inputs (position x and the two poles' angles 1 and 2 ), three hidden sigmoid units, and one output, resulting in a total of 21 weights to be optimized. An evaluation is considered a success iff the poles do not fall over for 100, 000 time steps. Using a batch size of 100, the average number of evaluations until success, over 50 runs, was 1753. Only 6% of the runs did not reach success within the limit of 10000 evaluations. Table 2 shows results of other premier algorithms applied to this task, as reported in the literature. NES clearly outperforms all other methods except CoSyNE.

Stochastic Search using the Natural Gradient Table 2. Non-Markovian double pole balancing. Shown are the average numbers of evaluations for SANE (Moriarty & Miikkulainen, 1996), ESP (Gomez & Miikkulainen, 1997), NEAT (Stanley & Miikkulainen, 2002), CMA-ES (Hansen & Ostermeier, 2001), CoSyNE (Gomez et al., 2006), FEM (Wierstra et al., 2008a), and NES.

Method Evaluations

SANE 262, 700

ESP 7, 374

NEAT 6, 929

CMA-ES 3, 521

CoSyNE 1, 249

FEM 2, 099

NES 1, 753

9. Conclusion
NES constitutes a competitive, theoretically wellfounded and relatively simple method for stochastic search. Unlike previous natural gradient methods, NES quickly calculates the inverse of the exact Fisher information matrix. This increases robustness and accuracy of the search gradient estimation, even in higher-dimensional search spaces. Furthermore, importance mixing prevents unnecessary redundancy embodied by samples from earlier batches. Good results on standard benchmarks and the difficult non-Markovian double pole balancing task affirm the promise of this research direction.

Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9, 159­195. Kakade, S. (2001). A natural policy gradient. In Advances in neural information processing systems (NIPS01), 12, 1531­1538. Moriarty, D. E., & Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22, 11­32. Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71, 1180­1190. Peters, J. R. (2007). Machine learning of motor skills for robotics. Doctoral dissertation, Los Angeles, CA, USA. Adviser-Stefan Schaal. Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 99­127. Suganthan, P. N., Hansen, N., Liang, J. J., Deb, K., Chen, Y. P., Auger, A., & Tiwari, S. (2005). Problem definitions and evaluation criteria for the cec 2005 special session on real-parameter optimization (Technical Report). Nanyang Technological University, Singapore. Sun, Y., Wierstra, D., Schaul, T., & Schmidhuber, J. (2009). Efficient natural evolution strategies. To appear in: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO09). Wieland, A. (1991). Evolving neural network controllers for unstable systems. Proceedings of the International Joint Conference on Neural Networks (IJCNN91), 2, 667­673. Wierstra, D., Schaul, T., Peters, J., & Schmidhuber, J. (2008a). Fitness expectation maximization. In Parallel problem solving from nature (PPSN08), 337­ 346. Wierstra, D., Schaul, T., Peters, J., & Schmidhuber, J. (2008b). Natural evolution strategies. Proceedings of the Congress on Evolutionary Computation (CEC08), Hongkong, 3381­3387.

Acknowledgments
We thank Fred Ducatelle for his valuable and timely input. This research was funded by SNF grants 200020-116674/1, 200021-111968/1 and 200021-113364/1.

References
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251­276. Amari, S., Cichocki, A., & Yang, H. (1995). A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems (NIPS95), 8, 757­763. Amari, S., & Douglas, S. C. (1998). Why natural gradient? Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP98), 2, 1213­1216. Gomez, F., & Miikkulainen, R. (1997). Incremental evolution of complex general behavior. Adaptive Behavior, 5, 317­342. Gomez, F., Schmidhuber, J., & Miikkulainen, R. (2006). Efficient non-linear control through neuroevolution. Proceedings of the 16th European Conference on Machine Learning (ECML06), 4212, 654­ 662.