A fast natural Newton method

Nicolas Le Roux (nicolas.le.roux@gmail.com)
Andrew Fitzgibbon (awf@microsoft.com)
Microsoft Research, 7 JJ Thomson Avenue, Cambridge, CB3 0FB, UK

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Nowadays, for many tasks such as object recognition or language modeling, data is plentiful. As such, an important challenge has become to find learning algorithms which can make use of all the available data. In this setting, called "large-scale learning" by Bottou & Bousquet (2008), learning and optimization become different, and powerful optimization algorithms turn out to be suboptimal learning algorithms. While most efforts are focused on adapting optimization algorithms for learning by efficiently using the information contained in the Hessian, Le Roux et al. (2008) exploited the special structure of the learning problem to achieve faster convergence. In this paper, we investigate a natural way of combining these two directions to yield fast and robust learning algorithms.

1. Introduction

Machine learning often looks like optimization: write down the likelihood of some training data under some model and find the model parameters which maximize that likelihood, or which minimize some divergence between the model and the data. In this context, conventional wisdom is that one should find in the optimization literature the state-of-the-art optimizer for one's problem and use it. Furthermore, many machine learning objective functions are smooth in the optimization sense, so second-order optimizers are the tools of choice. And indeed, comparing second-order methods to first-order ones shows significant improvements in learning speed.

However, recent research (Le Roux et al., 2008) has shown that, by paying attention to the special structure of machine learning problems (viewing the gradient obtained as a noisy estimate of the true gradient of the function we are really interested in), one can obtain faster convergence than first-order gradient descent methods without using the Hessian. We investigate whether this improvement is due to the similarity of these methods to approximate second-order methods or, if this is not the case, whether we can combine these improvements with the ones obtained when using the information contained in the Hessian.

The paper is organized as follows: section 2 explores the differences between the optimization and the learning frameworks, and section 3 describes our proposed algorithm combining the Newton method and the natural gradient, which is the basis for the experiments in section 4.

2. Optimization versus learning

2.1. Optimization methods

The goal of optimization is to minimize a function $f$, which we will assume to be twice differentiable and defined from a space $E$ to $\mathbb{R}$, over $E$. This is a problem with a considerable literature (Nocedal & Wright, 2006). It is well known that second-order descent methods, which rely on the Hessian of $f$ (or approximations thereof), enjoy much faster theoretical convergence than first-order methods (quadratic versus linear), even when accounting for the potential complexity of computing and inverting the Hessian. Such methods include the Newton method, Gauss-Newton, Levenberg-Marquardt, and quasi-Newton methods such as BFGS.
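To make this gap concrete, the toy sketch below (ours, not from the paper; the quadratic and all names are made up) contrasts plain gradient descent with the Newton method on an ill-conditioned quadratic, where the Newton step converges immediately while gradient descent crawls along the flat direction.

```python
import numpy as np

# Toy illustration (ours): minimize f(theta) = 0.5 theta^T A theta - b^T theta,
# whose gradient is A theta - b and whose Hessian is A.
A = np.array([[10.0, 0.0], [0.0, 0.1]])   # ill-conditioned Hessian
b = np.array([1.0, 1.0])
opt = np.linalg.solve(A, b)               # true minimizer

def grad(theta):
    return A @ theta - b

theta_gd = np.zeros(2)
theta_newton = np.zeros(2)
lr = 0.19                                 # must satisfy lr < 2 / lambda_max(A)
for t in range(100):
    theta_gd -= lr * grad(theta_gd)                          # first order
    theta_newton -= np.linalg.solve(A, grad(theta_newton))   # second order

print(np.linalg.norm(theta_gd - opt))      # still far after 100 steps
print(np.linalg.norm(theta_newton - opt))  # exact after a single step
```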
2.2. Online learning

The learning framework differs slightly from the optimization one. The function $f$ we wish to minimize (which we call the "cost function") is defined as the expected value of a function $L$ under a distribution $p$, that is

$$f(\theta) = \int_x L(\theta, x)\, p(x)\, dx \qquad (1)$$

and we have access only to samples $x_i$ drawn from $p$. If we have $n$ samples, we can define a new function

$$\hat{f}(\theta) = \frac{1}{n} \sum_i L(\theta, x_i). \qquad (2)$$

Let us call $f$ the test cost and $\hat{f}$ the training cost. The $x_i$ are the training data. As $n$ goes to infinity, the difference between $f$ and $\hat{f}$ vanishes.

Bottou & Bousquet (2008) study the case where one has access to a potentially infinite amount of training data but only a finite amount of time. This setting, which they dub large-scale learning, calls for a trade-off between the quality of the optimization for each datapoint and the number of datapoints treated. They showed that:

1. good optimization algorithms may be poor learning algorithms;
2. stochastic gradient descent enjoys a faster convergence rate than batch gradient descent;
3. introducing second-order information can win us a constant factor (the condition parameter).

Therefore, the choice lies between first- and second-order stochastic gradient descent, depending on the additional cost of taking second-order information into account and on the condition parameter. Recently, several authors have developed algorithms allowing for efficient use of this second-order information in a stochastic setting (Schraudolph et al., 2007; Bordes et al., 2009). However, we argue, all of these methods are derived from optimization methods without taking into account the particular nature of the learning problem.

2.3. Taking uncertainty into account

To our knowledge, the first paper explicitly accounting for the uncertainty of the gradient computed on the training set is (Le Roux et al., 2008). The argument is as follows. With $f$ and $\hat{f}$ as above, we write the gradient of $f$ as

$$g = \frac{df}{d\theta} = \int_x \frac{\partial L(\theta, x)}{\partial \theta}\, p(x)\, dx \qquad (3)$$

and the gradient of $\hat{f}$ as

$$\hat{g} = \frac{d\hat{f}}{d\theta} = \frac{1}{n} \sum_i \frac{\partial L(\theta, x_i)}{\partial \theta}, \qquad (4)$$

where the dependence of $g$ and $\hat{g}$ on $\theta$ has been omitted to keep the notation uncluttered. We may think of $g$ as the "true" gradient of $f$, and of $\hat{g}$ as an "empirical" gradient of $f$, which we view as the mean of a set of samples drawn from a distribution with true mean $g$. If the training samples are iid, and given the "large-scale" assumption of large $n$, we can use the central-limit theorem, which yields

$$\hat{g} \mid g \sim N\!\left(g, \frac{C}{n}\right) \qquad (5)$$

where

$$C = \int_x \left(\frac{\partial L(\theta, x)}{\partial \theta} - g\right)\left(\frac{\partial L(\theta, x)}{\partial \theta} - g\right)^T p(x)\, dx \qquad (6)$$

is the true covariance matrix of the gradients. Relaxing eq. 5 to finite $n$ and defining an isotropic Gaussian prior over $g$,

$$g \sim N(0, \sigma^2 I), \qquad (7)$$

yields the following posterior distribution over the true gradient:

$$g \mid \hat{g} \sim N\!\left(\left[I + \frac{C}{n\sigma^2}\right]^{-1} \hat{g},\ \left[nC^{-1} + \sigma^{-2} I\right]^{-1}\right). \qquad (8)$$

Replacing the true covariance $C$ by the empirical covariance $\hat{C}$ defined as

$$\hat{C} = \frac{1}{n} \sum_i \left(\frac{\partial L(\theta, x_i)}{\partial \theta} - \hat{g}\right)\left(\frac{\partial L(\theta, x_i)}{\partial \theta} - \hat{g}\right)^T, \qquad (9)$$

the direction of maximum expected gain becomes

$$\left[I + \frac{\hat{C}}{n\sigma^2}\right]^{-1} \hat{g}, \qquad (10)$$

reminiscent of the natural gradient (Amari, 1998), with two differences:

· $\hat{C}$ is here the centered covariance matrix, whereas (Amari, 1998) uses the uncentered one;
· when the number of datapoints $n$ goes to infinity, the effect of the covariance matrix vanishes. This is understandable as, in that case, $f$ and $\hat{f}$ are equal, and so are $g$ and $\hat{g}$.

Le Roux et al. (2008) report large speedups on various neural network problems. Intuitively, one can understand why using the covariance may be beneficial. Indeed, it seems wasteful to compute the gradient over many datapoints and only keep their mean. While averaging allows for greater accuracy, one would think that more could (and should) be kept from these computations.
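As a concrete reading of eq. 10, the short sketch below (our illustration; $n$, $\sigma^2$ and the per-sample gradients are made-up inputs) computes the regularized natural-gradient direction from a batch of per-sample gradients.

```python
import numpy as np

def natural_gradient_direction(grads, sigma2):
    """Direction of eq. 10: [I + C_hat/(n*sigma2)]^{-1} g_hat.

    grads: (n, d) array whose rows are per-sample gradients dL(theta, x_i)/dtheta.
    sigma2: variance of the isotropic prior over the true gradient (eq. 7).
    """
    n, d = grads.shape
    g_hat = grads.mean(axis=0)             # empirical gradient (eq. 4)
    centered = grads - g_hat
    C_hat = centered.T @ centered / n      # centered covariance (eq. 9)
    return np.linalg.solve(np.eye(d) + C_hat / (n * sigma2), g_hat)

# Made-up example: 100 noisy gradients in 5 dimensions.
rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=2.0, size=(100, 5))
print(natural_gradient_direction(grads, sigma2=1.0))
```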
2.4. Natural gradient is not an approximation to Newton

Before moving on to the core of the paper, we clarify the links between the natural gradient and the Newton method, as this should help the reader understand the advantage one can gain from using both.

2.4.1. Similarities

Maximum likelihood: Let us assume that we are training a density model by minimizing the negative log-likelihood. The cost function $f_{nll}$ is defined by

$$f_{nll}(\theta) = -\int_x \log[L(\theta, x)]\, p(x)\, dx. \qquad (11)$$

Note that this $L$ is not the same as the $L$ used in sections 2.2 and 2.3. Let us assume that there is a set of parameters $\theta^*$ such that our model is perfect, and that we are at this $\theta^*$. Then the covariance matrix of the gradients at that point is equal to the Hessian of $f_{nll}$. In the general case, this equality does not hold.

Gauss-Newton: Gauss-Newton is an approximation to the Newton method when $f$ can be written as a sum of squared residuals:

$$f(\theta) = \frac{1}{2} \sum_i f_i(\theta)^2. \qquad (12)$$

Computing the Hessian of $f$ yields

$$\nabla^2 f(\theta) = \sum_i f_i(\theta)\, \nabla^2 f_i(\theta) + \sum_i \nabla f_i(\theta)\, \nabla f_i(\theta)^T. \qquad (13)$$

If the $f_i$ get close to 0 (relative to their gradient), the first term may be ignored, yielding the following approximation to the Hessian:

$$H \approx \sum_i \nabla f_i(\theta)\, \nabla f_i(\theta)^T. \qquad (14)$$

One, however, must be aware that:

· this approximation is only interesting when the $f_i$ are residuals (that is, when the approximation is valid close to the optimum);
· the gradients involved are those of the $f_i$ and not of the $f_i^2$;
· the term on the right-hand side is the uncentered covariance of these gradients.

In order to compare the result of eq. 14 to the natural gradient, we will assume that the sum in eq. 12 is over datapoints, that is

$$f(\theta) = \frac{1}{2} \sum_i f_i(\theta)^2 = \frac{1}{N} \sum_i L(\theta, x_i) \qquad (15)$$

with the cost for each datapoint being

$$L(\theta, x_i) = \frac{N}{2} f_i(\theta)^2. \qquad (16)$$

The gradient of this cost with respect to $\theta$ is

$$g_i = \frac{\partial L(\theta, x_i)}{\partial \theta} = N f_i(\theta)\, \frac{\partial f_i(\theta)}{\partial \theta}. \qquad (17)$$

At the optimum (where the average of the gradients is zero and the centered and uncentered covariance matrices are equal), the covariance matrix of the $g_i$'s is

$$C = \sum_i g_i g_i^T = N^2 \sum_i f_i(\theta)^2\, \nabla f_i(\theta)\, \nabla f_i(\theta)^T, \qquad (18)$$

which is a weighted sum of the terms involved in eq. 14. Thus the natural gradient and the Gauss-Newton approximation, while related, are different quantities, and (as we show) have very different properties.

2.4.2. Differences

Remember what the Hessian is: a measure of the change in gradient when we move in parameter space. In other words, the Hessian helps to answer the question: if I were at a slightly different position in parameter space, how different would the gradient be? It is a quantity defined for any (twice differentiable) function.

On the other hand, the covariance matrix of the gradients captures the uncertainty around this particular choice of training data, i.e. the change in gradient when we move in input space. In other words, the covariance helps us to answer the question: if I had slightly different training data, how different would the gradient be? This quantity only makes sense when there are training data.

Whereas the Hessian seems naturally suited to optimization problems (it allows us to be less short-sighted when minimizing a function), the covariance matrix unleashes its power in the learning setting, where we are only given a subset of the data. From this observation, it seems natural to combine these two matrices.
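This distinction is easy to check numerically. The sketch below (ours; the residuals are synthetic) builds both matrices away from the optimum for a toy least-squares problem: the Gauss-Newton matrix of eq. 14 and the centered covariance of the per-sample gradients $g_i = N f_i \nabla f_i$ of eq. 17. Away from the optimum they generally differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
J = rng.normal(size=(N, d))        # row i holds grad f_i (linear residuals)
theta = rng.normal(size=d)
y = rng.normal(size=N)
f = J @ theta - y                  # residuals f_i(theta), not at the optimum

# Gauss-Newton approximation to the Hessian (eq. 14).
gauss_newton = J.T @ J

# Per-sample gradients of the costs L(theta, x_i) = (N/2) f_i^2 (eq. 17).
g = N * f[:, None] * J
g_bar = g.mean(axis=0)
centered_cov = (g - g_bar).T @ (g - g_bar) / N   # analogue of eq. 9

# Away from the optimum these two matrices are not proportional.
print(np.linalg.norm(gauss_newton))
print(np.linalg.norm(centered_cov))
```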
3. Combining the Newton method and natural gradient

Building upon (Le Roux et al., 2008), we now show how to use Hessian information within the natural gradient. The Newton method assumes that the cost function is locally quadratic, i.e.

$$f(\theta) = \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*) \qquad (19)$$

for some value of $\theta^*$. In the case of a learning problem, this would translate to

$$f(\theta) = \int_x L(\theta, x)\, p(x)\, dx = \int_x \frac{1}{2} (\theta - x)^T H (\theta - x)\, p(x)\, dx \qquad (20)$$

with $\theta^* = \int_x x\, p(x)\, dx$. Here we make the assumption that $H$ depends only weakly on $x$, a common assumption in online second-order methods. The derivative of this cost is

$$g(\theta) = \int_x \frac{\partial L(\theta, x)}{\partial \theta}\, p(x)\, dx = H(\theta - \theta^*). \qquad (21)$$

We can see that, in the context of a quadratic function, the isotropic prior over $g$ proposed in eq. 7 is erroneous, as $g$ is clearly influenced by $H$. We shall rather consider an isotropic Gaussian prior on the quantity $\theta - \theta^*$, as we do not have any information about the position of $\theta$ relative to $\theta^*$. The resulting prior distribution over $g$ is

$$g \sim N(0, \sigma^2 H^2) \qquad (22)$$

where we omitted the dependence on $\theta$ to keep the notation uncluttered.

In a similar fashion to section 2.3, we will suppose that we are only given a finite training set composed of $n$ datapoints $x_i$ with associated gradients $g_i$. The empirical gradient $\hat{g}$ is the mean of the $g_i$'s. Using the central-limit theorem, we again have

$$\hat{g} \mid g \sim N\!\left(g, \frac{C}{n}\right) \qquad (23)$$

where $C$ is the true covariance of the gradients, i.e.

$$C = \int_x \left(\frac{\partial L(\theta, x)}{\partial \theta} - g\right)\left(\frac{\partial L(\theta, x)}{\partial \theta} - g\right)^T p(x)\, dx. \qquad (24)$$

Therefore, the posterior distribution over $g$ is

$$g \mid \hat{g} \sim N\!\left(\left[I + \frac{C H^{-2}}{n\sigma^2}\right]^{-1} \hat{g},\ \left[\frac{H^{-2}}{\sigma^2} + nC^{-1}\right]^{-1}\right). \qquad (25)$$

Since the function is locally quadratic, we wish to move in the direction $H^{-1} g$. This direction follows a Gaussian distribution with mean

$$\left[I + \frac{H^{-1} C H^{-1}}{n\sigma^2}\right]^{-1} H^{-1} \hat{g} \qquad (26)$$

and covariance

$$\left[\frac{I}{\sigma^2} + n H C^{-1} H\right]^{-1}. \qquad (27)$$

Once again, we shall replace the true covariance matrix $C$ of the gradients by its empirical counterpart, $\hat{C}$.

Since eq. 26 appears complicated, we shall explain it. Let us define the Newton directions $d_i$:

$$d_i = H^{-1} g_i. \qquad (28)$$

Since $C$ is the covariance matrix of the gradients $g_i$, $H^{-1} C H^{-1} = D$ is the covariance matrix of the $d_i$'s. We can therefore rewrite

$$H^{-1} g \mid \hat{g} \sim N\!\left(\left[I + \frac{D}{n\sigma^2}\right]^{-1} \hat{d},\ \left[\frac{I}{\sigma^2} + nD^{-1}\right]^{-1}\right) \qquad (29)$$

where $\hat{d}$ is the average of the Newton directions, i.e. $\hat{d} = H^{-1} \hat{g}$. The direction which maximizes the expected gain is thus

$$-\left[I + \frac{D}{n\sigma^2}\right]^{-1} \hat{d}. \qquad (30)$$

This formula is exactly the (regularized) natural gradient, but applied to the Newton directions. This is good news, as it means that one may choose one's favorite second-order gradient descent method (for instance, SGD-QN (Bordes et al., 2009)) to compute the Newton directions, and then one's favorite natural gradient algorithm (for instance, TONGA (Le Roux et al., 2008)) to apply to these Newton directions, yielding an algorithm combining the advantages of both methods.

As a side note, one can see that, as the number $n$ of datapoints used to compute the mean increases, the prior vanishes and the posterior distribution concentrates around the empirical Newton direction.

As mentioned in section 2.2, online methods are faster than batch methods. Thus, we will update the parameters of our model after each datapoint, replacing the empirical mean $\hat{d}$ and covariance $D$ in eq. 30 by running averages, as detailed in section 3.2.
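In batch form, eq. 30 is straightforward to compute. The sketch below (our illustration; $H$ and the per-sample gradients are synthetic) forms the Newton directions, their mean and covariance, and the resulting combined direction.

```python
import numpy as np

def natural_newton_direction(grads, H, sigma2):
    """Direction of eq. 30: -[I + D/(n*sigma2)]^{-1} d_hat.

    grads: (n, d) array of per-sample gradients g_i.
    H: (d, d) Hessian (or an approximation thereof).
    sigma2: variance of the prior on theta - theta^* (eq. 22).
    """
    n, dim = grads.shape
    d = np.linalg.solve(H, grads.T).T   # Newton directions d_i = H^{-1} g_i (eq. 28)
    d_hat = d.mean(axis=0)              # empirical Newton direction
    centered = d - d_hat
    D = centered.T @ centered / n       # covariance of the d_i's
    return -np.linalg.solve(np.eye(dim) + D / (n * sigma2), d_hat)

# Synthetic example in 4 dimensions.
rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 4))
H = Q @ Q.T + 4 * np.eye(4)             # a positive-definite "Hessian"
grads = rng.normal(loc=0.5, size=(50, 4))
print(natural_newton_direction(grads, H, sigma2=0.1))
```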
We shall now analyze several components of this method.

3.1. Setting a zero-centered prior at each timestep

Eq. 29 has been obtained using the zero-centered Gaussian prior defined in eq. 22. Except for the first update, one may wonder why we would use such a distribution rather than the posterior distribution at the previous timestep as our prior. The reason is that, whenever we update the parameters of our model, the distribution over the gradients changes. If the function to optimize were truly quadratic, we could exactly quantify the change in gradient using our approximation of the Hessian. Unfortunately, this is not the case. Thus, while acknowledging that using the prior of eq. 22 at every timestep is a suboptimal strategy, we believe there is still something to be gained while retaining the simplicity of the algorithm.

3.2. Exponentially moving covariance matrix

Since efficiency is our main goal, we need a fast way to update the covariance matrix of the Newton directions that progressively "forgets" about older datapoints. For that purpose, we shall use an exponentially moving mean and covariance, namely

$$\tau_n = \sum_{i=1}^n \gamma^{n-i} \qquad (31)$$

$$\mu_n = \frac{1}{\tau_n} \sum_{i=1}^n \gamma^{n-i} d_i \qquad (32)$$

$$= \frac{(\tau_n - 1)\, \mu_{n-1} + d_n}{\tau_n} \qquad (33)$$

$$D_n = \frac{\sum_{i=1}^n \gamma^{n-i} (d_i - \mu_n)(d_i - \mu_n)^T}{\tau_n - \frac{1}{\tau_n}\sum_{i=1}^n \gamma^{2(n-i)}} \qquad (34)$$

where $d_i$ is the Newton direction obtained at timestep $i$ and $\gamma$ is the discount factor. The closer $\gamma$ is to 1, the longer examples will influence the mean and the covariance. Introducing $U_n$, the uncentered covariance matrix at timestep $n$, we can easily update $D_n$ using

$$U_n = \frac{\sum_{i=1}^n \gamma^{n-i}\, d_i d_i^T}{\tau_n} \qquad (35)$$

$$= \frac{(\tau_n - 1)\, U_{n-1} + d_n d_n^T}{\tau_n} \qquad (36)$$

$$D_n = U_n - \mu_n \mu_n^T. \qquad (37)$$

Therefore, to update $D_n$, one first computes the new $\tau_n$, then computes $\mu_n$ and $U_n$, which are combined to yield $D_n$.

If the number of parameters is large, computing a full covariance matrix would be too costly. Le Roux et al. (2008) propose an efficient way of computing a low-rank approximation of the covariance matrix. Though their method is for an uncentered covariance matrix, it can be modified to accommodate centered covariance matrices.
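A minimal sketch of this bookkeeping (ours; a dense version, without the low-rank trick) follows eqs. 33, 36 and 37: each new Newton direction updates the normalizer, the mean and the uncentered covariance in $O(d^2)$.

```python
import numpy as np

class MovingCovariance:
    """Exponentially moving mean and covariance of Newton directions
    (a dense sketch of eqs. 31-37; gamma is the discount factor)."""

    def __init__(self, dim, gamma=0.995):
        self.gamma = gamma
        self.tau = 0.0                   # tau_n = sum of gamma^{n-i} (eq. 31)
        self.mu = np.zeros(dim)          # mu_n, the moving mean (eq. 33)
        self.U = np.zeros((dim, dim))    # U_n, the uncentered covariance (eq. 36)

    def update(self, d):
        self.tau = self.gamma * self.tau + 1.0
        self.mu = ((self.tau - 1.0) * self.mu + d) / self.tau
        self.U = ((self.tau - 1.0) * self.U + np.outer(d, d)) / self.tau

    def D(self):
        return self.U - np.outer(self.mu, self.mu)   # centering step (eq. 37)

# Usage with made-up directions:
mc = MovingCovariance(dim=3)
rng = np.random.default_rng(3)
for _ in range(100):
    mc.update(rng.normal(size=3))
print(mc.D())
```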
3.3. Frequency of updates

The covariance matrix of the gradients changes very slowly. Therefore, one does not need to update it as often as the Hessian approximation. In the SGD-QN algorithm, the authors introduce a counter skip which specifies how many gradient updates are done before the approximation to the Hessian is updated. We introduce an additional variable skipC which specifies how many Hessian approximation updates are done before updating the covariance approximation. The total number of gradient updates between two covariance approximation updates is therefore skip · skipC. Experiments using the validation set showed that using values of skipC lower than 8 did not yield any improvement while increasing the cost of each update. We therefore used this value in all our experiments. This allows us to use the information contained in the covariance with very little computational overhead.

3.4. Limiting the influence of the covariance

Eq. 29 tells us that the direction to follow is

$$\left[I + \frac{D}{n\sigma^2}\right]^{-1} \hat{d}. \qquad (38)$$

The only unknown in this formula is $\sigma^2$, which is the variance of our Gaussian prior on $\theta - \theta^*$. To avoid having to set this quantity by hand at every timestep, we will devise a heuristic to find a sensible value of $\sigma^2$. While this will lack the purity of a full Bayesian treatment, it will allow us to reduce the number of parameters to be set by hand, which we think is a valuable feature of any gradient descent algorithm.

If we knew the distance from our position in parameter space, $\theta$, to the optimal solution, $\theta^*$, then the optimal value for $\sigma^2$ would be $\|\theta - \theta^*\|^2$. Of course, this information is not available to us. However, if the function to optimize were truly quadratic, the squared norm of the Newton direction would be exactly $\|\theta - \theta^*\|^2$. We shall therefore replace $\sigma^2$ by the squared norm of the last computed Newton direction. Since this estimate may be too noisy, we will replace it by the squared norm of the running average of the Newton directions, i.e. $\|\mu_n\|^2$.

However, even then, we may still get undesirable variations. We shall therefore adopt a conservative strategy: we will set an upper bound on the correction to the Newton method brought by eq. 38. More precisely, we will bound the eigenvalues of $\frac{D_n}{n\|\mu_n\|^2}$ by a positive number $B$. The parameter update then becomes

$$\theta_n - \theta_{n-1} = -\left[I + \min\!\left(B, \frac{D_n}{n\|\mu_n\|^2}\right)\right]^{-1} H^{-1} \hat{g}_n \qquad (39)$$

where $B$ is a hyperparameter and $\min(B, M)$ is defined for symmetric matrices $M$ with eigenvectors $u_1, \ldots, u_n$ and eigenvalues $\lambda_1, \ldots, \lambda_n$ as

$$\min(B, M) = \sum_{i=1}^n \min(B, \lambda_i)\, u_i u_i^T \qquad (40)$$

(we bound each eigenvalue of $M$ by $B$). If we set $B = 0$, we recover the standard Newton method. This modification transforms the algorithm in a conservative way, trading off potential gains brought by the covariance matrix against the guarantee that the parameter update will not differ too much from the Newton direction. The pseudo-code for the algorithm is shown in Algorithm 1.
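Before the full pseudo-code, a dense sketch (ours; synthetic inputs, no learning-rate schedule) of the clipped update of eqs. 39-40: the eigenvalues of the correction term are bounded before the regularized solve.

```python
import numpy as np

def clip_eigenvalues(M, B):
    """min(B, M) of eq. 40: bound each eigenvalue of symmetric M by B."""
    lam, U = np.linalg.eigh(M)
    return (U * np.minimum(lam, B)) @ U.T

def bounded_update(g_hat, H, D, mu, n, B):
    """Parameter update of eq. 39 (without the learning-rate schedule)."""
    correction = clip_eigenvalues(D / (n * (mu @ mu)), B)
    newton = np.linalg.solve(H, g_hat)
    return -np.linalg.solve(np.eye(len(g_hat)) + correction, newton)

# Synthetic check: with B = 0 we recover the plain Newton step.
rng = np.random.default_rng(4)
H = np.diag([2.0, 5.0])
D = np.array([[1.0, 0.2], [0.2, 0.5]])
g_hat, mu = rng.normal(size=2), np.array([0.3, -0.1])
assert np.allclose(bounded_update(g_hat, H, D, mu, n=100, B=0.0),
                   -np.linalg.solve(H, g_hat))
```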
Algorithm 1 Simplified pseudo-code of the Natural-Newton algorithm

Require: skip (number of gradient updates between Hessian updates)
Require: skipC (number of Hessian updates between covariance updates)
Require: θ₀ (the original set of parameters)
Require: γ (the discount factor for the moving covariance)
Require: T (the total number of epochs)
Require: t₀, λ (the weight decay)
 1: t = 0, count = skip, countC = skipC
 2: τ₀ = 0, ν₀ = 0
 3: H = λ⁻¹ I, D = H
 4: μ₁ = 0 (the running mean vector), C₁ = 0 (the running covariance matrix)
 5: while t ≠ T do
 6:   g_t ← ∂L(θ_t, x_t, y_t)/∂θ_t
 7:   θ_{t+1} ← θ_t − (t + t₀)⁻¹ D g_t
 8:   if count == 0 then
 9:     count ← skip
10:     Update H, the approximation to the Hessian, according to the SGD-QN algorithm
11:     if countC == 0 then
12:       countC ← skipC
13:       τ_{t+1} ← γ τ_t + 1
14:       ν_{t+1} ← γ² ν_t + 1
15:       μ_{t+1} ← ((τ_{t+1} − 1) μ_t + d_t) / τ_{t+1}, with d_t = H⁻¹ g_t
16:       C_{t+1} ← ((τ_{t+1} − 1)(C_t + μ_t μ_tᵀ) + d_t d_tᵀ) / τ_{t+1} − μ_{t+1} μ_{t+1}ᵀ
17:       D ← [I + C_{t+1} / (N ‖μ_{t+1}‖²)]⁻¹ H⁻¹ (N the effective number of datapoints, cf. eq. 39)
18:     else
19:       countC ← countC − 1
20:     end if
21:   else
22:     count ← count − 1
23:   end if
24:   t ← t + 1
25: end while

4. Experiments

4.1. Algorithms chosen

Our algorithm requires two independent components:

1. an approximation to the Newton method, to obtain the Newton directions;
2. an approximation to the natural gradient, to be applied to these Newton directions.

In these experiments, the former was chosen to be SGD-QN (Bordes et al., 2009), since it recently won the Wild Track competition at the Pascal Large Scale Learning Challenge. Since this method uses a diagonal approximation to the Hessian, we decided to use a diagonal approximation to the covariance matrix. Though this was not required and we could have used a low-rank covariance matrix, using a diagonal approximation shows the improvements over the original method one can obtain with little extra effort.

4.2. Experimental setup

Experiments were run on datasets of the Pascal Large Scale Learning Challenge, namely the Alpha, Gamma, Delta, Epsilon, Zeta and Face datasets. Labels were only available for the training examples of the challenge. We therefore split these examples into several sets:

· the first 100K (1M for the Face dataset) examples constituted our training set;
· the last 100K (1M for the Face dataset) examples constituted our test set.

The last 50K (500K for the Face dataset) examples of the training set were used as validation examples to tune the bound B defined in eq. 39. The same value of B was used for TONGA.

4.3. Parameter tuning

In all the experiments, γ was set to 0.995, following (Le Roux et al., 2008). To test the sensitivity of the algorithm to this parameter, we tried other values (0.999, 0.992, 0.99 and 0.9) without noticing any significant difference in validation errors.

We optimized the bound on the covariance (§3.4) on the validation set. The best value was chosen for the test set, but we found that a value of 2 yielded near-optimal results on all datasets, the differences between B = 1, B = 2 and B = 5 being minimal, as shown in figure 1 in the case of the Alpha dataset.

4.4. Results

[Figure 1. Validation error vs. time on the Alpha dataset, for various values of B (curves for B = 1, 2, 5, 10).]

[Figure 2. Test error vs. time on the Alpha dataset (SGD, TONGA, SGD-QN, Natural-Newton).]

[Figure 3. Test error vs. time on the Gamma dataset (SGD, TONGA, SGD-QN, Natural-Newton).]

[Figure 4. Test error vs. time on the Delta dataset (SGD, TONGA, SGD-QN, Natural-Newton).]

[Figure 5. Test error vs. time on the Epsilon dataset (SGD, TONGA, SGD-QN, Natural-Newton).]

Several conclusions may be drawn from these experiments:

· Natural-Newton never performs worse than SGD-QN and always better than TONGA. Using a large value of skipC ensures that the overhead of using the covariance matrix is negligible;
· on the Alpha dataset, using the information contained in the covariance resulted in significantly faster convergence, with or without second-order information;
· on the Epsilon, Zeta and Face datasets, using the covariance information stabilizes the results while yielding the same convergence speed. This is in accordance with the use of the covariance, which reduces the influence of directions where gradients vary wildly;
· on the Gamma and Delta datasets, the covariance information helped a lot when the Hessian was not used, yielding no improvement otherwise.
[Figure 6. Test error vs. time on the Zeta dataset (SGD, TONGA, SGD-QN, Natural-Newton).]

[Figure 7. Test error vs. time on the Face dataset (SGD-QN, Natural-Newton).]

5. Conclusion

A lot of effort has been put into designing efficient online optimization algorithms, with great results. Most of these algorithms rely on some approximation to the Hessian or to the covariance matrix of the gradients. While the latter is commonly believed to be an approximation of the former, we showed that they encode very different kinds of information. Based on this, we proposed a way of combining the information contained in the Hessian and in the covariance matrix of the gradients. Experiments showed that, on most datasets, our method offered either faster convergence or increased robustness compared to the original algorithm. Furthermore, our algorithm never performed worse than the Newton algorithm it was built upon. Moreover, our algorithm is able to use any existing second-order algorithm as its base method. Therefore, while we used SGD-QN for our experiments, one may pick whichever algorithm is best suited for a given task.

We hope to have shown two things. Firstly, the covariance matrix of the gradients is usefully viewed, not as an approximation to the Hessian, but as a source of additional information about the problem, for typical "machine learning" objective functions. Secondly, it is possible, with little extra effort, to use this information in addition to that provided by the Hessian matrix, yielding faster or more robust convergence.

Despite these successes, we believe that our algorithm may be improved in several ways, whether by retaining some of the information contained in the posterior distribution between timesteps or in the selection of the parameter σ².

References

Amari, Shun-ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Bordes, Antoine, Bottou, Léon, and Gallinari, Patrick. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, July 2009.

Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems, volume 20, pp. 161–168, 2008.

Le Roux, Nicolas, Manzagol, Pierre-Antoine, and Bengio, Yoshua. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, pp. 849–856. MIT Press, Cambridge, MA, 2008.

Nocedal, J. and Wright, S. J. Numerical Optimization, Second Edition. Springer Verlag, New York, 2006.

Schraudolph, Nicol N., Yu, Jin, and Günter, Simon. A stochastic quasi-Newton method for online convex optimization. In Proceedings of AISTATS 2007, Puerto Rico, 2007.