Sparse Multiscale Gaussian Process Regression

Christian Walder, Kwang In Kim, Bernhard Schölkopf
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
christian.walder@tuebingen.mpg.de, kimki@tuebingen.mpg.de, bernhard.schoelkopf@tuebingen.mpg.de

Abstract

Most existing sparse Gaussian process (g.p.) models seek computational advantages by basing their computations on a set of m basis functions that are the covariance function of the g.p. with one of its two inputs fixed. We generalise this for the case of the Gaussian covariance function, by basing our computations on m Gaussian basis functions with arbitrary diagonal covariance matrices (or length scales). For a fixed number of basis functions and any given criterion, this additional flexibility permits approximations no worse and typically better than was previously possible. We perform gradient based optimisation of the marginal likelihood, which costs $O(m^2 n)$ time, where $n$ is the number of data points, and compare the method to various other sparse g.p. methods. Although we focus on g.p. regression, the central idea is applicable to all kernel based algorithms, and we also provide some results for the support vector machine (s.v.m.) and kernel ridge regression (k.r.r.). Our approach outperforms the other methods, particularly for the case of very few basis functions, i.e. a very high sparsity ratio.

1. Introduction
The Gaussian process (g.p.) is a popular nonparametric model for supervised learning problems. Although g.p.'s have been shown to perform well on a wide range of tasks, their usefulness is severely limited by the $O(n^3)$ time and $O(n^2)$ storage requirements, where $n$ is the number of data points. A large amount of work has been done to alleviate this problem, either by approximating the posterior distribution, or by constructing degenerate covariance functions for which the exact posterior is less expensive to evaluate (Smola & Bartlett, 2000; Csató & Opper, 2002; Lawrence et al., 2002; Seeger et al., 2003; Snelson & Ghahramani, 2006); for a unifying overview see (Quiñonero-Candela & Rasmussen, 2005). The majority of such methods achieve an $O(m^2 n)$ training time complexity, where $m \ll n$ is the number of points on which the computations are based.

The g.p. can be interpreted as a linear (in the parameters) model which, due to its non-parametric nature, has potentially as many parameters to estimate as there are training points. An exception is the case where the covariance function has finite rank, such as the linear covariance function on $\mathbb{R}^d \times \mathbb{R}^d$ given by $k(x, x') = x^\top x'$, which has rank $d$. In this case the g.p. collapses to a parametric method and it is possible to derive algorithms with $O(d^2 n)$ time complexity by basing the computations on $d$ basis functions. For non-degenerate covariance functions, most existing sparse g.p. algorithms have in common that they base their computations on $m$ basis functions of the form $k(v_i, \cdot)$. Typically the set $V = \{v_1, v_2, \ldots, v_m\}$ is taken to be a subset of the training set (Smola & Bartlett, 2000; Csató & Opper, 2002; Seeger et al., 2003). For example, Seeger et al. (2003) employ a highly efficient approximate information gain criterion to incrementally select points from the training set in a greedy manner. More recently, Snelson and Ghahramani (2006) have shown that further improvements in the quality of the model for a given $m$ can be made -- especially for small $m$ -- by removing the restriction that $V$ be a subset of the training set. For this they introduced a new sparse g.p. model which has the advantage of being closer to the full g.p., and also of being more amenable to gradient based optimisation of the marginal likelihood with respect to the set $V$. A further advantage of their continuous optimisation of $V$ is that the hyper-parameters of the model can be optimised at the same time -- this is more difficult when $V$ is taken to be a subset of the training set, since choosing such a subset is a hard combinatorial problem.

In this paper we take a logical step forward in the development of sparse g.p. algorithms. We also base our computations on a finite set of basis functions, but remove the restriction that the basis functions be of the form $k(v_i, \cdot)$, where $k$ is the covariance of the g.p. This will require computing integrals involving the basis and covariance functions, and so cannot always be done in closed form. Fortunately, however, closed form expressions can be obtained for arguably the most useful scenario, namely that of a Gaussian covariance function (with arbitrary diagonal covariance matrix) along with Gaussian basis functions (again each with their own arbitrary diagonal covariance matrix). The central idea is that, under some mild restrictions, we can compute the prior probability density -- under the g.p. model with Gaussian covariance -- of arbitrary Gaussian mixtures.

Our analysis is new, but there is a precedent for it in the literature. In particular, Walder et al. (2006) employ a similar idea, but from a reproducing kernel Hilbert space (r.k.h.s.) rather than a g.p. perspective, and for a different basis and covariance function. Also related is (Gehler & Franz, 2006), which analyses from a g.p. perspective with arbitrary basis and covariance functions, but with the difference that they do not take infinite limits. Our idea has a direct r.k.h.s. analogy. Indeed the main idea is applicable to any kernel machine, but in this paper we focus on the g.p. framework. The main reason for this is that it allows us to build on the sparse g.p. model of Snelson and Ghahramani (2006), which has been shown to be amenable to gradient based optimisation of the marginal likelihood. Nonetheless we do provide some experimental results for the kernel ridge regression (k.r.r.) case, as well as an animated toy example of the support vector machine (s.v.m.), in the accompanying video.

The paper is structured as follows. Section 2 provides an introduction to g.p. regression. In Section 3 we derive the likelihood of arbitrary Gaussian mixtures under the g.p. model with Gaussian covariance, and clarify the link to r.k.h.s.'s. In Section 4 we discuss and motivate the precise probabilistic model which we use to make practical use of our theoretical results. Experimental results and conclusions are presented in Sections 5 and 6, respectively.

2. Gaussian Process Regression

We assume that we are given an independent and identically distributed (i.i.d.) sample $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$ drawn from an unknown distribution, and the goal is to estimate $p(y|x)$. We introduce a latent variable $u \in \mathbb{R}$, and make the assumption that $p(y|u, x) = p(y|u)$.
Hence we can think of $y$ as a noisy realisation of $u$, which we model by $p(y|u) = \mathcal{N}(y|u, \sigma_n^2)$, where $\sigma_n$ is a hyper-parameter.[1] The relationship $x \to u$ is a random process $u(\cdot)$, namely a zero mean g.p. with covariance function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Typically $k$ will be defined in terms of further hyper-parameters. We shall denote such a g.p. by $\mathcal{G}(k)$, which is defined by the fact that its joint evaluation at a finite number of input points is a zero mean Gaussian random variable with covariance $\mathbb{E}_{f \sim \mathcal{G}(k)}[f(x)f(z)] = k(x, z)$. One can show that given the hyper-parameters, the posterior $p(\mathbf{u}|S)$, where[2] $[\mathbf{u}]_i = u(x_i)$, is
$$ p(\mathbf{u}|S) \propto p(\mathbf{u})\,\mathcal{N}(y|\mathbf{u}, \sigma_n^2 I) \propto \mathcal{N}\!\left(\mathbf{u} \,\middle|\, K_{xx}(K_{xx} + \sigma_n^2 I)^{-1} y,\; \sigma_n^2 K_{xx}(K_{xx} + \sigma_n^2 I)^{-1}\right), \qquad (1) $$
where $[K_{xx}]_{ij} = k(x_i, x_j)$. Like many authors we neglect to notate the conditioning upon the hyper-parameters, both in the above expression and for the remainder of the paper. Now, it can also be shown that the latent function $u_* = u(x_*)$ at an arbitrary test point $x_*$ is distributed according to $p(u_*|x_*, S) = \int p(u_*|x_*, S, \mathbf{u})\, p(\mathbf{u}|x_*, S)\, d\mathbf{u} = \mathcal{N}(u_*|\mu_*, \sigma_*^2)$, where
$$ \mu_* = y^\top (K_{xx} + \sigma_n^2 I)^{-1} k_*, \qquad (2) $$
$$ \sigma_*^2 = k(x_*, x_*) - k_*^\top (K_{xx} + \sigma_n^2 I)^{-1} k_*, \qquad (3) $$
and we have defined $[k_*]_i = k(x_*, x_i)$.

[1] We adopt the common convention of writing $\mathcal{N}(x|\mu, \sigma^2)$ for the probability density at $x$ of the Gaussian random variable with mean $\mu$ and variance $\sigma^2$.
[2] Square brackets with subscripts denote elements of matrices and vectors, and a colon subscript denotes an entire row or column of a matrix.

In a Bayesian setting, one places priors over the hyper-parameters and computes the hyper-posterior, but this usually involves costly numerical integration techniques. Alternatively one may fix the hyper-parameters to those obtained by maximising some criterion such as the marginal likelihood conditioned upon them, $p(y|X) = \mathcal{N}(y|0, K_{yy})$, where $X = (x_1, \ldots, x_n)$ and $K_{yy} = K_{xx} + \sigma_n^2 I$ is the covariance matrix for $y$. This can be computed using the result that
$$ 2 \log p(y|X) = -y^\top K_{yy}^{-1} y - \log |K_{yy}| + c, \qquad (4) $$
where $c$ is a term independent of the hyper-parameters. Even when one neglects the cost of choosing the hyper-parameters, however, it typically costs $O(n)$ and $O(n^2)$ time to evaluate the posterior mean and variance respectively, after an initial setup cost of $O(n^3)$.
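To make the quantities above concrete, the following is a minimal NumPy sketch of exact g.p. regression with a Gaussian covariance: it evaluates the predictive mean (2), the predictive variance (3) and, up to an additive constant, the marginal likelihood (4) via a Cholesky factorisation of $K_{yy}$. It is an illustration rather than the authors' code; the function names and default hyper-parameter values are our own choices.

```python
import numpy as np

def gauss_kernel(A, B, sigma, c=1.0):
    """c * g(x, y, sigma): a normalised Gaussian covariance with diagonal length scales sigma."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 / sigma).sum(-1)
    return c * np.prod(2.0 * np.pi * sigma) ** -0.5 * np.exp(-0.5 * d2)

def gp_regression(X, y, Xstar, sigma, c=1.0, noise=0.1):
    """Exact g.p. regression: predictive mean (2), variance (3), and 2 log p(y|X) up to a constant (4)."""
    sigma = np.asarray(sigma, dtype=float)
    n = X.shape[0]
    Kyy = gauss_kernel(X, X, sigma, c) + noise**2 * np.eye(n)          # K_yy = K_xx + sigma_n^2 I
    L = np.linalg.cholesky(Kyy)                                        # O(n^3) setup
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))                # K_yy^{-1} y
    Kstar = gauss_kernel(Xstar, X, sigma, c)                           # row i holds k_* for test point i
    mean = Kstar @ alpha                                               # eq. (2)
    V = np.linalg.solve(L, Kstar.T)
    var = c * np.prod(2.0 * np.pi * sigma) ** -0.5 - (V ** 2).sum(0)   # eq. (3); k(x*, x*) is constant here
    two_log_marginal = -y @ alpha - 2.0 * np.log(np.diag(L)).sum()     # eq. (4), dropping the constant
    return mean, var, two_log_marginal

# Example usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
mean, var, ll = gp_regression(X, y, np.linspace(-2, 2, 5)[:, None], sigma=[0.3])
```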
3. Sparse Multiscale Gaussian Process Regression

In this section we -- loosely speaking -- derive the likelihood of a mixture of Gaussians with arbitrary diagonal covariance matrices, under a g.p. prior whose covariance function is also a Gaussian with arbitrary diagonal covariance matrix.

Let $u$ be drawn from $\mathcal{G}(k)$. As we mentioned previously, this means that the vector of joint evaluations at an arbitrary ordered set of points $X = (x_1, \ldots, x_n)$ is a random variable, call it $u_X$, distributed according to
$$ p_{u_X}(u) = \mathcal{N}(u|0, K_{xx}). \qquad (5) $$
Hence by definition
$$ p_{u_X}\!\left(\sum_{i=1}^m c_i u_i\right) = |2\pi K_{xx}|^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2} \sum_{i,j=1}^m c_i c_j\, u_i^\top K_{xx}^{-1} u_j\right), $$
where $|\cdot|$ denotes the matrix determinant. Note that this is simply the probability density function (p.d.f.) of $u_X$ where we have set the argument to be $\sum_{i=1}^m c_i u_i$, for some $c_i \in \mathbb{R}$. We have done this because later we will wish to determine the likelihood of a function expressed as a summation of fixed basis functions. To this end we now consider an infinite limit of the above case. Taking the limit $n \to \infty$ of uniformly distributed points[3] $x_i$ leads to the following p.d.f. for $\mathcal{G}(k)$,
$$ p_{\mathcal{G}(k)}\!\left(\sum_{i=1}^m c_i u_i\right) = \left|2\pi k^{-1}\right|^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2} \sum_{i,j=1}^m c_i c_j\, \langle k \rangle(u_i, u_j)\right), \qquad (6) $$
where
$$ \langle k \rangle(u_i, u_j) \equiv \iint k^{-1}(x, y)\, u_i(x)\, u_j(y)\, dx\, dy. \qquad (7) $$
[3] Although any non-vanishing distribution leads to the same result.

We will discuss the factor $\left|2\pi k^{-1}\right|^{-\frac{1}{2}}$ shortly. Note that in the previous case of finite $n$, if we let $u = K_{xx}\alpha$ and assume that $K_{xx}$ is invertible, then $\alpha = K_{xx}^{-1} u$. Following this finite analogy, by $k^{-1}$ we now intend a sloppy notation for the function which, for $u = \int \alpha(x)\, k(x, \cdot)\, dx$, satisfies $\int u(x)\, k^{-1}(x, \cdot)\, dx = \alpha(\cdot)$. Hence if we define $M_k$ by $M_k \alpha = \int \alpha(x)\, k(x, \cdot)\, dx$, then $k^{-1}$ is by definition the Green's function (Roach, 1970) of $M_k$, as it satisfies
$$ \int (M_k \alpha)(x)\, k^{-1}(x, \cdot)\, dx = \alpha(\cdot). \qquad (8) $$

Let us now consider the covariance function given by $k(x, y) = c\, g(x, y, \sigma)$, where $c > 0$, $\sigma \in \mathbb{R}^d$ with $\sigma > 0$ component-wise, and $g$ is a normalised Gaussian on $\mathbb{R}^d \times \mathbb{R}^d$ with diagonal covariance matrix, that is[4]
$$ g(x, y, \sigma) = |2\pi\, \mathrm{diag}(\sigma)|^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2} \sum_{i=1}^d \frac{([x - y]_i)^2}{[\sigma]_i}\right). \qquad (9) $$
[4] We use diag in a sloppy fashion with two meanings -- for $a \in \mathbb{R}^n$, $\mathrm{diag}(a) \in \mathbb{R}^{n \times n}$ is a diagonal matrix satisfying $[\mathrm{diag}(a)]_{ii} = [a]_i$, but for $A \in \mathbb{R}^{n \times n}$, $\mathrm{diag}(A) \in \mathbb{R}^n$ is a column vector with $[\mathrm{diag}(A)]_i = [A]_{ii}$.

If we assume furthermore that our function is an arbitrary mixture of such Gaussians, so that
$$ u_i(x) = g(x, v_i, \sigma_i), \qquad (10) $$
then the well known integral for the convolution of two Gaussians,
$$ \int g(x, v_i, \sigma_i)\, g(x, v_j, \sigma_j)\, dx = g(v_i, v_j, \sigma_i + \sigma_j), \qquad (11) $$
leads to
$$ \left(M_{c\,g(\cdot,\cdot,\sigma)}\, \tfrac{1}{c}\, g(\cdot, v_i, \sigma_i - \sigma)\right)(x) = g(x, v_i, \sigma_i) = u_i(x). \qquad (12) $$
As the covariance function and the basis functions are all Gaussian, we can obtain the following closed form:
$$ \langle k \rangle(u_i, u_j) \overset{(7,10,12)}{=} \iint k^{-1}(x, y)\, g(x, v_i, \sigma_i) \left(M_{c\,g(\cdot,\cdot,\sigma)}\, \tfrac{1}{c}\, g(\cdot, v_j, \sigma_j - \sigma)\right)(y)\, dx\, dy \overset{(8)}{=} \frac{1}{c} \int g(x, v_i, \sigma_i)\, g(x, v_j, \sigma_j - \sigma)\, dx \overset{(11)}{=} \frac{1}{c}\, g(v_i, v_j, \sigma_i + \sigma_j - \sigma). $$
For clarity we have noted above each equals sign the number of the equation which implies the corresponding logical step. The following expression summarises the main idea of the present section:
$$ p_{\mathcal{G}(c\,g(\cdot,\cdot,\sigma))}\!\left(\sum_{i=1}^m c_i\, g(\cdot, v_i, \sigma_i)\right) \propto \exp\!\left(-\frac{1}{2c} \sum_{i,j=1}^m c_i c_j\, g(v_i, v_j, \sigma_i + \sigma_j - \sigma)\right). \qquad (13) $$
We give only an unnormalised form by neglecting the factor $\left|2\pi k^{-1}\right|^{-\frac{1}{2}}$ in (6). The neglected factor is equal to the inverse of the integral of the right hand side of the above expression with respect to all functions $\sum_{i=1}^m c_i\, g(\cdot, v_i, \sigma_i)$. We need not concern ourselves with choosing a measure with respect to which this integral is finite: since we will be working only with ratios of the above likelihood (i.e. for maximum a posteriori (m.a.p.) estimation and marginal likelihood maximisation), we need only the unnormalised form. Note that this peculiarity is not particular to our proposed sparse approximation to the g.p., but is a property of g.p.'s in general.
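The closed form above is straightforward to evaluate. The following sketch (our own illustrative code, not part of the paper) computes $[U]_{ij} = \langle k \rangle(u_i, u_j) = \frac{1}{c}\, g(v_i, v_j, \sigma_i + \sigma_j - \sigma)$ for a set of basis functions and the corresponding unnormalised log density of a mixture $\sum_i c_i\, g(\cdot, v_i, \sigma_i)$ according to (13).

```python
import numpy as np

def g(x, y, s):
    """Normalised Gaussian g(x, y, s) of eq. (9), for single points x, y and diagonal covariance s."""
    return np.prod(2.0 * np.pi * s) ** -0.5 * np.exp(-0.5 * np.sum((x - y) ** 2 / s))

def inner_matrix(V, Sig, sigma, c=1.0):
    """[U]_{ij} = <k>(u_i, u_j) = g(v_i, v_j, sigma_i + sigma_j - sigma) / c."""
    m = V.shape[0]
    U = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            s = Sig[i] + Sig[j] - sigma
            assert np.all(s > 0), "requires sigma_i + sigma_j > sigma component-wise"
            U[i, j] = g(V[i], V[j], s) / c
    return U

def log_prior_unnormalised(coef, V, Sig, sigma, c=1.0):
    """Unnormalised log of (13) for the mixture sum_i coef[i] * g(., v_i, sigma_i)."""
    return -0.5 * coef @ inner_matrix(V, Sig, sigma, c) @ coef

# Example: three basis functions in d = 2, covariance function c * g(., ., sigma).
rng = np.random.default_rng(1)
V = rng.normal(size=(3, 2))          # centres v_i
Sig = np.full((3, 2), 0.8)           # per-basis length scales sigma_i
sigma = np.full(2, 0.5)              # covariance length scale sigma
print(log_prior_unnormalised(np.array([1.0, -0.5, 0.2]), V, Sig, sigma, c=2.0))
```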
Interpretation. We now make two remarks regarding the expression (13). i) If $\sigma_1 = \sigma_2 = \cdots = \sigma_m = \sigma$ and we reparameterise $c_i = c\,\hat{c}_i$, then it simplifies to (5). ii) Let $c = 1$ and $h(x) = \exp\!\left(-\frac{1}{2} x^\top \mathrm{diag}(\sigma_1)^{-1} x\right)$, an unnormalised Gaussian. Using (9) and (13) we can derive the log-likelihood of $h$ under the g.p. prior,
$$ \log p_{\mathcal{G}(g(\cdot,\cdot,\sigma))}(h(\cdot)) \propto -\frac{|\mathrm{diag}(\sigma_1)|}{\sqrt{|\mathrm{diag}(2\sigma_1 - \sigma)|}}. \qquad (14) $$
Simple analysis of this expression shows that the most likely such function $h$ is that with $\sigma_1 = \sigma$. From this extremal point, as any component of $\sigma_1$ increases, the log likelihood of $h$ decreases without bound. Similarly, decreasing any component of $\sigma_1$ also decreases the log likelihood, and as any component of $\sigma_1$ approaches half the value of the corresponding component of $\sigma$, the log likelihood decreases without bound. To be more precise, we have for all $j = 1, 2, \ldots, d$ that
$$ \lim_{[\sigma_1]_j \to \left([\sigma]_j / 2\right)^+} \log p_{\mathcal{G}(g(\cdot,\cdot,\sigma))}(h(\cdot)) = -\infty. $$
An interesting consequence of the second remark is that, roughly speaking, it is not possible to recover a Gaussian function using a g.p. with Gaussian covariance if the covariance function is more than twice as broad as the function to be recovered. Although this may at first appear to contradict proven consistency results for the Gaussian covariance function (for example (Steinwart, 2002)), this is not the case. On the contrary, such results hold only for compact domains, and our analysis is for $\mathbb{R}^d$.

An r.k.h.s. Analogy. We note that (13) has a direct analogy in the theory of r.k.h.s.'s, as made clear by the following lemma. The lemma follows from (13) and the well understood relationship between every g.p. and the corresponding r.k.h.s. of functions.

Lemma 3.1. Let $\mathcal{H}$ be the r.k.h.s. with reproducing kernel $g(\cdot, \cdot, \sigma)$. If the conditions $\sigma_i > \frac{\sigma}{2}$ and $\sigma_j > \frac{\sigma}{2}$ are satisfied component-wise, then
$$ \langle g(\cdot, v_i, \sigma_i),\, g(\cdot, v_j, \sigma_j) \rangle_{\mathcal{H}} = g(v_i, v_j, \sigma_i + \sigma_j - \sigma). \qquad (15) $$
If either condition is not satisfied, then the corresponding function on the left hand side is not in $\mathcal{H}$.

Naturally this can also be proven directly, but doing so for the general case is more involved and we omit the details due to space limitations.[5] However, if we assume that the conditions $\sigma_i > \sigma$ and $\sigma_j > \sigma$ are satisfied component-wise, then it is straightforward to obtain the main result. The basic idea is as follows. Using (11) we substitute $g(\cdot, v_p, \sigma_p) = \int g(\cdot, x_p, \sigma)\, g(x_p, v_p, \sigma_p - \sigma)\, dx_p$ for $p = i, j$ into the l.h.s. of (15). By linearity we can move the two integrals outside the inner product. Next we use the r.k.h.s. reproducing property -- the fact that $\langle f(\cdot), g(\cdot, x, \sigma) \rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$ and $x \in \mathbb{R}^d$ -- to evaluate the inner product. Using (11) we integrate to obtain the r.h.s. of (15).
[5] For ICML reviewing, we can provide proof on request.

4. Inference with the Sparse Model

4.1. A Simple Approach

In the previous section we derived the g.p. likelihood over a certain restricted function space. This likelihood defines a distribution over functions of the form $\sum_{i=1}^m c_i\, g(\cdot, v_i, \sigma_i)$, where $g$ as given previously is deterministic and the $c_i$ are, by inspection of (13), normally distributed according to
$$ c \sim \mathcal{N}(0, U^{-1}), \qquad (16) $$
where $[U]_{i,j} = \langle k \rangle(u_i, u_j)$. Let us write $U = \{u_1, \ldots, u_m\}$ (which we refer to as the basis) and refer to the random process thus defined as $\mathcal{G}_U(k)$. This new random process is equivalent to a full g.p. with covariance function of rank at most $m$ given by
$$ \mathbb{E}_{f \sim \mathcal{G}_U(k)}[f(x)f(z)] = \mathbb{E}_{c \sim \mathcal{N}(0, U^{-1})}\!\left[(u_{vx}^\top c)(u_{vz}^\top c)\right] = u_{vx}^\top U^{-1} u_{vz}, \qquad (17) $$
where $[u_{vx}]_i = g(x, v_i, \sigma_i)$ and $[u_{vz}]_i = g(z, v_i, \sigma_i)$.

As an aside, note that if we choose as the basis $U = \{g(\cdot, x, \sigma), g(\cdot, z, \sigma)\}$, then it is easy to verify using (17) that $\mathbb{E}_{f \sim \mathcal{G}_U(g(\cdot,\cdot,\sigma))}[f(x)f(z)] = \mathbb{E}_{f \sim \mathcal{G}(k)}[f(x)f(z)]$. This is analogous to a special case of the representer theorem from the theory of r.k.h.s.'s, and agrees with the interpretation that (16) is such that $\mathcal{G}_U(k)$ approximates $\mathcal{G}(k)$ well, in some sense, for the given basis $U$.
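The aside is easy to check numerically. The sketch below (again our own illustrative code, not the authors') builds the degenerate covariance (17) for an arbitrary basis and confirms that, with the basis $\{g(\cdot, x, \sigma), g(\cdot, z, \sigma)\}$ and $c = 1$, it reproduces the full covariance $g(x, z, \sigma)$.

```python
import numpy as np

def g(x, y, s):
    """Normalised Gaussian g(x, y, s) of eq. (9)."""
    return np.prod(2.0 * np.pi * s) ** -0.5 * np.exp(-0.5 * np.sum((x - y) ** 2 / s))

def cov_GU(x, z, V, Sig, sigma, c=1.0):
    """Degenerate covariance (17): E[f(x) f(z)] = u_vx^T U^{-1} u_vz under G_U(k), k = c g(., ., sigma)."""
    m = V.shape[0]
    U = np.array([[g(V[i], V[j], Sig[i] + Sig[j] - sigma) / c for j in range(m)] for i in range(m)])
    u_vx = np.array([g(x, V[i], Sig[i]) for i in range(m)])   # [u_vx]_i = g(x, v_i, sigma_i)
    u_vz = np.array([g(z, V[i], Sig[i]) for i in range(m)])
    return u_vx @ np.linalg.solve(U, u_vz)

# Check of the aside: with basis {g(., x, sigma), g(., z, sigma)} and c = 1, (17) equals g(x, z, sigma).
rng = np.random.default_rng(0)
x, z, sigma = rng.normal(size=2), rng.normal(size=2), np.full(2, 0.7)
V, Sig = np.stack([x, z]), np.tile(sigma, (2, 1))
print(np.allclose(cov_GU(x, z, V, Sig, sigma), g(x, z, sigma)))   # True
```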
Returning to the main thread, the new posterior can be derived as it was at the end of Section 2 for the exact g.p., but using the new covariance function (17). Hence after some algebra we have from (2) and (3) that, conditioned again upon the hyper-parameters, the latent function $u_* = u(x_*)$ at an arbitrary test point is distributed according to $p_{u_* \sim \mathcal{G}_U(k)}(u_*|x_*, S) = \mathcal{N}(u_*|\mu_*, \sigma_*^2)$, where
$$ \mu_* = u_{vx_*}^\top \left(U_{vx} U_{vx}^\top + \sigma_n^2 U\right)^{-1} U_{vx}\, y, \qquad (18) $$
$$ \sigma_*^2 = \sigma_n^2\, u_{vx_*}^\top \left(U_{vx} U_{vx}^\top + \sigma_n^2 U\right)^{-1} u_{vx_*}, \qquad (19) $$
and we have defined $[U_{vx}]_{i,j} = g(x_j, v_i, \sigma_i)$, etc. Note that these expressions can be evaluated in $O(m)$ and $O(m^2)$ time respectively, after an initial setup or training cost of $O(m^2 n)$. This is the usual improvement over the full g.p. obtained by such sparse approximation schemes. It turns out however that, by employing an idea introduced by Snelson and Ghahramani (2006), we can retain these computational advantages while switching to a different model that is closer to the full g.p.

4.2. Inference with Improved Variance

A fair criticism of the previous model is that the predictive variance approaches zero far away from the basis function centres $v_i$, as can be seen from (19). It turns out that this is particularly problematic for gradient based methods that choose the basis (the $v_i$ and $\sigma_i$) by maximising the marginal likelihood (Snelson & Ghahramani, 2006). An effective but still computationally attractive way of healing the model is to switch to a different g.p. -- which we denote $\tilde{\mathcal{G}}_U(k)$ -- whose covariance function satisfies
$$ \mathbb{E}_{\tilde{\mathcal{G}}_U(k)}[f(x)f(z)] = \delta_{x,z}\, k(x, z) + \bar{\delta}_{x,z}\, u_{vx}^\top U^{-1} u_{vz}, \qquad (20) $$
where $\delta_{a,b}$ is the Kronecker delta function and $\bar{\delta}_{a,b} = 1 - \delta_{a,b}$. Note that if $x = z$ then the covariance is that of the original g.p. $\mathcal{G}(k)$, otherwise it is that of $\mathcal{G}_U(k)$. Unlike (17), the prior variance in this case is the same as that of the full g.p., even though in general the covariance is not. Once again the posterior can be found as before by replacing the covariance function in (2) and (3) with the right hand side of (20). In this case we obtain after some algebra the expression $p_{u_* \sim \tilde{\mathcal{G}}_U(k)}(u_*|x_*, S) = \mathcal{N}(u_*|\mu_*, \sigma_*^2)$, where
$$ \mu_* = u_{vx_*}^\top Q^{-1} U_{vx} \left(\Lambda + \sigma_n^2 I\right)^{-1} y, \qquad (21) $$
$$ \sigma_*^2 = k(x_*, x_*) - u_{vx_*}^\top \left(U^{-1} - Q^{-1}\right) u_{vx_*}, \qquad (22) $$
where $\Lambda = \mathrm{diag}(\lambda)$ with $[\lambda]_i = k(x_i, x_i) - [U_{vx}]_{:,i}^\top U^{-1} [U_{vx}]_{:,i}$, and $Q = U + U_{vx}\left(\Lambda + \sigma_n^2 I\right)^{-1} U_{vx}^\top$.

To compute the marginal likelihood we can use the expression (4). Note that it can be computed efficiently using Cholesky decompositions. In order to optimise the marginal likelihood, we also need its gradients with respect to the various parameters. Our derivation of the gradients (which closely follows (Seeger et al., 2003)) is long and tedious, and has been omitted due to space limitations. Note that by factorising appropriately, all of the required gradients can be obtained in $O(m^2 n + mnd)$ time.
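For concreteness, here is a sketch of the predictive equations (21) and (22) as reconstructed above. It is illustrative code under our reading of the model of Section 4.2, with our own function names; the Cholesky-based marginal likelihood and the analytic gradients used for training are not included.

```python
import numpy as np

def g_mat(A, B, S):
    """Pairwise normalised Gaussians of eq. (9): entry (i, j) is g(a_i, b_j, S[i, j])."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 / S).sum(-1)
    return np.prod(2.0 * np.pi * S, axis=-1) ** -0.5 * np.exp(-0.5 * d2)

def vsgp_predict(X, y, Xstar, V, Sig, sigma, c=1.0, noise=0.1):
    """Predictive mean (21) and variance (22) of the improved-variance model, as reconstructed above."""
    m, n, d = V.shape[0], X.shape[0], X.shape[1]
    U = g_mat(V, V, Sig[:, None, :] + Sig[None, :, :] - sigma) / c          # [U]_ij = <k>(u_i, u_j)
    Uvx = g_mat(V, X, np.broadcast_to(Sig[:, None, :], (m, n, d)))          # [U_vx]_ij = g(x_j, v_i, sigma_i)
    kxx = c * np.prod(2.0 * np.pi * sigma) ** -0.5                          # k(x, x) for the Gaussian covariance
    lam = kxx - np.einsum('ij,ij->j', Uvx, np.linalg.solve(U, Uvx))         # [lambda]_i, the diagonal correction
    A = lam + noise**2                                                      # diagonal of Lambda + sigma_n^2 I
    Q = U + (Uvx / A) @ Uvx.T                                               # Q = U + U_vx (Lambda + sigma_n^2 I)^{-1} U_vx^T
    Uvs = g_mat(V, Xstar, np.broadcast_to(Sig[:, None, :], (m, Xstar.shape[0], d)))
    mean = Uvs.T @ np.linalg.solve(Q, Uvx @ (y / A))                        # eq. (21)
    var = kxx - np.einsum('ij,ij->j', Uvs, np.linalg.solve(U, Uvs) - np.linalg.solve(Q, Uvs))  # eq. (22)
    return mean, var

# Example usage with m = 4 multiscale basis functions in one dimension:
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
V, Sig, sigma = np.linspace(-2, 2, 4)[:, None], np.full((4, 1), 0.5), np.full(1, 0.3)
mean, var = vsgp_predict(X, y, np.linspace(-2, 2, 7)[:, None], V, Sig, sigma)
```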
4.3. A Unifying View

We now briefly outline how the method of the previous section fits into the unifying framework of sparse g.p.'s provided by Quiñonero-Candela and Rasmussen (2005). Using Bayes rule and marginalising out the training set latent variables $\mathbf{u}$, we obtain the posterior
$$ p(u_*|y) = \frac{1}{p(y)} \int p(y|\mathbf{u})\, p(\mathbf{u}, u_*)\, d\mathbf{u}. $$
Here we have neglected to notate conditioning on $x_*$ and $x_1, \ldots, x_n$, and have written $p$ instead of the more precise $p_{u \sim \mathcal{G}(k)}$. Our algorithm can be interpreted as employing two separate approximations. The first is conditional independence of $\mathbf{u}$ and $u_*$ given $a$, i.e.
$$ p(\mathbf{u}, u_*) = \int p(\mathbf{u}, u_*|a)\, p(a)\, da \approx \int p(\mathbf{u}|a)\, p(u_*|a)\, p(a)\, da, $$
where $a$ (which is marginalised out) is taken to be
$$ a = \left(\langle u_1, u \rangle_{\mathcal{H}},\; \langle u_2, u \rangle_{\mathcal{H}},\; \cdots,\; \langle u_m, u \rangle_{\mathcal{H}}\right)^\top, $$
the vector of inner products between the basis functions $u_i$ and the latent function $u$, in the r.k.h.s. $\mathcal{H}$ associated with $k(\cdot, \cdot)$. The second approximation is
$$ p(\mathbf{u}|a) = \mathcal{N}\!\left(U_{xv} U^{-1} a,\; K_{xx} - U_{xv} U^{-1} U_{xv}^\top\right) \approx \mathcal{N}\!\left(U_{xv} U^{-1} a,\; \mathrm{diag}\!\left(K_{xx} - U_{xv} U^{-1} U_{xv}^\top\right)\right), $$
where $\mathrm{diag}(A)$ is a diagonal matrix matching $A$ on the diagonal, and $[K_{xv}]_{i,j} = k(x_i, v_j)$, etc. Note that the first line can be shown with some algebra, whereas the second is an approximation. One can show that this leads to the result of Section 4.2, but we omit the details for brevity. Of the algorithms considered in (Quiñonero-Candela & Rasmussen, 2005), ours is closest to that of Snelson and Ghahramani (2006); however, there the basis functions take the form $u_i = k(v_i, \cdot)$, which has two implications. Firstly, $a$ simplifies to $(u(v_1), u(v_2), \cdots, u(v_m))^\top$, the vector of the values of $u$ at $v_1, \ldots, v_m$. Secondly, $U$ and $U_{xv}$ simplify to $K_{vv}$ and $K_{xv}$, respectively.

Figure 1. Predictive distributions (mean curve with ± two standard deviations shaded) for (a) spgp-full, (b) our vsgp-full, and (c) the exact g.p. For the spgp-full and vsgp-full algorithms, we plot the $(v_i, \sigma_i) \in \mathbb{R} \times \mathbb{R}$ of the basis as crossed circles. The horizontal lines denote the resulting $\sigma \in \mathbb{R}$ of the covariance function $c\,g(\cdot, \cdot, \sigma)$.

5. Experiments

Our main goal is to demonstrate the value of being able to vary the $\sigma_i$ individually. Note that the chief advantage of our method is in producing highly sparse solutions, and the results represent the state of the art in this respect. As such, and since the prediction cost is $O(md)$, we analyse the predictive performance of the model as a function of the number of basis functions $m$. Note that neither our method nor the most closely related method of Snelson and Ghahramani (2006) is particularly competitive in terms of training time. Nonetheless, there is a demand for algorithms which sacrifice training speed for testing speed, such as real-time vision and control systems, and web services in which the number of queries is large.

Let us clarify the terminology we use to refer to the various algorithms under comparison. Our new method is the variable sigma Gaussian process (v.s.g.p.). The vsgp-full variant consists of optimising the marginal likelihood with respect to the $m$ basis centres $v_i \in \mathbb{R}^d$ and length scales $\sigma_i \in \mathbb{R}^d$ of our basis functions $u_i = g(\cdot, v_i, \sigma_i)$, where $g$ is defined in (9). Also optimised are the following hyper-parameters -- the noise variance $\sigma_n^2$ of (1), and the parameters $c \in \mathbb{R}$ and $\sigma \in \mathbb{R}^d$ of our original covariance function $c\,g(\cdot, \cdot, \sigma)$. The vsgp-basis variant is identical to vsgp-full except that $\sigma_n$, $c$ and $\sigma$ are determined by optimising the marginal likelihood of a full g.p. trained on a subset of the training data, and then held fixed while the $\sigma_i$ and $v_i$ are optimised as before. Both v.s.g.p. variants use the $\tilde{\mathcal{G}}_U(k)$ probabilistic model of Section 4.2, where $k = c\,g(\cdot, \cdot, \sigma)$. For the optimisation of the sparse pseudo-input Gaussian process (s.p.g.p.) and v.s.g.p. methods we used a standard conjugate gradient type optimiser.[6] spgp-full and spgp-basis correspond to the work of Snelson and Ghahramani (2006), and are identical to their v.s.g.p. counterparts except that -- as with all sparse g.p. methods prior to the present work -- they are forced to satisfy the constraints $\sigma_i = \sigma$, $i = 1, \ldots, m$. To initialise the marginal likelihood optimisation we take the $v_i$ to be a $k$-means clustering of the training data. The other parameters are always initialised to the same sensible starting values, which is reasonable due to the preprocessing we employ (identical to that of (Seeger et al., 2003)) in order to standardise the data sets.
[6] Carl Rasmussen's minimize.m, which is freely available from http://www.kyb.mpg.de/~carl.
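As a rough sketch of the setup just described (not the authors' implementation): neg_log_marginal_likelihood below is a placeholder for minus the marginal likelihood of the Section 4.2 model, and SciPy's k-means and a generic conjugate gradient routine with numerical gradients stand in for the paper's analytic gradients and minimize.m. The hyper-parameters $\sigma_n$, $c$ and $\sigma$ are held fixed, as in vsgp-basis.

```python
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.optimize import minimize

def fit_vsgp(X, y, m, sigma, c, noise, neg_log_marginal_likelihood):
    """Two-stage fit: s.p.g.p.-style warm start (sigma_i = sigma fixed), then free per-basis sigma_i."""
    d = X.shape[1]
    V0, _ = kmeans(X, m)                                  # initialise centres v_i by k-means
    log_Sig0 = np.log(np.tile(sigma, (m, 1)))             # every sigma_i starts at sigma

    # Stage 1 (s.p.g.p.-like): optimise the centres only, with sigma_i = sigma held fixed.
    obj_v = lambda v: neg_log_marginal_likelihood(X, y, v.reshape(m, d), np.exp(log_Sig0), sigma, c, noise)
    V1 = minimize(obj_v, V0.ravel(), method='CG').x.reshape(m, d)

    # Stage 2 (v.s.g.p.): optimise centres and per-basis length scales jointly.
    def obj(theta):
        V = theta[:m * d].reshape(m, d)
        Sig = np.exp(theta[m * d:].reshape(m, d))         # log-parameterise to keep sigma_i positive
        return neg_log_marginal_likelihood(X, y, V, Sig, sigma, c, noise)
    theta = minimize(obj, np.concatenate([V1.ravel(), log_Sig0.ravel()]), method='CG').x
    return theta[:m * d].reshape(m, d), np.exp(theta[m * d:].reshape(m, d))
```

For a vsgp-full-style fit one would additionally append the logarithms of $\sigma_n$, $c$ and $\sigma$ to the stage-two parameter vector.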
Figure 1 demonstrates the basic idea on a one dimensional toy problem. Using $m = 4$ basis functions is not enough for spgp-full to infer a posterior similar to that of the full g.p. trained on the depicted $n = 200$ training points. The v.s.g.p. achieves a posterior closer to that of the full g.p. by employing -- in comparison to the full g.p. -- larger $\sigma_i$'s and a smaller $\sigma$. This leads to an effective covariance function -- that of $\tilde{\mathcal{G}}_U(k)$ as given by (17) -- which better matches that of the full g.p. depicted in Figure 1 (c). In addition to merely observing the similarity between Figures 1 (b) and (c), we verified this last statement directly by visualising $\mathbb{E}_{\tilde{\mathcal{G}}_U(k)}[f(x)f(z)]$ of (20) as a function of $x$ and $z$, but we omit the plot due to space limitations.

Figure 2. Panels: (a) pumadyn-32nm test error, (b) kin-40k test error, (c) kin-40k $\sigma_i$ deviation. Plots (a) and (b) depict the test error as a function of basis size $m$. In (c) we plot against $m$ the deviation of the $\sigma_i$ from $\sigma$, measured by the mean squared difference (see the text), for the kin-40k data set.

Figure 2 shows our experiments which, as in (Seeger et al., 2003) and (Snelson & Ghahramani, 2006), were performed on the pumadyn-32nm and kin-40k data sets.[7] Optimising the v.s.g.p. methods from a random initialisation tended to lead to inferior local optima, so we used the s.p.g.p. to find a starting point for the optimisation. This is possible because both methods optimise the same criterion, while the s.p.g.p. merely searches a subset of the space permitted by the v.s.g.p. framework. To ensure a fair comparison, we optimised the s.p.g.p. for 4000 iterations, whereas for the v.s.g.p. we first optimised the s.p.g.p. for 2000 iterations (i.e. fixing $\sigma_i = \sigma$, $i = 1, \ldots, m$), took the result as a starting point, and then optimised the v.s.g.p. for a further 2000 iterations (with the $\sigma_i$ unconstrained). We have also reproduced, with kind permission, the results of Seeger et al. (2003), and hence have used exactly the experimental methodology described therein. The results we reproduce are from the info-gain and smo-bart methods. info-gain is their own method, which is extremely cheap to train for a given set of hyper-parameters; it uses greedy subset selection based on a criterion which can be evaluated efficiently. smo-bart is similar but is based on a criterion which is more expensive to compute (Smola & Bartlett, 2000). We also show the result of training a full g.p. on a subset of the data of size 2000 and 1024 for kin-40k and pumadyn-32nm, respectively. Neither info-gain nor smo-bart estimate the hyper-parameters, but rather fix them to the values determined by optimising the marginal likelihood of the full g.p.
[7] kin-40k: 10000 training, 30000 test, 9 attributes, see www.igi.tugraz.at/aschwaig/data.html. pumadyn-32nm: 7168 training, 1024 test, 33 attributes, see www.cs.toronto/delve.
Hence they are most directly comparable to spgp-basis and vsgp-basis. However, spgp-full and vsgp-full correspond to the more difficult task of estimating the hyper-parameters at the same time as the basis. For pumadyn-32nm we do not plot spgp-basis and vsgp-basis, as the results are practically identical to spgp-full and vsgp-full. This differs from (Snelson & Ghahramani, 2006), where local minima problems with spgp-full on the pumadyn-32nm data set are reported. It is unclear why our experiments did not suffer in this way -- possible explanations are the choice of initial starting point, as well as the choice of optimisation algorithm. The results of the s.p.g.p. and v.s.g.p. methods on the pumadyn-32nm data set are very similar, but both outperform the info-gain and smo-bart approaches.

The kin-40k results are rather different. While the $\sigma_i$ deviated little from $\sigma$ on the pumadyn-32nm data set, this was not the case for kin-40k, particularly for small $m$, as seen in Figure 2 (c) where we plot
$$ \frac{1}{md} \sum_{i=1}^m \sum_{j=1}^d \left([\sigma_i - \sigma]_j\right)^2. $$
Our results are in agreement with those of (Snelson & Ghahramani, 2006) -- our vsgp-full outperforms spgp-full for small $m$, which in turn outperforms both info-gain and smo-bart. However, for large $m$ both spgp-full and vsgp-full tend to over-fit. This is to be expected due to the use of marginal likelihood optimisation, as the choice of basis $U$ is equivalent to the choice of the order of $md$ hyper-parameters for the covariance function of $\tilde{\mathcal{G}}_U(k)$. Happily, and somewhat surprisingly, the vsgp-full method tends not to over-fit more than spgp-full, in spite of its having roughly twice as many basis parameters. Neither vsgp-basis nor spgp-basis suffered from over-fitting, however, and while they both outperform info-gain and smo-bart, our vsgp-basis clearly demonstrates the advantage of our new v.s.g.p. framework by consistently outperforming spgp-basis. Finally, to emphasise the applicability of our idea to other kernel algorithms, we provide an accompanying video which visualises the optimisation of an s.v.m. using multiscale Gaussian basis functions.

6. Conclusions

Sparse g.p. regression is an important topic which has received a lot of attention in recent years.
Previous methods have based their computations on subsets of the data or pseudo input points. To relate this to our method, this is analogous to basing the computations on a set of basis functions of the form $k(v_i, \cdot)$, where $k$ is the covariance function and the $v_i$ are, for example, the pseudo input points. We have generalised this for the case of the Gaussian covariance function, by basing our computations on a set of Gaussian basis functions whose bandwidth parameters may vary independently. This provides a new avenue for approximations, applicable to all kernel based algorithms, including g.p.'s and the s.v.m., for example. To demonstrate the utility of this new degree of freedom, we have constructed sparse g.p. and k.r.r. algorithms which outperform previous methods, particularly for very sparse solutions. As such, our approach yields state of the art performance as a function of prediction time.

References

Csató, L., & Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation, 14, 641-668.

Gehler, P., & Franz, M. (2006). Implicit Wiener series, part II: Regularised estimation (Technical Report 148). Max Planck Institute for Biological Cybernetics.

Lawrence, N., Seeger, M., & Herbrich, R. (2002). Fast sparse Gaussian process methods: The informative vector machine. Advances in Neural Information Processing Systems 15 (pp. 609-616).

Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1935-1959.

Roach, G. F. (1970). Green's Functions. Cambridge, UK: Cambridge University Press.

Seeger, M., Williams, C., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse Gaussian process regression. In C. M. Bishop and B. J. Frey (Eds.), Workshop on AI and Statistics 9. Society for Artificial Intelligence and Statistics.

Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, 619-625. Cambridge, MA: MIT Press.

Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf and J. Platt (Eds.), Advances in Neural Information Processing Systems 18, 1257-1264. Cambridge, MA: MIT Press.

Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67-93.

Walder, C., Schölkopf, B., & Chapelle, O. (2006). Implicit surface modelling with a globally regularised basis of compact support. Proc. EUROGRAPHICS, 25, 635-644.