Gaussian Process Product Models for Nonparametric Nonstationarity

Ryan Prescott Adams (rpa23@cam.ac.uk)
Oliver Stegle (os252@cam.ac.uk)
Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK

Abstract

Stationarity is often an unrealistic prior assumption for Gaussian process regression. One solution is to predefine an explicit nonstationary covariance function, but such covariance functions can be difficult to specify and require detailed prior knowledge of the nonstationarity. We propose the Gaussian process product model (GPPM), which models data as the pointwise product of two latent Gaussian processes to nonparametrically infer nonstationary variations of amplitude. This approach differs from other nonparametric approaches to covariance function inference in that it operates on the outputs rather than the inputs, resulting in a significant reduction in computational cost and required data for inference. We present an approximate inference scheme using Expectation Propagation. This variational approximation yields convenient GP hyperparameter selection and compact approximate predictive distributions.

1. Introduction

The Gaussian process (Rasmussen & Williams, 2006) is a useful and popular prior for nonlinear regression. It can be used to construct a distribution over scalar functions via a prior on smoothness. This prior is specified through a positive-definite kernel, which determines the covariance between two outputs as a function of their corresponding inputs. Often, this covariance function is taken to be stationary, i.e., a function only of the distance between the input points. Stationary covariance functions are appealing due to their intuitive interpretation and their relative ease of construction via Bochner's Theorem (Gibbs, 1997).

Unfortunately, stationarity is often an unrealistic assumption. We expect many problems of interest to have nontrivial nonstationarity in the form of input-dependent noise, length scale, or amplitude. While input-dependent noise and length scale have been well studied in the literature, nonstationarity in the form of varying amplitude has received relatively little attention. One approach to modeling such data is to directly specify a covariance function with nonstationary properties (Gibbs, 1997; Higdon et al., 1999). In machine learning, however, we find it undesirable to have to specify the covariance nonstationarity a priori; rather, we wish to infer it. Moreover, as the objective with Gaussian process regression is to perform nonparametric inference, we would prefer a representation of the nonstationarity which is also nonparametric.

Several approaches have been proposed to solve the problem of learning a length scale that varies across the input space. One of the first techniques was that of Sampson and Guttorp (1992), who model a spline-based mapping to a latent input space in which the data are stationary. This approach was given a nonparametric Bayesian treatment by Schmidt and O'Hagan (2003). Recently, Paciorek and Schervish (2004) extended the work of Higdon et al. (1999) to learn nonparametric variation of the covariance kernel. Other approaches involve Gaussian process mixtures (Rasmussen, 2000), augmentation of the input space (Pfingsten et al., 2006), and weighted sums of locally-stationary processes (Nott & Dunsmuir, 2002).

A related problem is input-dependent observation noise in the Gaussian process, addressed by Goldberg et al. (1998), who model a log-noise term in the covariance function with another Gaussian process, and by Le et al. (2005), who model nonstationary noise by performing regression in the natural parameter space of the exponential family. Snelson and Ghahramani (2006) achieve nonstationary noise as a side effect of the combination of input dimensionality reduction and a sparse approximation using pseudo-data.
In this paper, we propose the Gaussian process product model (GPPM) to address smooth input-dependent changes in amplitude. The GPPM models the data as the pointwise product of two latent stationary Gaussian processes. This approach has a notable computational advantage over remappings of the input space: high-dimensional problems pose no intrinsic scalability problems. Remapping the input nonparametrically while maintaining the input dimension requires at least as many latent processes as input dimensions. In contrast, the GPPM uses only a single additional GP regardless of input dimension. We develop a quadrature-based Expectation Propagation (EP) algorithm for efficient approximate inference in the GPPM. The EP approach allows us to use the estimated marginal likelihood of the model to learn empirical settings of the Gaussian process hyperparameters. The approximate inference procedure we describe yields uncertainty in the nonstationarity, while avoiding the expensive MCMC methods that are typically required. We additionally develop useful approximations for the predictive distribution arising from the EP approximation, and discuss rapid learning of a MAP estimate of the nonstationarity when observations can be considered noise-free. This model is similar to that presented by Turner and Sahani (2008), who modulate sounds with Gaussian processes; however, the GPPM is intended for the general regression problem and our inference approach differs significantly.

Figure 1. A graphical model describing the GPPM. The thick lines connecting the values of f and g represent undirected connections associated with the Gaussian process. The double-lined circles around the y values represent observables. Both f(x) and g(x) have the same input space.

2. Gaussian Process Regression

In Gaussian process regression, we find a distribution over functions of the form f : X → R, with X = R^m. For a comprehensive introduction see Rasmussen and Williams (2006). The data consist of N input/output pairs D = {x_n, y_n}_{n=1}^N, x_n ∈ X, y_n ∈ R. A vector of output points has a Gaussian prior distribution with a mean function µ(x), which we take to be zero, and a positive-definite covariance function C(x, x'; θ). This construction gives an analytic Gaussian predictive distribution for an unseen output y_* ~ N(µ_*, v_*):

$$ \mu_* = k_N^{\mathsf T} C_N^{-1} y_N, \qquad v_* = C(x_*, x_*) - k_N^{\mathsf T} C_N^{-1} k_N, $$

where k_N = [C(x_*, x_1; θ), ..., C(x_*, x_N; θ)]^T and C_N is the covariance matrix formed from the observed data. The log evidence, or log marginal likelihood after integrating out all possible functions, is

$$ \mathcal{L} = -\tfrac{1}{2} \ln |C_N| - \tfrac{1}{2} y_N^{\mathsf T} C_N^{-1} y_N - \tfrac{N}{2} \ln 2\pi. \qquad (1) $$

Stationary covariance functions depend only on a distance measure d between x and x', for example the Mahalanobis distance $d(x, x') = \sqrt{(x - x')^{\mathsf T} W (x - x')}$ with positive definite W. Covariance functions that depend only on distance are appealing due to the intuition that the outputs of the function should covary in inverse proportion to how far apart the inputs are. The model proposed in this paper attempts to retain this intuition while providing a mechanism for the relationship between distance and covariance to vary across the input space.
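As a concrete illustration of the quantities above, the following sketch (not the authors' code) computes the predictive mean and variance and the log evidence of Equation (1) for a single test input. It assumes a squared-exponential covariance built on the Mahalanobis distance with a diagonal metric, and folds an i.i.d. noise term of variance noise² into C_N; the function names and the Cholesky-based implementation are illustrative choices.

```python
import numpy as np

def sq_exp_cov(X1, X2, amp=1.0, lengthscales=1.0):
    """Stationary squared-exponential covariance, C(x, x') = amp^2 exp(-d(x, x')^2 / 2),
    where d is the Mahalanobis distance with diagonal W = diag(1 / lengthscales^2)."""
    X1 = np.atleast_2d(X1) / lengthscales
    X2 = np.atleast_2d(X2) / lengthscales
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return amp**2 * np.exp(-0.5 * np.maximum(d2, 0.0))

def gp_predict(X, y, x_star, noise=0.1, **kern_args):
    """GP predictive mean/variance at x_star and the log evidence of Equation (1)."""
    N = X.shape[0]
    C_N = sq_exp_cov(X, X, **kern_args) + noise**2 * np.eye(N)   # covariance of observed data
    k_N = sq_exp_cov(X, x_star[None, :], **kern_args).ravel()    # [C(x*, x_1), ..., C(x*, x_N)]
    L = np.linalg.cholesky(C_N)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # C_N^{-1} y
    v = np.linalg.solve(L, k_N)
    mu_star = k_N @ alpha                                        # k_N^T C_N^{-1} y
    v_star = sq_exp_cov(x_star[None, :], x_star[None, :], **kern_args)[0, 0] - v @ v
    log_evidence = (-np.sum(np.log(np.diag(L)))                  # -1/2 log|C_N|
                    - 0.5 * y @ alpha - 0.5 * N * np.log(2.0 * np.pi))
    return mu_star, v_star, log_evidence
```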
3. The Gaussian Process Product Model

In the Gaussian process product model (GPPM), the observed outputs {y_n}_{n=1}^N are modeled as the pointwise product of two latent functions, plus independent zero-mean Gaussian noise with variance σ². One latent function, f : X → R, is modulated by the other function, g : X → R, which has been exponentiated, so that

$$ y_n \sim \mathcal{N}\!\left( f(x_n)\, e^{g(x_n)},\; \sigma^2 \right). \qquad (2) $$

We place independent zero-mean Gaussian process priors on f(x) and g(x), with covariance functions C_f(x, x'; θ_f) and C_g(x, x'; θ_g), respectively. Figure 1 shows a graphical interpretation of this model. Our convention is that f(x) captures local near-stationary variations in the observed function and g(x) captures slowly-varying amplitude nonstationarity. The length-scale hyperparameters of these covariance functions (and their hyperpriors) should be chosen to reflect prior beliefs about such variations. To give the flavor of this model, Figure 2 shows several samples from the GPPM.

Figure 2. Three samples from the GPPM with different parameters. In the top plot, the length scales are l_f = 0.5 and l_g = 4.0. In the middle plot, both are shorter: l_f = 0.25 and l_g = 2.0. In the bottom plot, l_f = 0.5 and l_g = 2.0, but f(x) also has additive noise.

Note that the pointwise product of a Gaussian process prior with any known function a(x) results in a covariance function given by C'(x, x') = a(x) C(x, x') a(x'), and that this function is guaranteed to be positive definite. In the GPPM we use an exponentiated form, a(x) = exp{g(x)}, in order to reduce the multimodality of the posterior on the latent functions, but this is not critical for the validity of the covariance function. Without restricting the sign of one of the functions, there would be at least 2^N posterior modes, as each observation could be explained by the same latent function values with flipped signs.
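To give a concrete feel for the generative process in Equation (2), the sketch below draws f and g from independent squared-exponential GP priors on a grid and forms y_n = f(x_n) e^{g(x_n)} plus noise, roughly in the spirit of Figure 2. The kernel choice, length scales, and noise level are illustrative assumptions, not the exact settings used for the figure.

```python
import numpy as np

def gp_draw(x, lengthscale, rng, jitter=1e-8):
    """Draw one zero-mean GP sample on the 1-D grid x (squared-exponential kernel)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    C = np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(x))
    return np.linalg.cholesky(C) @ rng.standard_normal(len(x))

def sample_gppm_prior(x, lf=0.5, lg=2.0, sigma=0.05, seed=0):
    """One sample path from the GPPM prior of Equation (2): y = f * exp(g) + noise."""
    rng = np.random.default_rng(seed)
    f = gp_draw(x, lf, rng)                         # local, near-stationary variation
    g = gp_draw(x, lg, rng)                         # slowly-varying log-amplitude
    y = f * np.exp(g) + sigma * rng.standard_normal(len(x))
    return f, g, y

x = np.linspace(-5.0, 5.0, 200)
f, g, y = sample_gppm_prior(x)                      # g modulates the amplitude of f
```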
4. Inference in the GPPM

The basic GPPM inference task is to determine the posterior distribution over the values of the latent functions f(x) and g(x) at the input locations {x_n}_{n=1}^N. These latent function values will be denoted f_n = f(x_n) and g_n = g(x_n) for brevity. Additionally, we will write the vectors of these latent values in bold type: f = [f_1, ..., f_N]^T and g = [g_1, ..., g_N]^T. With this notation, and with C_f and C_g representing the GP-derived covariance matrices on f(x) and g(x) respectively, the posterior distribution of the latent functions is

$$ p(f, g \mid D, \theta) \;\propto\; \mathcal{N}(f;\, 0, C_f)\, \mathcal{N}(g;\, 0, C_g) \prod_{n=1}^{N} \mathcal{N}\!\left( y_n;\, f_n e^{g_n},\, \sigma^2 \right). \qquad (3) $$

4.1. Approximate Inference

Approximate inference via variational methods is appealing due to its determinism and potential computational savings. In the GPPM, several properties affect our choice of approximation. First, we expect that the posterior will be approximately Gaussian, as we have strong Gaussian process priors and a near-Gaussian likelihood. Second, the likelihood factorizes into N independent terms, each involving one value from each of the two latent functions. Third, these likelihood factors introduce nontrivial dependencies between f and g, so that a factorized approximation is inappropriate. We address these properties using Expectation Propagation.

4.1.1. Expectation Propagation

Expectation Propagation (Minka, 2001) makes successive local approximations of factors in a joint density, typically using exponential-family distributions, to yield a global approximation that is optimal under a divergence measure. EP is particularly well suited to approximating Bayesian posterior distributions with i.i.d. data, as in Equation 3, because each factor involves only a few of the unknown parameters.

Our construction of the EP approximation is similar to that used by Rasmussen and Williams (2006) for binary Gaussian process classification. The prior on f and g is Gaussian with zero mean and a block covariance matrix arising from the independent Gaussian process priors. For notational convenience, we write θ for the concatenation of f and g, so that θ = [f_1, ..., f_N, g_1, ..., g_N]^T, and θ_n for the nth pair [f_n g_n]^T. The prior can now be written

$$ p(\theta) = \mathcal{N}(0, \Sigma_{\mathrm{GP}}), \qquad \Sigma_{\mathrm{GP}} = \begin{bmatrix} C_f & 0 \\ 0 & C_g \end{bmatrix}. $$

The aim of EP is to approximate the exact posterior distribution of Equation 3 with a tractable alternative

$$ q(f, g \mid D, \theta) \;\propto\; \mathcal{N}(\theta;\, 0, \Sigma_{\mathrm{GP}}) \prod_{n=1}^{N} \tilde{t}_n(f_n, g_n). \qquad (4) $$

Each of the exact likelihood terms

$$ L_n(f_n, g_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \left( f_n e^{g_n} - y_n \right)^2 \right\} $$

is approximated with an unnormalized bivariate Gaussian on f_n and g_n:

$$ \tilde{t}_n(f_n, g_n) = \tilde{Z}_n \exp\!\left\{ -\tfrac{1}{2} (\theta_n - \tilde{\mu}_n)^{\mathsf T} \tilde{\Sigma}_n^{-1} (\theta_n - \tilde{\mu}_n) \right\}. $$

The product of these likelihood approximations is an unnormalized Gaussian with a block-diagonal covariance matrix:

$$ \prod_{n=1}^{N} \tilde{t}_n(\theta_n) = \left( \prod_{n=1}^{N} \tilde{Z}_n \right) \exp\!\left\{ -\tfrac{1}{2} (\theta - \tilde{\mu})^{\mathsf T} \tilde{\Sigma}^{-1} (\theta - \tilde{\mu}) \right\}. $$

The overall approximation is Gaussian as well, as it is the product of these Gaussian likelihood approximations and the Gaussian process prior:

$$ q(f, g \mid D, \theta) = \mathcal{N}\!\left( \begin{bmatrix} f \\ g \end{bmatrix};\; \mu, \Sigma \right), \qquad \Sigma = \left( \Sigma_{\mathrm{GP}}^{-1} + \tilde{\Sigma}^{-1} \right)^{-1}, \qquad \mu = \Sigma\, \tilde{\Sigma}^{-1} \tilde{\mu}. \qquad (5) $$
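The assembly of the global approximation in Equation (5) is a direct computation once the prior covariances and the current site parameters are in hand. The following sketch is a simplified dense-linear-algebra version (an explicit O(N³) inversion rather than the incremental updates an efficient implementation would use); the data structures holding the sites are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import block_diag

def global_approximation(Cf, Cg, site_prec, site_prec_mu):
    """Combine the GP prior with the local site approximations (Equation 5).

    Cf, Cg       : N x N prior covariances of f and g.
    site_prec    : length-N list of 2x2 site precisions,      Sigma_tilde_n^{-1}.
    site_prec_mu : length-N list of 2-vectors,                Sigma_tilde_n^{-1} mu_tilde_n.
    Returns the mean and covariance of q(f, g | D) over theta = [f; g]."""
    N = Cf.shape[0]
    Sigma_GP = block_diag(Cf, Cg)             # prior covariance of theta = [f; g]
    prec_tilde = np.zeros((2 * N, 2 * N))     # Sigma_tilde^{-1}, one 2x2 block per site
    h_tilde = np.zeros(2 * N)                 # Sigma_tilde^{-1} mu_tilde
    for n in range(N):
        idx = [n, N + n]                      # (f_n, g_n) sit at positions n and N + n
        prec_tilde[np.ix_(idx, idx)] += site_prec[n]
        h_tilde[idx] += site_prec_mu[n]
    Sigma = np.linalg.inv(np.linalg.inv(Sigma_GP) + prec_tilde)
    mu = Sigma @ h_tilde                      # mu = Sigma * Sigma_tilde^{-1} * mu_tilde
    return mu, Sigma
```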
The Expectation Propagation algorithm proceeds by iteratively updating the parameters of the local approximations t̃_n, leaving all other approximate factors fixed. In this iterative procedure, the update of the nth site can be understood as the minimization of the KL divergence between two distributions: the product of the cavity distribution and the exact local likelihood, and the product of the cavity distribution and the approximate local likelihood. The insight of EP is that the cavity distribution "focuses" the approximation on the most relevant area:

$$ (\hat{\mu}_n, \hat{\Sigma}_n) = \operatorname*{argmin}_{\mu, \Sigma} \; \mathrm{KL}\!\left( \underbrace{\mathcal{N}(\mu_{\setminus n}, \Sigma_{\setminus n}) \times L_n(f_n, g_n)}_{\text{exact factor}} \;\middle\|\; \underbrace{\mathcal{N}(\mu_{\setminus n}, \Sigma_{\setminus n}) \times \tilde{t}_n(f_n, g_n \mid \mu, \Sigma)}_{\text{approximation}} \right). $$

The cavity distribution for site n is the product of the prior and all approximate sites excluding the nth. This is Gaussian with parameters

$$ \Sigma_{\setminus n}^{-1} = \Sigma_n^{-1} - \tilde{\Sigma}_n^{-1}, \qquad (6) $$
$$ \mu_{\setminus n} = \Sigma_{\setminus n} \left( \Sigma_n^{-1} \mu_n - \tilde{\Sigma}_n^{-1} \tilde{\mu}_n \right), \qquad (7) $$

where µ_n and Σ_n denote the marginal mean and covariance of the current global approximation at site n. As shown by Minka (2001), the minimum of an inclusive KL divergence is achieved when the moments are equal. Thus, to find the best-fitting Gaussian it is sufficient to find the first and second moments of the product of the cavity distribution and the exact likelihood. We also find the "zeroth moment," which is the normalization constant Ẑ_n. Calculation of these moments is done numerically via Gaussian quadrature, addressed in Section 4.1.2. Once the moments of the product have been found, we use them to recover the optimal parameters of the local approximation:

$$ \tilde{\Sigma}_n^{-1} = \hat{\Sigma}_n^{-1} - \Sigma_{\setminus n}^{-1}, $$
$$ \tilde{\mu}_n = \tilde{\Sigma}_n \left( \hat{\Sigma}_n^{-1} \hat{\mu}_n - \Sigma_{\setminus n}^{-1} \mu_{\setminus n} \right), $$
$$ \ln \tilde{Z}_n = \ln \hat{Z}_n - \tfrac{1}{2} \ln |\tilde{\Sigma}_n| + \tfrac{1}{2} \ln |\Sigma_{\setminus n}| + \tfrac{1}{2} \tilde{\mu}_n^{\mathsf T} \tilde{\Sigma}_n^{-1} \tilde{\mu}_n + \tfrac{1}{2} \mu_{\setminus n}^{\mathsf T} \Sigma_{\setminus n}^{-1} \mu_{\setminus n} - \tfrac{1}{2} \hat{\mu}_n^{\mathsf T} \hat{\Sigma}_n^{-1} \hat{\mu}_n. $$

Taken together, these equations define a fixed-point iteration scheme for approximating the posterior in Equation 3. We initialize the approximations so that the initial estimate of the mean of f is y and the mean of g is zero. We then iterate over each of the N local approximations, and update the overall posterior approximation using Equation 5. To facilitate convergence of EP, it is helpful to damp the updates of the local sites, which we implement in natural parameter space. Convergence of EP is not guaranteed, but given sufficient damping it has converged for the problems we have considered so far. Local approximations may not necessarily be positive definite, but as long as the overall approximation remains a valid Gaussian, this does not present a problem. Following the treatment of negative variances by Minka (2001), we skip the update of local approximations that would result in invalid global covariance matrices. This has not appeared to affect the accuracy of the global approximation in practice. Figure 3(b) shows the result of applying the EP procedure to a synthetic data set. Marginal error bars are shown for each function and site location.

4.1.2. Gaussian Quadrature for EP

Unfortunately, the moments that minimize the KL divergence of Section 4.1.1 are not available analytically. To resolve this, we use the approach proposed by Zoeter and Heskes (2005) of approximating the moment integrals using Gaussian quadrature. When a definite integral is the product of a nonnegative "weighting function" w(v) and another function z(v), it can be approximated by a sum of weighted evaluations of z(v):

$$ \int_a^b dv\; w(v)\, z(v) \;\approx\; \sum_{k=1}^{K} w_k\, z(v_k), $$

where the weights {w_k} and abscissae {v_k} are determined by the integration interval, the weighting function w(v), and the number of evaluation points K. This sum is exact when z(v) is a polynomial of degree at most 2K − 1. In the case of interest here, the weighting function is the Gaussian cavity distribution, which implies Gauss-Hermite quadrature. One difficulty is that Gaussian quadrature is generally oriented towards univariate definite integrals, whereas we must solve a two-dimensional integral. When the weighting function is factorizable, this is done straightforwardly by defining a lattice of abscissae and using the Cartesian product of the weights. In the GPPM, however, the cavity distribution has nonzero mean and is not generally factorizable, so we must transform the integrand prior to performing Gauss-Hermite quadrature. The factorizable form can be recovered with a change of variables based on the Cholesky decomposition of the cavity covariance matrix and the cavity mean. The Gaussian parameters resulting from these moment calculations are denoted Ẑ_n, µ̂_n, and Σ̂_n in Section 4.1.1.
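A minimal sketch of the quadrature step just described: standard Gauss-Hermite abscissae are mapped through the Cholesky factor of the cavity covariance and shifted by the cavity mean, the exact likelihood N(y_n; f e^g, σ²) is evaluated on the resulting lattice, and the zeroth, first, and second moments are accumulated. The number of quadrature points and the vectorized layout are illustrative choices, not the authors' settings.

```python
import numpy as np

def tilted_moments(y_n, cav_mu, cav_cov, sigma, K=20):
    """Moments of N(theta_n; cav_mu, cav_cov) * L_n(f_n, g_n), theta_n = (f_n, g_n),
    computed by 2-D Gauss-Hermite quadrature (Section 4.1.2)."""
    xi, w = np.polynomial.hermite.hermgauss(K)            # nodes/weights for weight exp(-xi^2)
    Xi = np.array([[a, b] for a in xi for b in xi])       # Cartesian lattice of abscissae
    W = np.array([wa * wb for wa in w for wb in w]) / np.pi
    L = np.linalg.cholesky(cav_cov)
    theta = cav_mu + np.sqrt(2.0) * Xi @ L.T              # change of variables: mu + sqrt(2) L xi
    f, g = theta[:, 0], theta[:, 1]
    lik = np.exp(-0.5 * (f * np.exp(g) - y_n) ** 2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
    Z_hat = np.sum(W * lik)                               # zeroth moment (normalizer)
    mu_hat = (W * lik) @ theta / Z_hat                    # first moment
    centred = theta - mu_hat
    Sigma_hat = centred.T @ ((W * lik)[:, None] * centred) / Z_hat
    return Z_hat, mu_hat, Sigma_hat                       # used to update the site parameters
```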
4.2. Noise-free MAP Learning

In some applications of the GPPM, the observations can be considered noise-free. For example, one may model the noise as coming exclusively from the locally-varying function f(x). The appeal of this restricted model is that proposals of the nonstationarity can now be evaluated in O(N²) time rather than O(N³). This is particularly valuable for finding rapid maximum a posteriori (MAP) estimates of the latent modulating function g(x).

The computational advantage in the noise-free case comes from the deterministic coupling of the latent functions given y; we can now consider the posterior of g alone:

$$ p(g \mid D, \theta_f, \theta_g) \;\propto\; p(D \mid g, \theta_f)\, p(g \mid \theta_g). \qquad (8) $$

In this form, conditioning on g corresponds to a simple linear transformation of the GP prior on f. Using the notational shortcut G = diag([e^{g_1}, e^{g_2}, ..., e^{g_N}]), the log likelihood is

$$ \ln p(D \mid g, \theta_f) = -\tfrac{1}{2} \ln |G\, C_f\, G| - \tfrac{1}{2} y^{\mathsf T} [G\, C_f\, G]^{-1} y - \tfrac{N}{2} \ln 2\pi. $$

The log posterior over g, eliminating irrelevant terms and using 1 to indicate a column vector of ones, is

$$ \ln p(g \mid D, \theta_f, \theta_g) = -g^{\mathsf T} \mathbf{1} - \tfrac{1}{2} y^{\mathsf T} [G\, C_f\, G]^{-1} y - \tfrac{1}{2} g^{\mathsf T} C_g^{-1} g + \text{const}, $$

and the gradient with respect to g is

$$ \nabla_g \ln p(g \mid D, \theta_f, \theta_g) = -\mathbf{1} + Y\, [G\, C_f\, G]^{-1} y - C_g^{-1} g, $$

where Y = diag(y). As the difficult O(N³) operations of decomposing or inverting C_f and C_g can be done in advance, the computational complexity of taking a step in g space is O(N²). In practice, we have found the MAP estimate to work best when f(x) has additive noise and g(x) is smooth.

Figure 3. Figure 3(a) shows synthetic data generated from the GPPM with known settings and σ = 0.05 ("Observed Data"). We applied the Expectation Propagation algorithm to the data; the Gaussian marginal posterior distributions over f and g are shown in Figure 3(b) ("EP Approximation"), along with the true f(x) and g(x) indicated as circles. Figure 3(c) ("MAP Approximation") shows the result of applying the MAP approximation to the data, despite the known observation noise. The true values are shown for comparison.
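The noise-free MAP objective and gradient translate directly into code. The sketch below factorizes C_f and C_g once, so each evaluation of the negative log posterior and its gradient costs O(N²), and returns a function that can be handed to a generic gradient-based optimizer; this is an illustration of the computation above, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_map_objective(y, Cf, Cg):
    """Negative log posterior over g (up to a constant) and its gradient for the
    noise-free GPPM of Section 4.2. The O(N^3) factorizations are done once."""
    cf = cho_factor(Cf)
    cg = cho_factor(Cg)

    def neg_log_post_and_grad(g):
        u = y * np.exp(-g)                    # G^{-1} y, with G = diag(exp(g))
        a = cho_solve(cf, u)                  # C_f^{-1} G^{-1} y
        b = cho_solve(cg, g)                  # C_g^{-1} g
        nlp = np.sum(g) + 0.5 * u @ a + 0.5 * g @ b     # = -(log posterior) + const
        grad = 1.0 - u * a + b                # = -(gradient given in the text)
        return nlp, grad                      # each call is O(N^2)

    return neg_log_post_and_grad

# e.g. scipy.optimize.minimize(make_map_objective(y, Cf, Cg), g0, jac=True)
```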
5. Making Predictions

As with the standard regression model, the primary task of interest is prediction at locations where data have not been observed. For the GPPM we must make predictions for both latent functions and find the resulting distribution, integrating out the posterior over the latent functions:

$$ p(y_* \mid x_*, D, \theta_f, \theta_g) = \iint df_*\, dg_*\; p(y_* \mid x_*, f_*, g_*)\, p(f_*, g_* \mid D, \theta_f, \theta_g). $$

The EP scheme of Section 4.1 finds an approximate Gaussian distribution over f and g, and this yields a convenient joint Gaussian distribution on f_* and g_*, the values of the latent functions at x_*, with parameters

$$ \Sigma_* = \begin{bmatrix} C(x_*, x_*; \theta_f) & 0 \\ 0 & C(x_*, x_*; \theta_g) \end{bmatrix} - K_*^{\mathsf T} \left( \Sigma_{\mathrm{GP}} + \tilde{\Sigma} \right)^{-1} K_*, \qquad \mu_* = K_*^{\mathsf T} \left( \Sigma_{\mathrm{GP}} + \tilde{\Sigma} \right)^{-1} \tilde{\mu}, $$

where

$$ K_* = \begin{bmatrix} C(x_*, x_1; \theta_f) & 0 \\ \vdots & \vdots \\ C(x_*, x_N; \theta_f) & 0 \\ 0 & C(x_*, x_1; \theta_g) \\ \vdots & \vdots \\ 0 & C(x_*, x_N; \theta_g) \end{bmatrix}. $$

We expect that the resulting predictive distribution on y_* will be heavy-tailed, with properties similar to the noncentral Student's t distribution. To approximate the true distribution's heavy tails analytically, one approach is to generate several samples from g_* and use the conditional distribution on f_* to create a mixture of Gaussians:

$$ p(y_* \mid x_*, D, \theta_f, \theta_g) \approx \frac{1}{I} \sum_{i=1}^{I} \mathcal{N}\!\left( y_*;\; \mu_{f_*|g_i}\, e^{g_i},\; v_{f_*|g_i}\, e^{2 g_i} \right), $$

where µ_{f_*|g_i} and v_{f_*|g_i} denote the conditional Gaussian parameters of f_* given the ith marginal sample from g_*.

If the heavy-tailed properties are not significant for the application, and a single Gaussian distribution is preferred, a more tractable alternative is to linearize the model around the mean µ_*. This is similar to the Extended Kalman Filter (EKF) (Haykin, 2001), which uses the first terms of the Taylor series of a nonlinear function to maintain Gaussian uncertainty in latent state estimation. The resulting approximation is

$$ y_* \approx \mu_{f_*} e^{\mu_{g_*}} + \begin{bmatrix} e^{\mu_{g_*}} \\ \mu_{f_*} e^{\mu_{g_*}} \end{bmatrix}^{\mathsf T} \begin{bmatrix} f_* - \mu_{f_*} \\ g_* - \mu_{g_*} \end{bmatrix}, $$

which transforms the Gaussian on f_* and g_* into one on y_* with parameters

$$ \mu_{y_*} = \mu_{f_*} e^{\mu_{g_*}}, \qquad v_{y_*} = \begin{bmatrix} e^{\mu_{g_*}} \\ \mu_{f_*} e^{\mu_{g_*}} \end{bmatrix}^{\mathsf T} \Sigma_* \begin{bmatrix} e^{\mu_{g_*}} \\ \mu_{f_*} e^{\mu_{g_*}} \end{bmatrix} + \sigma^2. $$
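Both predictive approximations are inexpensive once the joint Gaussian on (f_*, g_*) is available. The sketch below implements the linearized single-Gaussian approximation and the sampled Gaussian-mixture alternative; adding the observation-noise variance σ² to the component variances is an assumption of this sketch, and the number of samples is illustrative.

```python
import numpy as np

def linearized_prediction(mu_star, Sigma_star, sigma):
    """EKF-style single-Gaussian approximation to p(y* | x*, D): linearize y = f exp(g)
    around the predictive mean of (f*, g*)."""
    mf, mg = mu_star
    J = np.array([np.exp(mg), mf * np.exp(mg)])      # gradient of f * exp(g) at the mean
    mu_y = mf * np.exp(mg)
    v_y = J @ Sigma_star @ J + sigma**2              # propagated latent uncertainty + noise
    return mu_y, v_y

def mixture_prediction(mu_star, Sigma_star, sigma, n_samples=50, seed=0):
    """Heavier-tailed alternative: sample g*, condition f* | g*, and mix Gaussians."""
    rng = np.random.default_rng(seed)
    mf, mg = mu_star
    vf, vg, c = Sigma_star[0, 0], Sigma_star[1, 1], Sigma_star[0, 1]
    g_samp = mg + np.sqrt(vg) * rng.standard_normal(n_samples)
    mf_given_g = mf + (c / vg) * (g_samp - mg)       # conditional Gaussian parameters of f* | g*
    vf_given_g = vf - c**2 / vg
    means = mf_given_g * np.exp(g_samp)
    variances = vf_given_g * np.exp(2.0 * g_samp) + sigma**2
    return means, variances                          # equally-weighted mixture components
```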
6. Hyperparameter Learning

When performing Gaussian process regression, we are commonly interested in appropriate settings of the hyperparameters controlling the covariance function. These hyperparameters generally determine the length scale of correlations, the output variation (or amplitude) of the function, and the noise level. In the GPPM, we wish to find appropriate hyperparameter settings for both latent functions, given the data. While the vanilla Gaussian process offers the marginal likelihood analytically, it is not available directly in the GPPM. Fortunately, the EP algorithm of Section 4.1 provides a convenient estimate of the marginal likelihood, using the zeroth moments mentioned previously:

$$ \ln Z_{\mathrm{EP}} = \tfrac{1}{2} \ln |\Sigma| - \tfrac{1}{2} \ln |\Sigma_{\mathrm{GP}}| - \tfrac{1}{2} \tilde{\mu}^{\mathsf T} \tilde{\Sigma}^{-1} \tilde{\mu} + \tfrac{1}{2} \mu^{\mathsf T} \Sigma^{-1} \mu + \sum_{n=1}^{N} \ln \tilde{Z}_n. $$

In principle it is also possible to evaluate the gradients of ln Z_EP with respect to the hyperparameters, following for instance Seeger (2005). In practice, however, the quadrature-based moment calculation is not numerically stable enough to provide precise gradients. We therefore use gradient-free optimization methods to determine hyperparameter settings. We suggest setting hyperpriors to reflect the intuition described in Section 3: f(x) captures local near-stationary variations, and g(x) captures slowly-varying nonstationarity on a larger length scale.
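Because the gradients of ln Z_EP are not relied upon, hyperparameter selection reduces to scoring candidate settings with the EP evidence estimate. The sketch below shows a coarse grid search of the kind described above; run_ep is a hypothetical stand-in for an EP routine that runs to convergence and returns ln Z_EP, and the constraint that g vary on a longer length scale than f encodes the hyperprior intuition of Section 3.

```python
import itertools
import numpy as np

def select_hyperparameters(X, y, lf_grid, lg_grid, sigma_grid, run_ep):
    """Gradient-free (grid) search over GPPM hyperparameters, scored by the EP
    estimate of the log marginal likelihood, ln Z_EP (Section 6).
    run_ep(X, y, lf, lg, sigma) is assumed to return ln Z_EP for that setting."""
    best, best_score = None, -np.inf
    for lf, lg, sigma in itertools.product(lf_grid, lg_grid, sigma_grid):
        if lg <= lf:
            continue                  # hyperprior intuition: g varies on a longer scale than f
        score = run_ep(X, y, lf, lg, sigma)
        if score > best_score:
            best, best_score = (lf, lg, sigma), score
    return best, best_score
```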
7. Results

We evaluated the GPPM on three data sets. First, we examined the motorcycle data set (Parker & Rice, 1985), a well-studied example of a nonstationary regression task. The data are acceleration force in g's on a helmet during impact, as a function of time in milliseconds. The upper plot of Figure 4(a) shows the EP approximation found for the latent g(x) function, and the lower plot shows the Gaussian approximation to the predictive distribution, overlaid with the true data. The GPPM finds a good fit in most regions except where the g(x) function becomes quite small. In these regions, the uncertainty in the modulating function creates unrealistically large prediction error bars.

We evaluated the accuracy of predictions using a fill-in test, in which a fraction of the data are removed from the training set and compared to the model's predictions. Figure 4(d) depicts the mean log probability and the mean squared error for the missing data as a function of the fraction of missing data. The GPPM outperforms both a vanilla GP and the sparse pseudo-input Gaussian process (SPGP) (Snelson & Ghahramani, 2006) under either performance measure. We chose the SPGP for comparison to the GPPM, as it is one of the few methods capable of representing nonstationarity without requiring MCMC. Hyperparameters for the SPGP and the vanilla GP were set via ML-II optimization (Rasmussen & Williams, 2006). To set hyperparameters in the GPPM, a grid search was used, centered on the settings for the vanilla GP.

We also examined the performance of the GPPM on daily log returns of the S&P 500 stock index during 2007. We expect these data to be well modeled by a latent f(x) comprised primarily of noise. The log modulating function g(x) can be interpreted roughly as the log "volatility" of the stochastic process and is shown in the upper plot of Figure 4(b). The corresponding expected envelope is shown against the true data in the lower plot. Performance measures against the standard Gaussian process and the SPGP are shown in Figure 4(e). In this example the mean predictions are equally good for all three models, but the GPPM yields nonstationary uncertainty, which results in an improved mean log probability.

As a last application, we applied the GPPM to 23 hours of heart rate data, sampled at 5-minute intervals. Based on the physiological properties of heart rates, we expect correlations on a short time scale to be captured by f(x). These local correlations are modulated by an activity profile over a daily time scale. Figure 4(c) illustrates that these amplitude modulations are picked up by the latent g(x), leading to improved predictive performance compared to the vanilla GP and the SPGP, as shown in Figure 4(f).

Figure 4. Top panel: predictive distributions of the GPPM for the three data sets: (a) Motorcycle Predictions, (b) S&P 500 Predictions, (c) Heart Rate Predictions. In each case the upper plot shows the EP approximation to the posterior of the log-modulating function g(x) with 2σ error bars, and the lower plot shows the raw data along with the 2σ approximate predictive distribution. Lower panel: fill-in tests for the corresponding data sets comparing three models: (d) Motorcycle Fill-in Performance, (e) S&P Fill-in Performance, (f) Heart Rate Fill-in Performance. The upper plot shows the mean log probability of the missing data as a function of the fill-in rate; the lower plot shows the root mean squared error. Both plots show mean values and 2σ error bars, calculated from four training/test splits.
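For completeness, a sketch of the fill-in protocol used above: a fraction of the points is held out, the model is fit on the remainder, and the mean log predictive probability and root mean squared error of the held-out points are recorded over several splits. fit_predict is a hypothetical stand-in for any of the compared models (returning predictive means and variances); the fractions and number of splits mirror the description above.

```python
import numpy as np

def fill_in_test(x, y, fit_predict, fracs=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), n_splits=4, seed=0):
    """Fill-in evaluation: hold out a fraction of the data, fit on the rest, and
    report mean log probability and RMSE of the held-out points (Section 7)."""
    rng = np.random.default_rng(seed)
    results = {}
    for frac in fracs:
        mlp, rmse = [], []
        for _ in range(n_splits):
            test = rng.random(len(x)) < frac                  # random held-out subset
            mu, var = fit_predict(x[~test], y[~test], x[test])
            logp = -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[test] - mu) ** 2 / var
            mlp.append(np.mean(logp))
            rmse.append(np.sqrt(np.mean((y[test] - mu) ** 2)))
        results[frac] = (np.mean(mlp), np.mean(rmse))
    return results                                            # frac -> (mean log prob, RMSE)
```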
8. Discussion

We have introduced the Gaussian process product model for modeling nonstationary amplitude in regression. We have presented an approximate inference algorithm using Expectation Propagation to infer the latent functions in this model, and have exploited this approximation to make tractable predictions and enable hyperparameter learning.

When examined on real-world data, the GPPM has yielded promising results, outperforming the vanilla Gaussian process. It has also outperformed an alternative approach to nonstationary regression, the SPGP, although it should be noted that the SPGP's focus is purely on efficient regression and not on modeling nonstationarity per se.

Computationally, the model we have presented, combined with the EP implementation, has two appealing properties. First, as we expect the number of EP iterations to be independent of the number of data (Minka, 2001), and each local calculation is an O(N²) rank-one update of the inverse, the overall algorithm is O(N³). The GPPM is therefore only a constant multiple more expensive than standard Gaussian process regression. Second, in contrast to methods that model nonstationarity on the input side, the GPPM does not introduce additional latent spaces as the input dimensionality increases. The additional computational complexity of using the GPPM is essentially independent of input dimension.

In future work, a more comprehensive examination of inference of hyperparameters is warranted. We also expect that the basic idea of this model can be used to perform vector regression with correlation that varies across the input space.

Acknowledgements

The authors wish to thank David MacKay for helpful comments. This work was funded by the Gates Cambridge Trust.

References

Gibbs, M. (1997). Bayesian Gaussian processes for regression and classification. Doctoral dissertation, University of Cambridge, Cambridge.

Goldberg, P., Williams, C., & Bishop, C. (1998). Regression with input-dependent noise: a Gaussian process treatment. Advances in Neural Information Processing Systems 10 (pp. 493-499). Cambridge, MA: MIT Press.

Haykin, S. (Ed.). (2001). Kalman filtering and neural networks. New York: John Wiley and Sons, Inc.

Higdon, D., Swall, J., & Kern, J. (1999). Nonstationary spatial modeling. In J. Bernardo, J. Berger, A. Dawid and A. Smith (Eds.), Bayesian Statistics 6, 761-768. Oxford: Oxford University Press.

Le, Q., Smola, A., & Canu, S. (2005). Heteroscedastic Gaussian process regression. Proceedings of the 22nd International Conference on Machine Learning (ICML).

Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.

Nott, D., & Dunsmuir, W. (2002). Estimation of nonstationary spatial covariance structure. Biometrika, 89, 819-829.

Paciorek, C., & Schervish, M. (2004). Nonstationary covariance functions for Gaussian process regression. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.

Parker, R., & Rice, J. (1985). Discussion of "Some aspects of the spline smoothing approach to nonparametric curve fitting" by B. W. Silverman. Journal of the Royal Statistical Society, Series B, 47, 40-42.

Pfingsten, T., Kuss, M., & Rasmussen, C. (2006). Nonstationary Gaussian process regression using a latent extension of the input space. www.kyb.mpg.de/publication.html?publ=3985.

Rasmussen, C. (2000). The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12 (pp. 554-560). Cambridge, MA: MIT Press.

Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.

Sampson, P., & Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87, 108-119.

Schmidt, A. M., & O'Hagan, A. (2003). Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society, Series B, 65, 745-758.

Seeger, M. (2005). Expectation propagation for exponential families. Technical report, University of California at Berkeley. See www.kyb.tuebingen.mpg.de/bs/people/seeger.

Snelson, E., & Ghahramani, Z. (2006). Variable noise and dimensionality reduction for sparse Gaussian processes. Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence.

Turner, R., & Sahani, M. (2008). Modeling natural sounds with modulation cascade processes. Advances in Neural Information Processing Systems 21. Cambridge, MA: MIT Press.

Zoeter, O., & Heskes, T. (2005). Gaussian quadrature based expectation propagation. Proceedings of Artificial Intelligence and Statistics 2005.