Variational Mixture of Gaussian Process Experts

Chao Yuan and Claus Neubauer
Siemens Corporate Research, Integrated Data Systems Department
755 College Road East, Princeton, NJ 08540
{chao.yuan,claus.neubauer}@siemens.com

Abstract

Mixture of Gaussian processes models extend a single Gaussian process with the ability to model multi-modal data and to reduce training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated by a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches, as the inputs for each expert are modeled by a Gaussian mixture model. The number of experts and the number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method.

1 Introduction

Despite its widespread success in regression problems, the Gaussian process (GP) has two limitations. First, it cannot handle data with multi-modality. Multi-modality can exist in the input dimension (e.g., non-stationarity), in the output dimension (given the same input, the output has multiple modes), or in a combination of both. Secondly, the cost of training is $O(N^3)$, where $N$ is the size of the training set, which can be too expensive for large data sets.

Mixture of GP experts models were proposed to tackle the above problems (Rasmussen & Ghahramani [1]; Meeds & Osindero [2]). Markov chain Monte Carlo (MCMC) sampling methods (e.g., Gibbs sampling) are the standard approaches to train these models, and they can theoretically achieve very accurate results. However, MCMC methods can be slow to converge and their convergence can be difficult to diagnose. It is thus important to explore alternatives.

In this paper, we propose a new generative mixture of Gaussian processes model for regression problems and apply variational Bayesian methods to train it. Each Gaussian process expert is described by a linear model, which breaks the dependency among training outputs and makes variational inference feasible. The distribution of inputs for each expert is modeled by a Gaussian mixture model (GMM). Thus, our gating network can handle missing inputs and is more flexible than single Gaussian-based gating models [2-4]. The number of experts and the number of components for each GMM are automatically inferred. Training using variational methods is much faster than using MCMC.

The rest of this paper is organized as follows. Section 2 surveys the related work. Section 3 describes the proposed algorithm. We present test results in Section 4 and summarize this paper in Section 5.

2 Related work

The Gaussian process is a powerful tool for regression problems (Rasmussen & Williams [5]). It elegantly models the dependency among data with a Gaussian distribution: $P(\mathbf{Y}) = \mathcal{N}(\mathbf{Y} \mid \mathbf{0}, \mathbf{K} + \sigma_n^2 \mathbf{I})$,
where $\mathbf{Y} = \{y_{1:N}\}$ are the $N$ training outputs and $\mathbf{I}$ is an identity matrix. We will use $y_{1:N}$ to denote $y_1, y_2, \ldots, y_N$. The kernel matrix $\mathbf{K}$ considered here consists of kernel function values between pairs of inputs $x_i$ and $x_j$:
$$K_{ij} = k(x_i, x_j) = \sigma_f^2 \exp\Big(-\sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^2}{2 g_m^2}\Big),$$
where $d$ is the dimension of the input $x$. The $d+2$ hyperparameters $\sigma_n, \sigma_f, g_1, g_2, \ldots, g_d$ can be efficiently estimated from the data. However, the Gaussian process has difficulties in modeling large-scale data and multi-modal data. The first issue was addressed by various sparse Gaussian processes [6-9, 16].

The mixture of experts (MoE) framework offers a natural solution for multi-modality problems (Jacobs et al. [10]). Early MoE work used linear experts [3, 4, 11, 12], and some of them were neatly trained via variational methods [4, 11, 12]. However, these methods cannot model nonlinear data sets well. Tresp [13] proposed a mixture of GPs model that can be trained quickly using the EM algorithm. However, hyperparameters including the number of experts needed to be specified, and the training complexity issue was not addressed. By introducing a Dirichlet process mixture (DPM) prior, infinite mixture of GPs models are able to infer the number of experts as well as both hyperparameters and parameters via Gibbs sampling [1, 2]. However, these models are trained by MCMC methods, which demand expensive training and testing time (as collected samples are usually combined to give predictive distributions). How to select samples and how many samples to use are still challenging problems.

3 Algorithm description

Fig. 1 shows the graphical model of the proposed mixture of experts. It consists of the local expert part and the gating network part, which are covered in Sections 3.1 and 3.2, respectively. In Section 3.3, we describe how to perform variational inference in this model.

Figure 1: The graphical model representation for the proposed mixture of experts model. It consists of a hyperparameter set $\Omega = \{L, \alpha_y, C, \alpha_x, m_0, R_0, r, S, \theta_{1:L}, I_{1:L}, a, b\}$ and a parameter set $\Theta = \{p, q_l, m_{lc}, R_{lc}, v_l, \beta_l \mid l = 1, 2, \ldots, L \text{ and } c = 1, 2, \ldots, C\}$. The local expert is a GP linear model that predicts output $y$ from input $x$; the gating network is a GMM for input $x$. Data can be generated as follows. Step 1, determine the hyperparameters $\Omega$. Step 2, sample the parameters $\Theta$. Step 3, to sample one data point $(x, y)$, we sequentially sample the expert indicator $t$, the cluster indicator $z$, $x$ and $y$. Step 3 is independently repeated until enough data points are generated.

3.1 Local Gaussian process expert

A local Gaussian process expert is specified by the following linear model, given the expert indicator $t = l$ (where $l = 1:L$) and other related variables:
$$P(y \mid x, t = l, v_l, \theta_l, I_l, \beta_l) = \mathcal{N}(y \mid v_l^T \phi_l(x), \beta_l^{-1}). \qquad (1)$$
This linear model is symbolized by the inner product of the weight vector $v_l$ and a nonlinear feature vector $\phi_l(x)$. $\phi_l(x)$ is a vector of kernel function values between a test input $x$ and a subset of training inputs: $[k_l(x, x_{I_{l1}}), k_l(x, x_{I_{l2}}), \ldots, k_l(x, x_{I_{lM}})]^T$. The active set $I_l$ denotes the indices of the $M$ selected training samples. How to select $I_l$ will be addressed in Section 3.3; for now let us assume that we use the whole training set as the active set. $v_l$ has a Gaussian distribution $\mathcal{N}(v_l \mid \mathbf{0}, U_l^{-1})$ with zero mean and inverse covariance $U_l$. $U_l$ is set to $K_l + h_l^2 I$, where $K_l$ is an $M \times M$ kernel matrix consisting of kernel function values between training samples in the active set. $h_l^2$ is needed to avoid singularity of $U_l$. $\theta_l = \{h_l, \sigma_{fl}, g_{l1}, g_{l2}, \ldots, g_{ld}\}$ denotes the set of hyperparameters for this linear model. Note that $\phi_l(x)$ depends on $\theta_l$. $\beta_l$ is the inverse variance of this linear model. The prior of $\beta_l$ is a Gamma distribution $\Gamma(\beta_l \mid a, b) \propto b^a \beta_l^{a-1} e^{-b\beta_l}$ with hyperparameters $a$ and $b$. It is easy to see that, for each expert, $y$ is a Gaussian process defined on $x$.
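To make the expert concrete, the following is a minimal NumPy sketch of the linear-model expert in (1), assuming the squared-exponential kernel defined above; all function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f, g):
    """Squared-exponential kernel with signal scale sigma_f and per-dimension length scales g."""
    return sigma_f ** 2 * np.exp(-0.5 * np.sum(((x1 - x2) / g) ** 2))

def features(x, X_active, sigma_f, g):
    """Feature vector phi_l(x): kernel values between x and the M active-set inputs."""
    return np.array([se_kernel(x, xm, sigma_f, g) for xm in X_active])

def expert_predict(x, X_active, v, beta, sigma_f, g):
    """One expert's predictive Gaussian from Eq. (1): mean v^T phi_l(x), variance 1/beta."""
    phi = features(x, X_active, sigma_f, g)
    return float(v @ phi), 1.0 / beta
```

Here expert_predict returns the mean and variance of a single expert; how experts are weighted and combined is the job of the gating network described next.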
Such a linear model was proposed by Silverman [14] and has been used by sparse Gaussian process models [6, 8]. If we set $h_l^2 = 0$ and $\beta_l = 1/\sigma_{nl}^2$, the joint distribution of the training outputs $\mathbf{Y}$, assuming they all come from the same expert $l$, can be proved to be $\mathcal{N}(\mathbf{Y} \mid \mathbf{0}, K_l + \sigma_{nl}^2 I)$. This has exactly the same form as a regular Gaussian process. However, the largest advantage of this linear model is that it breaks the dependency among $y_{1:N}$ once $t_{1:N}$ are given; i.e.,
$$P(y_{1:N} \mid x_{1:N}, t_{1:N}, v_{1:L}, \theta_{1:L}, I_{1:L}, \beta_{1:L}) = \prod_{n=1}^{N} P(y_n \mid x_n, t_n = l, v_l, \theta_l, I_l, \beta_l).$$
This makes variational inference for the mixture of Gaussian processes feasible.

3.2 Gating network

A gating network determines which expert to use based on the input $x$. We consider a generative gating network, where the expert indicator $t$ is generated by a categorical distribution $P(t = l) = p_l$. $p = [p_1\ p_2\ \ldots\ p_L]$ is given a symmetric Dirichlet distribution $P(p) = \mathrm{Dir}(p \mid \alpha_y/L, \alpha_y/L, \ldots, \alpha_y/L)$. Given the expert indicator $t = l$, we assume that $x$ follows a Gaussian mixture model (GMM) with $C$ components. Each component (cluster) is modeled by a Gaussian distribution $P(x \mid t = l, z = c, m_{lc}, R_{lc}) = \mathcal{N}(x \mid m_{lc}, R_{lc}^{-1})$. $z$ is the cluster indicator, which has a categorical distribution $P(z = c \mid t = l, q_l) = q_{lc}$. In addition, we give $m_{lc}$ a Gaussian prior $\mathcal{N}(m_{lc} \mid m_0, R_0^{-1})$, $R_{lc}$ a Wishart prior $W(R_{lc} \mid r, S)$ and $q_l$ a symmetric Dirichlet prior $\mathrm{Dir}(q_l \mid \alpha_x/C, \alpha_x/C, \ldots, \alpha_x/C)$.

In previous generative gating networks [2-4], the expert indicator also acts as the cluster indicator (i.e., $t = z$), so the inputs for an expert can only have a single Gaussian distribution. In comparison, our model is more flexible, as it models the inputs $x$ for each expert with a Gaussian mixture distribution. One can also put priors (e.g., inverse Gamma distributions) on $\alpha_x$ and $\alpha_y$, as done in [1, 2, 15]. In this paper we treat them as fixed hyperparameters.

3.3 Variational inference

Variational EM algorithm. Given a set of training data $D = \{(x_n, y_n) \mid n = 1:N\}$, the task of learning is to estimate the unknown hyperparameters and to infer the posterior distribution of the parameters. This problem is nicely addressed by the variational EM algorithm. The objective is to maximize $\log P(D \mid \Omega)$ over the hyperparameters $\Omega$. The parameters $\Theta$, expert indicators $T = \{t_{1:N}\}$ and cluster indicators $Z = \{z_{1:N}\}$ are treated as hidden variables, denoted by $\Phi = \{\Theta, T, Z\}$.

It is possible to estimate all hyperparameters via the EM algorithm. However, most of the hyperparameters are generic and are thus fixed as follows. $m_0$ and $R_0$ are set to the mean and inverse covariance of the training inputs, respectively. We fix the degrees of freedom $r = d$ and the scale matrix $S = 100I$ for the Wishart distribution. $\alpha_x$, $\alpha_y$, $C$ and $L$ are all set to 10. Following Bishop & Svensén [12], we set $a = 0.01$ and $b = 0.0001$. Such settings give broad priors to the parameters and make our model sufficiently flexible. Our algorithm was not found to be sensitive to these generic hyperparameters. The only hyperparameters that remain to be estimated are $\theta_{1:L}$ and $I_{1:L}$. Note that these GP-related hyperparameters are problem specific and should not be assumed known.

In the E-step, based on the current estimate of $\Omega$, the posterior probability of the hidden variables $P(\Phi \mid D, \Omega)$ is computed. Variational inference enters in this step by approximating $P(\Phi \mid D, \Omega)$ with a factorized distribution
$$Q(\Phi) = \prod_{l,c} Q(m_{lc}) Q(R_{lc}) \prod_l Q(q_l) Q(v_l) Q(\beta_l)\; Q(p) \prod_n Q(t_n, z_n). \qquad (2)$$
Each hidden variable has the same type of posterior distribution as its conjugate prior.
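As an illustration of one factor in (2), the sketch below computes joint indicator responsibilities $Q(t_n = l, z_n = c)$ for a single data point. The exact variational update uses expectations of log densities under the other $Q$ factors; for readability this sketch plugs in point estimates (posterior means), so it should be read as an approximation of that update rather than the authors' exact formula.

```python
import numpy as np

def gaussian_logpdf_prec(x, mean, prec):
    """log N(x | mean, prec^{-1}), parameterized by the precision matrix."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(prec)
    return 0.5 * (logdet - d * np.log(2.0 * np.pi) - diff @ prec @ diff)

def indicator_responsibilities(x_n, p, q, m, R, expert_loglik):
    """
    Plug-in approximation of Q(t_n = l, z_n = c):
      r[l, c] proportional to p[l] * q[l, c] * N(x_n | m[l][c], R[l][c]^{-1})
                              * exp(expert_loglik[l]),
    where expert_loglik[l] is log N(y_n | v_l^T phi_l(x_n), 1/beta_l),
    e.g. obtained from expert_predict() in the previous sketch.
    """
    L, C = q.shape
    log_r = np.zeros((L, C))
    for l in range(L):
        for c in range(C):
            log_r[l, c] = (np.log(p[l]) + np.log(q[l, c])
                           + gaussian_logpdf_prec(x_n, m[l][c], R[l][c])
                           + expert_loglik[l])
    log_r -= log_r.max()          # subtract max for numerical stability
    r = np.exp(log_r)
    return r / r.sum()            # normalize over all (l, c) pairs
```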
To compute the distribution for a hidden variable $\phi_i$, we need the posterior expectation of $\log P(D, \Phi \mid \Omega)$ over all hidden variables except $\phi_i$: $\langle \log P(D, \Phi \mid \Omega) \rangle_{\Phi / \phi_i}$. The derivation is standard and is thus omitted. Variational inference for each hidden variable takes linear time with respect to $N$, $C$ and $L$, because the factorized form of $P(D, \Phi \mid \Omega)$ leads to a separation of the hidden variables in $\log P(D, \Phi \mid \Omega)$. If we switched from our linear model to a regular Gaussian process, we would encounter a prohibitive complexity of $O(LN^3)$ for integrating $\log P(y_{1:N} \mid x_{1:N}, t_{1:N}, \cdot)$ over $t_{1:N}$.

Also note that $C = L = 10$ represents the maximum number of clusters and experts; the actual number is usually smaller. During the iterations, if a cluster $c$ for expert $l$ does not have a single training sample supporting it (i.e., no $Q(t_n = l, z_n = c) > 0$), this cluster and its associated parameters $m_{lc}$ and $R_{lc}$ are removed. Similarly, we remove an expert $l$ if no $Q(t_n = l) > 0$. These choices of $C$ and $L$ are flexible enough for all our tests, but for more complicated data, larger values may be needed.

In the M-step, we search for the $\Omega$ which maximizes $\langle \log P(D, \Phi \mid \Omega) \rangle$. We employ the conjugate gradient method to estimate $\theta_{1:L}$, similarly to [5]. Both the E-step and the M-step are repeated until the algorithm converges. For better efficiency, we do not select the active sets $I_{1:L}$ in every M-step; instead, we fix $I_{1:L}$ during the EM algorithm and only update $I_{1:L}$ once the EM algorithm converges. The details are given after we introduce the algorithm initialization.

Initialization. Without proper initialization, variational methods can easily be trapped in local optima. Consequently, using pure randomization, one cannot rely on a single result, but has to run the algorithm multiple times and then either pick the best result [12] or average the results [11]. We present a new initialization method that only needs the algorithm to run once. Our method is based on the assumption that the combined data, including both $x$ and $y$, for an expert are usually distributed locally in the combined $(d+1)$-dimensional space. Therefore, clustering methods such as $k$-means can be used to cluster the data, one cluster per expert. Experts are initialized incrementally as follows. First, all training data are used to train one expert. Secondly, we cluster all training data into two clusters and train one expert per cluster. We do this four times in total and collect $L = 1 + 2 + 3 + 4 = 10$ experts. Different experts thus represent different local portions of the training data at different scales. Although our assumption may not hold in some cases (e.g., when one expert's data intersect with another's), this initialization method does give us a meaningful starting point. In practice, we find it effective and reliable.

Active set selection. We now address the problem of selecting the active set $I_l$ of size $M$ that defines the feature vector $\phi_l$ for expert $l$. The posterior distribution $Q(v_l)$ can be proved to be Gaussian with inverse covariance $U_l = \bar{\beta}_l \sum_n T_{nl} \phi_l(x_n) \phi_l(x_n)^T + K_l + h_l^2 I$ and mean $\mu_l = \bar{\beta}_l U_l^{-1} \sum_n T_{nl} y_n \phi_l(x_n)$, where $T_{nl}$ is an abbreviation for $Q(t_n = l)$ and $\bar{\beta}_l$ is the posterior mean of $\beta_l$. Inverting $U_l$ has a complexity of $O(M^3)$. Thus, for small data sets, the active set can be set to the full training set ($M = N$). But for large data sets, we have to select a subset with $M < N$.
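For concreteness, here is a small sketch of the $Q(v_l)$ update quoted above, given the responsibilities $T_{nl}$ and the posterior mean $\bar{\beta}_l$; the matrix Phi stacks the feature vectors $\phi_l(x_n)$ row-wise, and the names are again illustrative.

```python
import numpy as np

def q_v_update(Phi, y, T_l, beta_bar, K_l, h_sq):
    """
    Gaussian posterior Q(v_l) for one expert's weights:
      U_l  = beta_bar * sum_n T_nl phi_l(x_n) phi_l(x_n)^T + K_l + h_sq * I
      mu_l = beta_bar * U_l^{-1} * sum_n T_nl y_n phi_l(x_n)
    Phi: (N, M) matrix of feature vectors; T_l: length-N vector of Q(t_n = l).
    """
    M = Phi.shape[1]
    U = beta_bar * (Phi.T * T_l) @ Phi + K_l + h_sq * np.eye(M)  # inverse covariance
    mu = beta_bar * np.linalg.solve(U, Phi.T @ (T_l * y))        # posterior mean
    return mu, U
```

Solving the linear system rather than forming $U_l^{-1}$ explicitly keeps the $O(M^3)$ cost but is numerically more stable.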
The active set $I_l$ is randomly initialized. With $I_l$ fixed, we run the variational EM algorithm and obtain $Q(\Phi)$ and $\Omega$. We then want to improve the results by updating $I_l$. Our method is inspired by the maximum a posteriori (MAP) criterion used by sparse Gaussian processes [6, 8]. Specifically, the optimization target in our case is $\max_{I_l, v_l} P(v_l \mid D) \approx \max_{I_l, v_l} Q(v_l)$, with the posterior distributions of the other hidden variables fixed. The justification for this choice is that a good $I_l$ should be strongly supported by the data $D$, such that $Q(v_l)$ is highly peaked. Since $Q(v_l)$ is Gaussian, the optimal $v_l$ is always $\mu_l$, and thus this optimization is equivalent to maximizing the determinant of the inverse covariance
$$\max_{I_l} |U_l| = \Big| \bar{\beta}_l \sum_n T_{nl} \phi_l(x_n) \phi_l(x_n)^T + K_l + h_l^2 I \Big|. \qquad (3)$$
Note that if $T_{nl}$ is one for all $n$, our method turns into a MAP-based sparse Gaussian process. However, even in that case, our criterion $\max_{I_l, v_l} P(v_l \mid D)$ differs from the $\max_{I_l, v_l} P(D \mid v_l) P(v_l)$ derived in previous MAP-based work [6, 8]. First, the denominator $P(D)$, which actually depends on $I_l$, is ignored by the previous methods. Secondly, the term $|K_l + h_l^2 I|$ in $P(v_l)$ is also ignored by the previous methods. For these reasons, the previous methods are not true MAP estimation but approximations of it.

Searching for the globally optimal active set of size $M$ is not feasible. Thus, similarly to many sparse Gaussian processes, we consider a greedy algorithm that adds one index to $I_l$ at a time. For a candidate index $i$, computing the new $U_l$ requires $O(NM)$; incrementally updating the Cholesky factorization of $U_l$ requires $O(M^2)$, and computing the new $|U_l|$ needs $O(1)$. Therefore, checking one candidate $i$ takes $O(NM)$. We select the best index from 100 randomly chosen candidates [6, 8], which makes the total time for adding one index $O(NM)$. For adding all $M$ indices, the total time is $O(NM^2)$. Such a complexity is comparable to that of [6], but higher than those of [7, 8]. Note that this time is needed for each of the $L$ experts.

In summary, the variational EM algorithm with active set selection proceeds as follows. During initialization, the training data are clustered and assigned to experts by the $k$-means procedure described above; the data assigned to each expert are used to randomly select the active set and then to train the linear model. During each iteration, we run variational EM to update the parameters and hyperparameters; when the EM algorithm converges, we update the active set and $Q(v_l)$ for each expert. Such an iteration is repeated until convergence.

It is also possible to define the feature vector $\phi_l(x)$ as $[k(x, \bar{x}_1), k(x, \bar{x}_2), \ldots, k(x, \bar{x}_M)]^T$, where each $\bar{x}_m$ is a pseudo-input (Snelson & Ghahramani [9]). In this way, the pseudo-inputs can be viewed as hyperparameters and can be optimized within the same variational EM algorithm, without resorting to a separate update for active sets as we do. This is theoretically more sound. However, it leads to a large number of hyperparameters to be optimized. Although overfitting may not be an issue, the authors cautioned that this method can be vulnerable to local optima.

Predictive distribution. Once training is done, for a test input $x_*$, the predictive distribution $P(y_* \mid x_*, D, \Omega)$ is evaluated as follows:
$$P(y_* \mid x_*, D, \Omega) = \int P(y_* \mid x_*, \Theta, \Omega) P(\Theta \mid D, \Omega)\, d\Theta \approx \int P(y_* \mid x_*, \Theta, \Omega) Q(\Theta)\, d\Theta \approx P(y_* \mid x_*, \bar{p}, \{\bar{q}_l\}, \{\bar{m}_{lc}\}, \{\bar{R}_{lc}\}, \{\bar{v}_l\}, \{\bar{\beta}_l\}, \{\theta_l\}, \{I_l\}). \qquad (4)$$
The first approximation uses the results of the variational inference. Note that the expert indicators $T$ and cluster indicators $Z$ are integrated out. Suppose that there are sufficient training data; then the posterior distributions of all parameters are usually highly peaked. This leads to the second approximation, where the integral reduces to a point evaluation at the posterior mean (denoted by a bar) of each parameter. Eq. (4) can then be computed easily using the standard predictive algorithm for a mixture of linear experts. See the appendix for more details.
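The sketch below shows how the point-estimate form of (4) might be evaluated as a mixture over experts, following (A-1) and (A-2) in the appendix. Parameter names mirror the notation above, the helper predict_expert is assumed to wrap the single-expert prediction of Eq. (1) sketched earlier, and SciPy's multivariate normal supplies the gating densities; this is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_mixture(x_star, p_bar, q_bar, m_bar, R_bar, predict_expert):
    """
    Predictive distribution at x_star as a mixture of expert Gaussians:
      P(y | x_star) = sum_l w_l * N(y | mean_l, var_l),
    with w_l = sum_c P(t=l, z=c | x_star) computed as in (A-2) and
    (mean_l, var_l) = predict_expert(x_star, l) from the expert model (1).
    All parameters are posterior means.
    """
    L, C = q_bar.shape
    joint = np.zeros((L, C))
    for l in range(L):
        for c in range(C):
            gate = multivariate_normal(mean=m_bar[l][c],
                                       cov=np.linalg.inv(R_bar[l][c]))
            joint[l, c] = p_bar[l] * q_bar[l, c] * gate.pdf(x_star)
    joint /= joint.sum()                      # Eq. (A-2), normalized over (l, c)
    weights = joint.sum(axis=1)               # gate weight per expert
    components = [predict_expert(x_star, l) for l in range(L)]
    return weights, components                # mixture weights and (mean, var) pairs
```

The overall predictive mean is then the weight-averaged expert mean, while samples can be drawn by first picking an expert according to the weights and then sampling from its Gaussian.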
4 Test results

For all data sets, we normalize each dimension of the data to zero mean and unit variance before training. After training, to plot the fitting results, we de-normalize the data back to their original scales.

Artificial toy data. We consider the toy data set used by [2], which consists of four continuous functions covering the input ranges (0, 15), (35, 60), (45, 80) and (80, 100), respectively. Different levels of noise (with standard deviations std = 7, 7, 4 and 2) are added to the different functions. This is a challenging multi-modality problem in both the input and output dimensions. Fig. 2 (left) shows 400 points generated by this toy model, each point assigned with equal probability 0.25 to one of the four functions. Using these 400 points as training data, our method found two experts that fit the data nicely, as shown in Fig. 2 (left). In general, expert one represents the last two functions while expert two represents the first two functions. One might prefer to recover each function separately with its own expert. However, note that the first two functions have the same noise level (std = 7), so it is reasonable to use just one GP to model both. In fact, we recovered a very close estimated std of $\sqrt{1/\bar{\beta}_2} = 6.87$ for the second expert. The stds of the last two functions are also close (4 vs. 2), and are similarly close to the $\sqrt{1/\bar{\beta}_1} = 2.48$ of the first expert. Note that the GP for expert one appears to fit the data of the first function comparably well to that of expert two. However, the gating network does not support this: the means of the GMM for expert one do not cover the region of the first function. Ref. [2] and our method performed similarly well in discovering the different modalities in different input regions. We did not plot the mean of the predictive distribution, as this data set has multiple modes in the output dimension. Our results were produced using an active set size of $M = 60$; larger active sets did not give appreciably better results.

Motorcycle data. Our algorithm was also applied to the 2D motorcycle data set [14], which contains 133 data points with input-dependent noise, as shown in Fig. 2 (right). Our algorithm yielded two experts, with the first expert modeling the majority of the points and the second expert only depicting the beginning part. The estimated stds of the two experts are 23.46 and 2.21, respectively. This appears to correctly represent the different levels of noise present in different parts of the data.

Figure 2: Test results for the toy data (left) and the motorcycle data (right). Each data point is assigned to an expert $l$ based on its posterior probability $Q(t_n = l)$ and is referred to as "data for expert l". The means of the GMM for each expert are also shown at the bottom as "m for expert l". In the right figure, the mean of the predictive distribution is shown as a solid line and samples drawn from the predictive distribution are shown as dots (100 samples for each of the 45 horizontal locations).

We also plot the mean of the predictive distribution (4) in Fig. 2 (right).
Our mean result compares favorably with other methods that use medians of mixtures [1, 2]. In particular, our result is similar to that of [1] for inputs up to about 30. For inputs > 35, the result of [1] abruptly becomes flat, while our result is smooth and appears to fit the data better. The result of [2] is jagged, which may suggest that more Gibbs samples are needed for smoother results. In terms of the full predictive (posterior) distribution (represented by the samples in Fig. 2 (right)), our results are better for inputs up to 40, as more artifacts are produced by [1, 2] (especially between 15 and 25). However, our results have more artifacts for inputs > 40, because that region shares the same std = 23.46 as the region where the input is between 15 and 40. The active set size of our method was set to 40. Training using Matlab 7 on a Pentium 2.4 GHz machine took 20 seconds, compared to one hour spent by the Gibbs sampling method of [1].

Robot arm data. We consider the two-link robot arm data set used by [12]. Fig. 3 (left) shows the kinematics of such a 2D robot. The joint angles are limited to the ranges $0.3 \le \theta_1 \le 1.2$ and $\pi/2 \le \theta_2 \le 3\pi/2$. Based on the forward kinematic equations (see [12]), the end point position $(x_1, x_2)$ has a unique solution given the joint angles $(\theta_1, \theta_2)$. However, we are interested in the inverse kinematics problem: given the end point position, we want to estimate the joint angles. We randomly generated 2000 points based on the forward kinematics, with the first 1000 points used for training and the remaining 1000 points for testing. Although noise could be added, we did not do so, to keep our results comparable to those of [12]. Since this problem involves predicting two correlated outputs at the same time, we used an independent set of local experts for each output but let the two outputs share the same gating network. This was easily accommodated in our algorithm. Our algorithm found five experts vs. the 16 experts used by [12]. The average number of GMM components is 3. We use residue plots [12] to present the results (see Fig. 3). Compared to that of [12], our first residue plot is much cleaner, suggesting that our errors are much smaller. This is expected, as we use more powerful GP experts vs. the linear experts used by [12]. The second residue plot (not used in [12]) also gives a clean result but is worse than the first plot. This is because the modality with the smaller posterior probability is more likely to be replaced by false positive modes. The active set size was set to 100; a larger size did not improve the results.

Figure 3: Test results for the robot arm data set. Left: illustration of the robot kinematics (adapted from [12]). Our task is to estimate the joint angles $(\theta_1, \theta_2)$ based on the end point positions. In region B, there are two modalities for the same end point position; in regions A and C, there is only one modality. Middle: the first residue plot. For a test point, its predictive distribution is a Gaussian mixture. The mean of the Gaussian distribution with the highest probability was fed into the forward kinematics to obtain the estimated end point position. A line was drawn between the estimated and real end point positions; the length of the line indicates the magnitude of the error. The average line length (error) is a very small 0.00067, so many lines appear as dots. Right: the second residue plot, using the mean of the Gaussian distribution with the second highest probability, only for region B. The average line length is 0.001. Both residue plots are needed to check whether both modalities are detected correctly.

DELVE data. We applied our algorithm to three widely used DELVE data sets: Boston, Kin-8nm and Pumadyn-32nm. These data sets appear to be single modal, because impressive results have been achieved on them by a single GP. The purpose of this test is to check how our algorithm (intended for multi-modality) handles single modality without knowing it. We followed the standard DELVE testing framework: for the Boston data, there are two tests, each using 128 training examples; for both the Kin-8nm and Pumadyn-32nm data, there are four tests, each using 1024 training examples. Table 1 shows the standardised squared errors for these tests. The scores for all previous methods are copied from Waterhouse [11]. We used the full training set as the active set.
Table 1: Standardised squared errors of different methods on the DELVE data sets. Our method (vmgp) is compared with a single Gaussian process trained using a maximum a posteriori method (gp), a bagged version of MARS (mars), a multi-layer perceptron trained using hybrid MCMC (mlp) and a committee of mixtures of linear experts (me) [11].

Data sets   Boston          Kin-8nm         Pumadyn-32nm
gp          0.194 ± 0.061   0.116 ± 0.006   0.044 ± 0.009
mars        0.157 ± 0.009   0.460 ± 0.013   0.061 ± 0.003
mlp         -               0.094 ± 0.013   0.046 ± 0.023
me          0.159 ± 0.023   0.182 ± 0.020   0.701 ± 0.079
vmgp        0.157 ± 0.002   0.119 ± 0.005   0.041 ± 0.005

Reducing the active set compromised the results, suggesting that for these high dimensional data sets a large number of training examples is required, and that for the present training sets each training example carries information not represented by the others. We started with ten experts and found an average of 2, 1 and 2.75 experts for these data sets, respectively. The average numbers of GMM components for these data sets are 8.5, 10 and 9.5, respectively, indicating that more GMM components are needed for modeling higher dimensional inputs. Our results are comparable to, and sometimes better than, those of the previous methods.

Finally, to test how our active set selection algorithm performs, we conducted a standard test for sparse GPs: 7168 samples from Pumadyn-32nm were used for training and the remaining 1024 for testing. The active set size $M$ was varied from 10 to 150. The error was 0.0569 when $M = 10$, but quickly reduced to 0.0225, the same as the benchmark error in [7], when $M = 25$. We rapidly achieved 0.0196 at $M = 50$, and the error did not decrease after that. This result is better than that of [7] and comparable to the best result of [9].

5 Conclusions

We presented a new mixture of Gaussian processes model and applied a variational Bayesian method to train it. The proposed algorithm nicely addresses the data multi-modality and training complexity issues of a single Gaussian process. Our method achieved results comparable to those of previous MCMC-based models on several 2D data sets. One future direction is to compare all algorithms on high dimensional data so that we can draw more meaningful conclusions. One clear advantage of our method, however, is that training is much faster. This makes our method more suitable for many real-world applications where speed is critical. Our active set selection method works well on the Pumadyn-32nm data set, but this test was done in the context of a mixture of GPs. To make a fair comparison to other sparse GPs, we can set $L = 1$ and also try more data sets. It is worth noting that, in the current implementation, the active set size $M$ is fixed for all experts.
This can be improved by using a smaller $M$ for an expert with a smaller number of supporting training samples.

Acknowledgments

Thanks to Carl Rasmussen and Christopher Williams for sharing the GPML Matlab package.

Appendix

Eq. (4) can be expressed as a weighted sum over all experts, where the hyperparameters and parameters are omitted for brevity:
$$P(y_* \mid x_*) = \sum_l \sum_c P(t = l, z = c \mid x_*) \, P(y_* \mid x_*, t = l). \qquad (A-1)$$
The first term in (A-1), summed over $c$, gives the posterior probability of expert $t = l$; each summand is
$$P(t = l, z = c \mid x_*) = \frac{P(x_* \mid t = l, z = c) \, P(t = l, z = c)}{\sum_{l'} \sum_{c'} P(x_* \mid t = l', z = c') \, P(t = l', z = c')}, \qquad (A-2)$$
where $P(t = l, z = c) = \bar{p}_l \bar{q}_{lc}$. The second term in (A-1) is the predictive probability of $y_*$ given expert $l$, which is Gaussian.

References

[1] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[2] E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[3] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems 7. MIT Press, 1995.
[4] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(10):1223–1241, 2002.
[5] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[6] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[7] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on Artificial Intelligence and Statistics 9, 2003.
[8] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[9] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[10] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.
[11] S. Waterhouse. Classification and Regression using Mixtures of Experts. PhD thesis, Department of Engineering, Cambridge University, 1997.
[12] C. M. Bishop and M. Svensén. Bayesian hierarchical mixtures of experts. In Proc. Uncertainty in Artificial Intelligence, 2003.
[13] V. Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[14] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47(1):1–52, 1985.
[15] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
[16] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.