Bayesian Kernel Shaping for Learning Control

Jo-Anne Ting1, Mrinal Kalakrishnan1, Sethu Vijayakumar2 and Stefan Schaal1,3
1 Computer Science, U. of Southern California, Los Angeles, CA 90089, USA
2 School of Informatics, University of Edinburgh, Edinburgh, EH9 3JZ, UK
3 ATR Computational Neuroscience Labs, Kyoto 619-02, Japan

Abstract

In kernel-based regression learning, optimizing each kernel individually is useful when the data density, curvature of regression surfaces (or decision boundaries) or magnitude of output noise varies spatially. Previous work has suggested gradient descent techniques or complex statistical hypothesis methods for local kernel shaping, typically requiring some amount of manual tuning of meta parameters. We introduce a Bayesian formulation of nonparametric regression that, with the help of variational approximations, results in an EM-like algorithm for simultaneous estimation of regression and kernel parameters. The algorithm is computationally efficient, requires no sampling, automatically rejects outliers and has only one prior to be specified. It can be used for nonparametric regression with local polynomials or as a novel method to achieve nonstationary regression with Gaussian processes. Our methods are particularly useful for learning control, where reliable estimation of local tangent planes is essential for adaptive controllers and reinforcement learning. We evaluate our methods on several synthetic data sets and on an actual robot which learns a task-level control law.

1 Introduction

Kernel-based methods have been highly popular in statistical learning, starting with Parzen windows, kernel regression, locally weighted regression and radial basis function networks, and leading to newer formulations such as Reproducing Kernel Hilbert Spaces, Support Vector Machines, and Gaussian process regression [1]. Most algorithms start with parameterizations that are the same for all kernels, independent of where in data space the kernel is used, but later recognize the advantage of locally adaptive kernels [2, 3, 4]. Such locally adaptive kernels are useful in scenarios where the data characteristics vary greatly in different parts of the workspace (e.g., in terms of data density, curvature and output noise). For instance, in Gaussian process (GP) regression, using a nonstationary covariance function, e.g., [5], allows for such a treatment. Performing optimizations individually for every kernel, however, becomes rather complex and is prone to overfitting due to a flood of open parameters. Previous work has suggested gradient descent techniques with cross-validation methods or involved statistical hypothesis testing for optimizing the shape and size of a kernel in a learning system [6, 7].

In this paper, we consider local kernel shaping by averaging over data samples with the help of locally polynomial models and formulate this approach, in a Bayesian framework, for both function approximation with piecewise linear models and nonstationary GP regression. Our local kernel shaping algorithm is computationally efficient (capable of handling large data sets), can deal with functions of strongly varying curvature, data density and output noise, and even rejects outliers automatically. Our approach to nonstationary GP regression differs from previous work by avoiding Markov Chain Monte Carlo (MCMC) sampling [8, 9] and by exploiting the full nonparametric characteristics of GPs in order to accommodate nonstationary data.
One of the core application domains for our work is learning control, where computationally efficient function approximation and highly accurate local linearizations from data are crucial for deriving controllers and for optimizing control along trajectories [10]. The high variations from fitting noise, seen in Fig. 3, are harmful to the learning system, potentially causing the controller to be unstable. Our final evaluations illustrate such a scenario by learning an inverse kinematics model for a real robot arm.

2 Bayesian Local Kernel Shaping

We develop our approach in the context of nonparametric locally weighted regression with locally linear polynomials [11], assuming, for notational simplicity, only a one-dimensional output; extensions to multi-output settings are straightforward. We assume a training set of N samples, D = {x_i, y_i}_{i=1}^N, drawn from a nonlinear function y = f(x) + ε that is contaminated with mean-zero (but potentially heteroscedastic) noise ε. Each data sample consists of a d-dimensional input vector x_i and an output y_i. We wish to approximate a locally linear model of this function at a query point x_q ∈ R^{d×1} in order to make a prediction y_q = b^T x_q, where b ∈ R^{d×1}. We assume the existence of a spatially localized weighting kernel w_i = K(x_i, x_q, h) that assigns a weight to every {x_i, y_i} according to its Euclidean distance in input space from the query point x_q. A popular choice for K is the Gaussian kernel, but other kernels may be used as well [11]. The bandwidth h ∈ R^{d×1} of the kernel is the crucial parameter that determines the local model's quality of fit. Our goal is to find a Bayesian formulation of determining b and h simultaneously.

2.1 Model

For the locally linear model at the query point x_q, we can introduce hidden random variables z [12] and modify the linear model y_i = b^T x_i so that y_i = Σ_{m=1}^d z_im + ε, where z_im = b_m^T x_im + ε_zm; ε ~ Normal(0, σ²) and ε_zm ~ Normal(0, ψ_zm) are both additive noise terms. Note that x_im = [x_im 1]^T and b_m = [b_m b_m0]^T, where x_im is the mth coefficient of x_i, b_m is the mth coefficient of b and b_m0 is the offset value. The z variables allow us to derive computationally efficient O(d) EM-like updates, as we will see later. The prediction at the query point x_q is then Σ_{m=1}^d b_m^T x_qm. We assume the following prior distributions for our model, shown graphically in Fig. 1:

$$
p(y_i | z_i) \sim \text{Normal}(1^T z_i, \sigma^2) \qquad
p(z_{im} | x_{im}) \sim \text{Normal}(b_m^T x_{im}, \psi_{zm})
$$
$$
p(b_m | \psi_{zm}) \sim \text{Normal}(0, \psi_{zm} \Sigma_{b_m,0}) \qquad
p(\psi_{zm}) \sim \text{Scaled-Inv-}\chi^2(n_{m0}, \psi_{zm,0})
$$

where 1 is a vector of 1s, z_i ∈ R^{d×1}, z_im is the mth coefficient of z_i, and Σ_{b_m,0} is the prior covariance matrix of b_m and a 2 × 2 diagonal matrix. n_{m0} and ψ_{zm,0} are the prior parameters of the Scaled-inverse-χ² distribution (n_{m0} is the number of degrees of freedom parameter and ψ_{zm,0} is the scale parameter). The Scaled-Inverse-χ² distribution was used for ψ_zm since it is the conjugate prior for the variance parameter of a Gaussian distribution.

Figure 1: Graphical model. Random variables are in circles, and observed random variables are in shaded double circles.
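For readers who prefer code, here is a minimal sketch (ours, not the authors') that samples one local model from the priors listed above. The variable names and hyperparameter values are illustrative assumptions, and Σ_{b_m,0} is set to the identity purely so that the draw is readable; the uninformative choice 10^6 I discussed later is meant for inference, not simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2                       # number of samples and input dimensions (illustrative)
sigma2 = 0.01                      # output noise variance sigma^2
n_m0, psi_zm0 = 2.0, 0.05          # Scaled-Inv-chi^2 prior parameters (assumed values)
Sigma_bm0 = np.eye(2)              # prior covariance of b_m (identity here for readability)

# psi_zm ~ Scaled-Inv-chi^2(n_m0, psi_zm0), drawn via nu * s^2 / chi^2(nu)
psi_zm = n_m0 * psi_zm0 / rng.chisquare(n_m0, size=d)

# b_m ~ Normal(0, psi_zm * Sigma_bm0); each b_m stacks a slope and an offset
b = np.stack([rng.multivariate_normal(np.zeros(2), p * Sigma_bm0) for p in psi_zm])

# z_im = b_m^T [x_im, 1] + eps_zm   and   y_i = sum_m z_im + eps
X = rng.uniform(-1.0, 1.0, size=(N, d))
Z = np.stack([X[:, m] * b[m, 0] + b[m, 1] + rng.normal(0.0, np.sqrt(psi_zm[m]), N)
              for m in range(d)], axis=1)
y = Z.sum(axis=1) + rng.normal(0.0, np.sqrt(sigma2), N)
```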
In contrast to classical treatments of Bayesian weighted regression [13], where the weights enter as a heteroscedastic correction on the noise variance of each data sample, we associate a scalar indicator-like weight, w_i ∈ {0, 1}, with each sample {x_i, y_i} in D. The sample is fully included in the local model if w_i = 1 and excluded if w_i = 0. We define the weight w_i to be w_i = ∏_{m=1}^d w_im, where w_im is the weight component in the mth input dimension. While previous methods model the weighting kernel K as some explicit function, we model the weights w_im as Bernoulli-distributed random variables, i.e., p(w_im) ~ Bernoulli(q_im), choosing a symmetric bell-shaped function for the parameter q_im:

$$ q_{im} = \frac{1}{1 + (x_{im} - x_{qm})^{2r} h_m} $$

where x_qm is the mth coefficient of x_q, h_m is the mth coefficient of h, and r is a positive integer. The difference (x_im - x_qm) is taken to the power 2r in order to ensure that the resulting expression is positive; adjusting r affects how long the tails of the kernel are, and we use r = 2 for all our experiments. As pointed out in [11], the particular mathematical formulation of a weighting kernel is largely computationally irrelevant for locally weighted learning. Our choice of function for q_im was dominated by the desire to obtain analytically tractable learning updates. We place a Gamma prior over the bandwidth h_m, i.e., p(h_m) ~ Gamma(a_{hm0}, b_{hm0}), where a_{hm0} and b_{hm0} are parameters of the Gamma distribution, to ensure a positive weighting kernel width.

2.2 Inference

We can treat the entire regression problem as an EM learning problem [14, 15] and maximize the log likelihood log p(y|X) for generating the observed data. We can maximize this incomplete log likelihood by maximizing the expected value of the complete log likelihood log p(y, Z, b, w, h, σ², ψ_z | X) = log ∏_{i=1}^N p(y_i, z_i, b, w_i, h, σ², ψ_z | x_i). In our model, each data sample i has an indicator-like scalar weight w_i associated with it, allowing us to express the complete log likelihood L, in a similar fashion to mixture models, as:

$$
L = \log \prod_{i=1}^N \Big\{ \big[ p(y_i | z_i, \sigma^2)\, p(z_i | x_i, b, \psi_z) \big]^{w_i} \prod_{m=1}^d p(w_{im}) \Big\} \prod_{m=1}^d p(b_m | \psi_{zm})\, p(\psi_{zm})\, p(h_m)\; p(\sigma^2)
$$

Expanding the log p(w_im) term from the expression above results in a problematic -log(1 + (x_im - x_qm)^{2r} h_m) term that prevents us from deriving an analytically tractable expression for the posterior of h_m. To address this, we use a variational approach on concave/convex functions suggested by [16] to produce analytically tractable expressions. We can find a lower bound on the term so that

$$ -\log\!\big(1 + (x_{im} - x_{qm})^{2r} h_m\big) \;\geq\; -\lambda_{im} (x_{im} - x_{qm})^{2r} h_m $$

up to additive terms that do not depend on h_m, where λ_im is a variational parameter to be optimized in the M-step of our final EM-like algorithm. Our choice of weighting kernel allows us to find a lower bound to L in this manner. We explored the use of other weighting kernels (e.g., a quadratic negative exponential), but had issues with finding a lower bound to the problematic terms in log p(w_im) such that analytically tractable inference for h_m could be done.

The resulting lower bound to L is L̂; due to lack of space, we give the expression for L̂ in the appendix. The expectation of L̂ should be taken with respect to the true posterior distribution of all hidden variables Q(b, ψ_z, z, h). Since this is not analytically tractable, a lower bound can be formulated using a technique from variational calculus where we make a factorial approximation of the true posterior, e.g., Q(b, ψ_z, z, h) ≈ Q(b, ψ_z)Q(z)Q(h) [15], that allows the resulting posterior distributions over hidden variables to become analytically tractable. The posterior of w_im, p(w_im = 1 | y_i, z_i, x_i, θ, {w_ik}_{k≠m}), is inferred using Bayes' rule:

$$
p(w_{im}=1 | y_i, z_i, x_i, \theta, \{w_{ik}\}_{k \neq m}) =
\frac{ p(y_i, z_i | x_i, \theta, \{w_{ik}\}_{k \neq m}, w_{im}=1) \prod_{t=1, t \neq m}^d \langle w_{it} \rangle \, p(w_{im}=1) }
     { p(y_i, z_i | x_i, \theta, \{w_{ik}\}_{k \neq m}, w_{im}=1) \prod_{t=1, t \neq m}^d \langle w_{it} \rangle \, p(w_{im}=1) + p(w_{im}=0) } \quad (1)
$$

where θ = {b, ψ_z, h} and {w_ik}_{k≠m} denotes the set of weights {w_ik}_{k=1, k≠m}. For the dimension m, we account for the effect of weights in the other d - 1 dimensions. This is a result of w_i being defined as the product of weights in all dimensions. The posterior mean of w_im is then ⟨w_im⟩ = p(w_im = 1 | y_i, z_i, x_i, θ, {w_ik}_{k≠m}), and ⟨w_i⟩ = ∏_{m=1}^d ⟨w_im⟩, where ⟨·⟩ denotes the expectation operator.
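To see how the bound behaves, the following sketch (ours) checks numerically that the tangent-line lower bound on the convex term -log(1 + s·h_m), with s = (x_im - x_qm)^{2r}, holds for all h_m and is tight at the point h_0 where it is constructed. The slope corresponds to the variational parameter λ_im = 1/(1 + s·h_0), which is the value q_im = λ_im reported below together with the EM updates.

```python
import numpy as np

def neg_log_term(s, h):
    """The problematic term -log(1 + s*h), with s = (x_im - x_qm)^(2r)."""
    return -np.log(1.0 + s * h)

def tangent_lower_bound(s, h, h0):
    """Tangent of the convex function -log(1 + s*h) at h0.

    The slope is -lam*s with lam = 1/(1 + s*h0), i.e., the variational
    parameter lambda_im. Only the -lam*s*h part depends on h; the rest
    is constant in h and drops out of the M-step for h_m.
    """
    lam = 1.0 / (1.0 + s * h0)
    return -lam * s * h + (lam * s * h0 - np.log(1.0 + s * h0))

s, h0 = 0.7, 2.0                                     # illustrative values
h = np.linspace(1e-3, 10.0, 1000)
gap = neg_log_term(s, h) - tangent_lower_bound(s, h, h0)
assert np.all(gap >= -1e-12)                         # bound holds everywhere
assert abs(gap[np.argmin(np.abs(h - h0))]) < 1e-3    # and is tight at h0
```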
We omit the full set of posterior EM update equations (please refer to the appendix for these) and list only the posterior updates for h_m, w_im, b_m and z_i:

$$
\langle b_m \rangle = \Big( \sum_{i=1}^N \frac{\langle w_i \rangle\, x_{im} x_{im}^T}{\psi_{zm,N}} + \Sigma_{b_m,0}^{-1} \Big)^{-1} \sum_{i=1}^N \frac{\langle w_i \rangle \langle z_{im} \rangle\, x_{im}}{\psi_{zm,N}}
$$
$$
\langle z_i | y_i, x_i \rangle = \frac{\langle w_i \rangle}{s_i} \Psi_{zN} 1\, y_i + \Big( I_{d,d} - \frac{\langle w_i \rangle}{s_i} \Psi_{zN} 1 1^T \Big) \langle b \rangle_{x_i}
$$
$$
\langle w_{im} \rangle = \frac{ q_{im} A_i \prod_{k=1, k \neq m}^d \langle w_{ik} \rangle }{ q_{im} A_i \prod_{k=1, k \neq m}^d \langle w_{ik} \rangle + 1 - q_{im} }
\qquad
\langle h_m \rangle = \frac{ a_{hm0} + N - \sum_{i=1}^N \langle w_{im} \rangle }{ b_{hm0} + \sum_{i=1}^N \lambda_{im} (x_{im} - x_{qm})^{2r} }
$$

where I_{d,d} is a d × d identity matrix, ⟨b⟩_{x_i} is a d × 1 vector with coefficients ⟨b_m⟩^T x_im, ⟨w_i⟩ = ∏_{m=1}^d ⟨w_im⟩, Ψ_{zN} is a diagonal matrix with the current estimates ψ_{zm,N} on its diagonal, s_i = σ² + ⟨w_i⟩ 1^T Ψ_{zN} 1 (to avoid division by zero, ⟨w_i⟩ needs to be capped to some small non-zero value), q_im = λ_im = 1/(1 + (x_im - x_qm)^{2r} ⟨h_m⟩), and A_i = Normal(y_i; 1^T ⟨z_i⟩, σ²) ∏_{m=1}^d Normal(⟨z_im⟩; ⟨b_m⟩^T x_im, ψ_zm).

Closer examination of the expression for ⟨b_m⟩ shows that it is a standard Bayesian weighted regression update [13], i.e., a data sample i with lower weight ⟨w_i⟩ will be downweighted in the regression. Since the weights are influenced by the residual error at each data point (see the posterior update for w_im), an outlier will be downweighted appropriately and eliminated from the local model. Fig. 2 shows how local kernel shaping is able to ignore outliers that a classical GP fits.

Figure 2: Effect of outliers (in black circles). The panel compares the training data with the fits of a stationary GP and of kernel shaping.

A few remarks should be made regarding the initialization of the priors used in the posterior EM updates. Σ_{b_m,0} can be set to 10^6 I to reflect a large uncertainty associated with the prior distribution of b. The initial noise variance, ψ_{zm,0}, should be set to the best guess on the noise variance. To adjust the strength of this prior, n_{m0} can be set to the number of samples one believes to have seen with noise variance ψ_{zm,0}. Finally, the initial h of the weighting kernel should be set so that the kernel is broad and wide. We use values of a_{hm0} = b_{hm0} = 10^{-6} so that h_{m0} = 1 with high uncertainty. Note that some sort of initial belief about the noise level is needed to distinguish between noise and structure in the training data. Aside from the initial prior on ψ_zm, we used the same priors for all data sets in our evaluations.

2.3 Computational Complexity

For one local model, the EM update equations have a computational complexity of O(Nd) per EM iteration, where d is the number of input dimensions and N is the size of the training set. This efficiency arises from the introduction of the hidden random variables z_i, which allows the posterior moments of z_i to be computed in O(d) and avoids a d × d matrix inversion that would typically require O(d³). Some nonstationary GP methods, e.g., [5], require O(N³) + O(N²) for training and prediction, while other more efficient stationary GP methods, e.g., [17], require O(M²N) + O(M²) training and prediction costs (where M << N is the number of pseudo-inputs used in [17]). Our algorithm requires O(Nd I_EM), where I_EM is the number of EM iterations (we use a maximal cap of 1000 iterations).
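The sketch below strings the listed updates together for the simplest case, d = 1, at a single query point. It is not the authors' implementation: the noise variances σ² and ψ_z are held fixed (their posterior updates are among those omitted above), the weight is clipped away from zero as the text suggests, and the update forms follow our reconstruction of the equations.

```python
import numpy as np

def local_kernel_shaping_1d(x, y, xq, r=2, n_iter=200,
                            a_h0=1e-6, b_h0=1e-6, sigma2=0.1, psi_z=0.1):
    """Sketch of Bayesian local kernel shaping for one local model, d = 1.

    Fits a locally linear model y ~ b^T [x, 1] around the query point xq while
    adapting the bandwidth h. sigma2 and psi_z are treated as fixed here.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    X = np.column_stack([x, np.ones(N)])           # x_i1 = [x_i, 1]^T
    dist = (x - xq) ** (2 * r)                     # (x_i1 - x_q1)^{2r}
    Sigma_b0_inv = np.linalg.inv(1e6 * np.eye(2))  # broad prior on b

    h = 1.0                                        # broad initial kernel (h_0 = 1)
    b = np.zeros(2)
    w = np.ones(N)                                 # <w_i>; for d = 1, <w_i> = <w_i1>

    for _ in range(n_iter):
        # Posterior mean of the hidden variables z_i (d = 1 case)
        s = sigma2 + w * psi_z                     # s_i = sigma^2 + <w_i> 1^T Psi_z 1
        b_x = X @ b                                # <b>_{x_i}
        z = b_x + (w / s) * psi_z * (y - b_x)      # <z_i>

        # Posterior mean of b: a Bayesian weighted regression
        A = (X * (w / psi_z)[:, None]).T @ X + Sigma_b0_inv
        b = np.linalg.solve(A, X.T @ (w * z / psi_z))

        # Posterior weights <w_i>: Bernoulli prior q_i times data likelihood A_i
        q = 1.0 / (1.0 + dist * h)
        A_i = (np.exp(-0.5 * (y - z) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
               * np.exp(-0.5 * (z - X @ b) ** 2 / psi_z) / np.sqrt(2 * np.pi * psi_z))
        w = np.clip(q * A_i / (q * A_i + 1.0 - q), 1e-10, 1.0)

        # M-step for the bandwidth, with variational parameters lambda_i = q_i
        h = (a_h0 + N - w.sum()) / (b_h0 + np.sum(q * dist))

    return b @ np.array([xq, 1.0]), h, w           # prediction y_q, bandwidth, weights
```

In practice one would call such a routine once per query point, which is what produces the spatially varying bandwidths illustrated in Fig. 3.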
Our algorithm also does not require any MCMC sampling as in [8, 9], making it more appealing for real-time applications.

3 Extension to Gaussian Processes

We can apply the algorithm in section 2 not only to locally weighted learning with linear models, but also to derive a nonstationary GP method. A GP is defined by a mean and a covariance function, where the covariance function K captures dependencies between any two points as a function of the corresponding inputs, i.e., k(x_i, x_j) = cov(f(x_i), f(x_j)), where i, j = 1, .., N. Standard GP models use a stationary covariance function, where the covariance between any two points in the training data is a function of the distances |x_i - x_j|, not of their locations. Stationary GPs perform suboptimally for functions that have different properties in various parts of the input space (e.g., discontinuous functions), where the stationary assumption fails to hold. Various methods have been proposed to specify nonstationary GPs. These include defining a nonstationary Matérn covariance function [5], adopting a mixture of local experts approach [18, 8, 9] to use independent GPs to cover data in different regions of the input space, and using multidimensional scaling to map a nonstationary spatial GP into a latent space [19].

Given the data set D drawn from the function y = f(x) + ε, as previously introduced in section 2, we propose an approach to specify a nonstationary covariance function. Assuming the use of a quadratic negative exponential covariance function, the covariance function of a stationary GP is

$$ k(x_i, x_j) = v_1 \exp\Big( -0.5 \sum_{m=1}^d h_m (x_{im} - x_{jm})^2 \Big) + v_0 $$

where the hyperparameters {h_1, h_2, ..., h_d, v_0, v_1} are optimized. In a nonstationary GP, the covariance function could then take the form

$$ k(x_i, x_j) = v_1 \exp\Big( -0.5 \sum_{m=1}^d (x_{im} - x_{jm})^2 \frac{h_{im} h_{jm}}{h_{im} + h_{jm}} \Big) + v_0 $$

where h_im is the bandwidth of the local model centered at x_im and h_jm is the bandwidth of the local model centered at x_jm. This form is derived from the definition of K as a positive semi-definite matrix, where the integral is over the product of two quadratic negative exponentials, one with parameter h_im and the other with parameter h_jm. We first learn the values of {h_im}_{m=1}^d for all training data samples i = 1, ..., N using our proposed local kernel shaping algorithm and then optimize the hyperparameters v_0 and v_1. To make a prediction for a test sample x_q, we also learn the values of {h_qm}_{m=1}^d, i.e., the bandwidths of the local model centered at x_q. Importantly, since the covariance function of the GP is derived from locally constant models, we learn with locally constant, instead of locally linear, polynomials. We use r = 1 for the weighting kernel in order to keep the degree of nonlinearity consistent with that in the covariance function (i.e., quadratic). Even though the weighting kernel used in the local kernel shaping algorithm is not a quadratic negative exponential, it has a similar bell shape, but with a flatter top and shorter tails. Because of this, our augmented GP is an approximated form of a nonstationary GP. Nonetheless, it is able to capture nonstationary properties of the function f without needing MCMC sampling, unlike previously proposed nonstationary GP methods [8, 9].
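As an illustration, the sketch below assembles the augmented covariance matrix from per-point bandwidths and uses it in a standard GP mean prediction. This is our own sketch: the h_im h_jm/(h_im + h_jm) weighting follows our reconstruction of the equation above, and the added observation-noise term noise_var is the usual GP ingredient rather than something stated in the text.

```python
import numpy as np

def augmented_cov(X1, X2, H1, H2, v0, v1):
    """Nonstationary covariance built from local kernel shaping bandwidths.

    X1: (N1, d) inputs, H1: (N1, d) bandwidths h_im learned at those inputs;
    X2: (N2, d) inputs, H2: (N2, d) bandwidths h_jm; v0, v1: GP hyperparameters.
    """
    X1, X2, H1, H2 = map(np.asarray, (X1, X2, H1, H2))
    # Pairwise per-dimension precision h_im * h_jm / (h_im + h_jm)
    prec = (H1[:, None, :] * H2[None, :, :]) / (H1[:, None, :] + H2[None, :, :])
    d2 = np.sum(prec * (X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return v1 * np.exp(-0.5 * d2) + v0

def gp_predict(Xtr, ytr, H, Xq, Hq, v0, v1, noise_var):
    """Standard GP mean prediction with the augmented covariance.

    Assumes the bandwidths H (training) and Hq (test) were obtained beforehand
    from the local kernel shaping algorithm with locally constant models.
    """
    K = augmented_cov(Xtr, Xtr, H, H, v0, v1) + noise_var * np.eye(len(Xtr))
    Kq = augmented_cov(Xq, Xtr, Hq, H, v0, v1)
    return Kq @ np.linalg.solve(K, np.asarray(ytr, float))
```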
4 Experimental Results

4.1 Synthetic Data

First, we show our local kernel shaping algorithm's bandwidth adaptation abilities on several synthetic data sets, comparing it to a stationary GP and our proposed augmented nonstationary GP. For ease of visualization, we consider the following one-dimensional functions, similar to those in [5]: i) a function with a discontinuity, ii) a spatially inhomogeneous function, and iii) a straight line function. The data set for function i) consists of 250 training samples, 201 test inputs (evenly spaced across the input space) and output noise with σ² = 0.3025; the data set for function ii) consists of 250 training samples, 101 test inputs and an output signal-to-noise ratio (SNR) of 10; and the data set for function iii) has 50 training samples, 21 test inputs and an output SNR of 100.

Fig. 3 shows the predicted outputs of a stationary GP, the augmented nonstationary GP and the local kernel shaping algorithm for data sets i)-iii). The local kernel shaping algorithm smoothes over regions where a stationary GP overfits, and yet it still manages to capture regions of highly varying curvature, as seen in Figs. 3(a) and 3(b). It correctly adjusts the bandwidths h with the curvature of the function. When the data looks linear, the algorithm opens up the weighting kernel so that all data samples are considered, as Fig. 3(c) shows. Our proposed augmented nonstationary GP can handle the nonstationary nature of the data sets as well, and its performance is quantified in Table 1. Returning to our motivation to use these algorithms to obtain linearizations for learning control, it is important to realize that the high variations from fitting noise, as shown by the stationary GP in Fig. 3, are detrimental for learning algorithms, as the slope (or tangent hyperplane, for high-dimensional data) would be wrong. Table 1 reports the normalized mean squared prediction error (nMSE) values for the function i) and function ii) data sets, averaged over 20 random data sets.

Figure 3: Predicted outputs using a stationary GP, our augmented nonstationary GP and local kernel shaping for (a) function i), (b) function ii) and (c) function iii). The bottom panels show the bandwidths h learnt by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for input query points x_q (shown in red circles).

Fig. 4 shows results of the local kernel shaping algorithm and the proposed augmented nonstationary GP on the "real-world" motorcycle data set [20], consisting of 133 samples (with 80 equally spaced input query points used for prediction). We also show results from a previously proposed MCMC-based nonstationary GP method: the alternate infinite mixture of GP experts [9]. We can see that the augmented nonstationary GP and the local kernel shaping algorithm both capture the leftmost flatter region of the function, as well as some of the more nonlinear and noisier regions after 30 ms.

Figure 4: Motorcycle impact data set from [20] (acceleration in g versus time in ms), with predicted results shown for our augmented GP and local kernel shaping algorithms: (a) alternate infinite mixture of GPs, (b) augmented nonstationary GP, (c) local kernel shaping. Results from the alternate infinite mixture of GP experts (AiMoGPE) are taken from [9].
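For completeness, the error measure and noise levels quoted above can be written down directly. The small sketch below assumes the usual conventions, namely nMSE as mean squared error divided by the variance of the targets and SNR as the ratio of signal variance to noise variance; these conventions are our assumption, not spelled out in the text.

```python
import numpy as np

def nmse(y_pred, y_true):
    """Normalized mean squared error: MSE divided by the output variance."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

def add_noise_for_snr(f_clean, snr, rng=np.random.default_rng(0)):
    """Add Gaussian output noise so that var(signal) / var(noise) = snr."""
    f_clean = np.asarray(f_clean, float)
    noise_var = np.var(f_clean) / snr
    return f_clean + rng.normal(0.0, np.sqrt(noise_var), size=f_clean.shape)
```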
4.2 Robot Data

Next, we move on to an example application: learning an inverse kinematics model for a 3 degree-of-freedom (DOF) haptic robot arm (manufactured by SensAble, shown in Fig. 5(a)) in order to control the end-effector along a desired trajectory. This will allow us to verify that the kernel shaping algorithm can successfully deal with a large, noisy real-world data set with outliers and non-stationary properties -- typical characteristics of most control learning problems.

We collected 60,000 data samples from the arm while it performed random sinusoidal movements within a constrained box volume of Cartesian space. Each sample consists of the arm's joint angles q, joint velocities q̇, end-effector position in Cartesian space x, and end-effector velocities ẋ. From this data, we first learn a forward kinematics model: ẋ = J(q) q̇, where J(q) is the Jacobian matrix. The transformation from q̇ to ẋ can be assumed to be locally linear at a particular configuration q of the robot arm. We learn the forward model using kernel shaping, building a local model around each training point only if that point is not already sufficiently covered by an existing local model (e.g., having an activation weight of less than 0.2). Using insights into robot geometry, we localize the models only with respect to q, while the regression of each model is trained only on a mapping from q̇ to ẋ -- these geometric insights are easily incorporated as priors in the Bayesian model. This procedure resulted in 56 local models being built to cover the entire space of training data.

We artificially introduce a redundancy in our inverse kinematics problem on the 3-DOF arm by specifying the desired trajectory (x, ẋ) only in terms of x, z positions and velocities, i.e., the movement is supposed to be in a vertical plane in front of the robot. Analytically, the inverse kinematics equation is q̇ = J#(q) ẋ - (I - J# J) ∂g/∂q, where J#(q) is the pseudo-inverse of the Jacobian. The second term is an optimal solution to the redundancy problem, specified here by a cost function g in terms of joint angles q. To learn a model for J#, we can reuse the local regions of q from the forward model, where J# is also locally linear. The redundancy issue can be solved by applying an additional weight to each data point according to a reward function [21]. In our case, the task is specified in terms of {x, z}, so we define a reward based on a desired y coordinate, y_des, that we would like to enforce as a soft constraint. Our reward function is g = exp(-0.5 h (k(y_des - y) - ẏ)²), where k is a gain and h specifies the steepness of the reward. This ensures that the learnt inverse model chooses a solution with a ẏ that pushes the y coordinate toward y_des. We invert each forward local model using a weighted linear regression, where each data point is weighted by the weight from the forward model and additionally weighted by the reward.

We test the performance of this inverse model (Learnt IK) in a figure-eight tracking task, as shown in Fig. 5(b). As seen, the learnt model performs as well as the analytical inverse kinematics solution (Analytical IK), with root mean squared tracking errors in positions and velocities very close to those of the analytical solution. This demonstrates that kernel shaping is an effective learning algorithm for use in robot control learning applications.

Figure 5: Desired versus actual trajectories for the SensAble Phantom robot arm: (a) the robot arm; (b) desired versus actual trajectories (x and z positions in m) for the Analytical IK and Learnt IK solutions.

Table 1: Average normalized mean squared prediction error values for a stationary GP model, our augmented nonstationary GP and local kernel shaping, averaged over 20 random data sets.

Method                        Function i)        Function ii)
Stationary GP                 0.1251 ± 0.013     0.0230 ± 0.0047
Augmented nonstationary GP    0.0110 ± 0.0078    0.0212 ± 0.0067
Local Kernel Shaping          0.0092 ± 0.0068    0.0217 ± 0.0058
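As an illustration of the reward-weighted inversion described above, the sketch below inverts one local forward model with a doubly weighted linear regression. It is our own sketch: the function and variable names are ours, the gain k, steepness h and y_des values are placeholders, and an offset term is omitted for brevity.

```python
import numpy as np

def reward(y, ydot, y_des, k=1.0, h=10.0):
    """Reward g = exp(-0.5 * h * (k*(y_des - y) - ydot)^2), as in Section 4.2."""
    return np.exp(-0.5 * h * (k * (y_des - y) - ydot) ** 2)

def invert_local_model(Qdot, Xdot, w_fwd, y, ydot, y_des):
    """Weighted linear regression from task-space to joint-space velocities.

    Qdot:  (N, n_joints) joint velocities; Xdot: (N, n_task) task velocities;
    w_fwd: (N,) activation weights of this local model from the forward model.
    Each sample is weighted by w_fwd and additionally by the reward, so the
    learnt mapping (a local estimate of J#) prefers solutions pushing y to y_des.
    """
    w = w_fwd * reward(y, ydot, y_des)
    WX = Xdot * w[:, None]
    # Solve (Xdot^T W Xdot) B = Xdot^T W Qdot for the local inverse map B
    B, *_ = np.linalg.lstsq(WX.T @ Xdot, WX.T @ Qdot, rcond=None)
    return B   # predict joint velocities as xdot_desired @ B
```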
Applying any arbitrary nonlinear regression method (such as a GP) to the inverse kinematics problem would, in fact, lead to unpredictably bad performance. The inverse kinematics problem is a one-to-many mapping and requires careful design of a learning problem to avoid problems with non-convex solution spaces [22]. Our suggested method of learning linearizations with a forward mapping (which is a proper function), followed by learning an inverse mapping within the local region of the forward mapping, is one of the few clean approaches to the problem. Instead of using locally linear methods, one could also use density-based estimation techniques like mixture models [23]. However, these methods must select the correct mode in order to arrive at a valid solution, and this final step may be computationally intensive or involve heuristics. For these reasons, applying an MCMC-type approach or GP-based method to the inverse kinematics problem was omitted as a comparison.

5 Discussion

We presented a full Bayesian treatment of nonparametric local multi-dimensional kernel adaptation that simultaneously estimates the regression and kernel parameters. The algorithm can also be integrated into nonlinear algorithms, offering a valuable and flexible tool for learning. We show that our local kernel shaping method is particularly useful for learning control, demonstrating results on an inverse kinematics problem, and envision extensions to more complex problems with redundancy, e.g., learning inverse dynamics models of complete humanoid robots. Note that our algorithm requires only one prior to be set by the user, i.e., the prior on the output noise. All other biases are initialized the same for all data sets and kept uninformative. In its current form, our Bayesian kernel shaping algorithm is well suited for high-dimensional inputs due to its low computational complexity: it scales linearly with the number of input dimensions. However, numerical problems may arise in the case of redundant and irrelevant input dimensions. Future work will address this issue through the use of an automatic relevance determination feature. Other future extensions include an online implementation of the local kernel shaping algorithm.

References
[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1995.
[2] J. H. Friedman. A variable span smoother. Technical report, Stanford University, 1984.
[3] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:213–225, 1990.
[4] J. Fan and I. Gijbels. Local polynomial modeling and its applications. Chapman and Hall, 1996.
[5] C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[6] J. Fan and I. Gijbels. Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society, Series B, 57:371–395, 1995.
[7] S. Schaal and C. G. Atkeson. Assessing the quality of learned local models. In G. Tesauro, J. Cowan, and J. Alspector, editors, Advances in Neural Information Processing Systems, pages 160–167. Morgan Kaufmann, 1994.
[8] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[9] E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Advances in Neural Information Processing Systems 17. MIT Press, 2005.
[10] C. Atkeson and S. Schaal. Robot learning from demonstration. In Proceedings of the 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann, 1997.
[11] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11:11–73, April 1997.
[12] A. D'Souza, S. Vijayakumar, and S. Schaal. The Bayesian backfitting relevance vector machine. In Proceedings of the 21st International Conference on Machine Learning. ACM Press, 2004.
[13] A. Gelman, J. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 2000.
[14] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[15] Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In D. Saad and M. Opper, editors, Advanced Mean Field Methods - Theory and Practice. MIT Press, 2000.
[16] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
[17] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[18] V. Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13. MIT Press, 2000.
[19] A. M. Schmidt and A. O'Hagan. Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society, Series B, 65:745–758, 2003.
[20] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47:1–52, 1985.
[21] J. Peters and S. Schaal. Learning to control in operational space. International Journal of Robotics Research, 27:197–212, 2008.
[22] M. I. Jordan and D. E. Rumelhart. Internal world models and supervised learning. In Machine Learning: Proceedings of the Eighth International Workshop, pages 70–85. Morgan Kaufmann, 1991.
[23] Z. Ghahramani. Solving inverse problems using an EM approach to density estimation.
In Proceedings of the 1993 Connectionist Models Summer School, pages 316–323. Erlbaum Associates, 1994.