Nonlinear causal discovery with additive noise models

Patrik O. Hoyer (University of Helsinki, Finland), Dominik Janzing (MPI for Biological Cybernetics, Tübingen, Germany), Joris Mooij (MPI for Biological Cybernetics, Tübingen, Germany), Jonas Peters (MPI for Biological Cybernetics, Tübingen, Germany), Bernhard Schölkopf (MPI for Biological Cybernetics, Tübingen, Germany)

Abstract

The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data, linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that in fact the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities.

1 Introduction

Causal relationships are fundamental to science because they enable predictions of the consequences of actions [1]. While controlled randomized experiments constitute the primary tool for identifying causal relationships, such experiments are in many cases either unethical, too expensive, or technically impossible. The development of causal discovery methods to infer causal relationships from uncontrolled data constitutes an important current research topic [1, 2, 3, 4, 5, 6, 7, 8].

If the observed data is continuous-valued, methods based on linear causal models (aka structural equation models) are commonly applied [1, 2, 9]. This is not necessarily because the true causal relationships are really believed to be linear, but rather reflects the fact that linear models are well understood and easy to work with. A standard approach is to estimate a so-called Markov equivalence class of directed acyclic graphs (all representing the same conditional independencies) from the data [1, 2, 3]. For continuous variables, the independence tests often assume linear models with additive Gaussian noise [2]. Recently, however, it has been shown that for linear models, non-Gaussianity in the data can actually aid in distinguishing the causal directions and allow one to uniquely identify the generating graph under favourable conditions [7]. Thus the practical case of non-Gaussian data, which was long considered a nuisance, turned out to be helpful in the causal discovery setting.

In this contribution we show that nonlinearities can play a role quite similar to that of non-Gaussianity: when causal relationships are nonlinear, this typically helps break the symmetry between the observed variables and allows the identification of causal directions. As Friedman and Nachman have pointed out [10], non-invertible functional relationships between the observed variables can provide clues to the generating causal model. However, we show that the phenomenon is much more general; for nonlinear models with additive noise almost any nonlinearities (invertible or not) will typically yield identifiable models.
Note that other methods to select among Markov equivalent DAGs [11, 8] have (so far) mainly focussed on mixtures of discrete and continuous variables.

In the next section, we start by defining the family of models under study, and then, in Section 3, we give theoretical results on the identifiability of these models from non-interventional data. We describe a practical method for inferring the generating model from a sample of data vectors in Section 4, and show its utility in simulations and on real data (Section 5).

2 Model definition

We assume that the observed data has been generated in the following way: each observed variable x_i is associated with a node i in a directed acyclic graph G, and the value of x_i is obtained as a function of its parents in G, plus independent additive noise n_i, i.e.

x_i := f_i(x_{pa(i)}) + n_i ,   (1)

where f_i is an arbitrary function (possibly different for each i), x_{pa(i)} is a vector containing the elements x_j such that there is an edge from j to i in the DAG G, the noise variables n_i may have arbitrary probability densities p_{n_i}(n_i), and the noise variables are jointly independent, that is, p_n(n) = ∏_i p_{n_i}(n_i), where n denotes the vector containing the noise variables n_i. Our data then consists of a number of vectors x sampled independently, each using G, the same functions f_i, and the n_i sampled independently from the same densities p_{n_i}(n_i).

Note that this model includes the special case when all the f_i are linear and all the p_{n_i} are Gaussian, yielding the standard linear-Gaussian model family [2, 3, 9]. When the functions are linear but the densities p_{n_i} are non-Gaussian, we obtain the linear-non-Gaussian models described in [7].

The goal of causal discovery is, given the data vectors, to infer as much as possible about the generating mechanism; in particular, we seek to infer the generating graph G. In the next section we discuss the prospects of this task in the theoretical case where the joint distribution p_x(x) of the observed data can be estimated exactly. Later, in Section 4, we experimentally tackle the practical case of a finite-size data sample.

3 Identifiability

Our main theoretical results concern the simplest non-trivial graph: the case of two variables. The experimental results will, however, demonstrate that the basic principle works even in the general case of N variables.

Figure 1: Identification of causal direction based on constancy of conditionals. See main text for a detailed explanation of panels (a)-(f). Panel (g) shows an example of a joint density p(x, y) generated by a causal model x → y with y := f(x) + n, where f is nonlinear, the supports of the densities p_x(x) and p_n(n) are compact regions, and the function f is constant on each connected component of the support of p_x. The support of the joint density is now given by the two gray squares.
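To make the generative model of eq. (1) concrete, the following minimal sketch (ours, not the authors' code) draws i.i.d. samples from a two-variable additive noise model x → y with a nonlinear mechanism; the sample size, noise scale, and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_additive_noise_pair(n_samples=1000):
    """Draw i.i.d. samples from the causal model x -> y with y := f(x) + n,
    where the noise n is independent of the cause x (a special case of eq. (1))."""
    x = rng.normal(0.0, 1.0, n_samples)       # cause; any density is allowed
    noise = rng.normal(0.0, 0.5, n_samples)   # additive noise, drawn independently of x
    f = lambda v: v + v ** 3                  # nonlinear mechanism, as in Fig. 1(d-f)
    y = f(x) + noise
    return x, y

x, y = sample_additive_noise_pair()
```

For a general DAG G, one would draw the variables in a topological order, applying each mechanism f_i to the already-sampled parents and adding fresh, jointly independent noise n_i.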
Figure 1 illustrates the basic identifiability principle for the two-variable model. Denoting the two variables x and y, we consider the generative model y := f(x) + n, where x and n are both Gaussian and statistically independent. In panel (a) we plot the joint density p(x, y) of the observed variables, for the linear case f(x) = x. As a trivial consequence of the model, the conditional density p(y | x) has identical shape for all values of x and is simply shifted by the function f(x); this is illustrated in panel (b). In general, there is no reason to believe that this relationship would also hold for the conditionals p(x | y) for different values of y, but, as is well known, for the linear-Gaussian model this is actually the case, as illustrated in panel (c). Panels (d)-(f) show the corresponding joint and conditional densities for the corresponding model with a nonlinear function f(x) = x + x³. Notice how the conditionals p(x | y) look different for different values of y, indicating that a reverse causal model of the form x := g(y) + ñ (with y and ñ statistically independent) would not be able to fit the joint density. As we will show in this section, this will in fact typically, though not always, be the case.

To see the latter, we first show that there exist models other than the linear-Gaussian and the independent case which admit both a forward x → y and a backward y → x model. Panel (g) of Figure 1 presents a nonlinear functional model with additive non-Gaussian noise and non-Gaussian input distribution that nevertheless admits a backward model. The functions and probability densities can be chosen to be (arbitrarily many times) differentiable. (In fact, the input distribution p_x, the noise distribution p_n and f can be chosen such that the joint density is symmetrical with respect to the two variables, i.e. p(x, y) = p(y, x), making it obvious that there will also be a valid backward model.)

Note that the example of panel (g) in Figure 1 is somewhat artificial: p has compact support, and x, y are independent inside the connected components of the support. Roughly speaking, the nonlinearity of f does not matter since it occurs where p is zero, an artificial situation which is avoided by requiring strictly positive densities. From now on, we will therefore assume that all probability densities are strictly positive. Moreover, we assume that all functions (including densities) are three times differentiable. In this case, the following theorem shows that for generic choices of f, p_x(x), and p_n(n), there exists no backward model.

Theorem 1 Let the joint probability density of x and y be given by

p(x, y) = p_n(y − f(x)) p_x(x) ,   (2)

where p_n, p_x are probability densities on R. If there is a backward model of the same form, i.e.,

p(x, y) = p_ñ(x − g(y)) p_y(y) ,   (3)

then, denoting ν := log p_n and ξ := log p_x, the triple (f, p_x, p_n) must satisfy the following differential equation for all x, y with ν''(y − f(x)) f'(x) ≠ 0:

ξ''' = ξ''·( −ν'''f'/ν'' + f''/f' ) − 2ν''f''f' + ν'f''' + ν'ν'''f''f'/ν'' − ν'(f'')²/f' ,   (4)

where we have skipped the arguments y − f(x), x, and x for ν, ξ, and f, respectively. Moreover, if for a fixed pair (f, ν) there exists y ∈ R such that ν''(y − f(x)) f'(x) ≠ 0 for almost all x ∈ R, then the set of all p_x for which p has a backward model is contained in a 3-dimensional affine space.

Loosely speaking, the statement that the differential equation for ξ has a 3-dimensional space of solutions (while a priori, the space of all possible log-marginals is infinite-dimensional) amounts to saying that in the generic case, our forward model cannot be inverted.
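As an illustrative check (ours, not part of the paper), one can verify that the reversible linear-Gaussian family is consistent with eq. (4): for a linear mechanism and Gaussian marginals the log-densities are quadratic, so every term vanishes. The constants a, b, σ², τ², c₁, c₂ below are assumptions of this check.

```latex
% Check (not from the paper): the linear-Gaussian case satisfies eq. (4).
% Assumptions: f(x) = a x + b with a \neq 0; p_n, p_x Gaussian with variances \sigma^2, \tau^2.
\begin{align*}
  &f'' = f''' = 0, \qquad
  \nu(n) = -\tfrac{n^2}{2\sigma^2} + c_1 \;\Rightarrow\; \nu''' = 0, \qquad
  \xi(x) = -\tfrac{x^2}{2\tau^2} + c_2 \;\Rightarrow\; \xi''' = 0, \\
  &\text{so eq.\ (4) reduces to}\quad
  0 = \xi''\Big(\!-\tfrac{0 \cdot a}{\nu''} + \tfrac{0}{a}\Big)
      - 2\nu'' \cdot 0 \cdot a + \nu' \cdot 0
      + \tfrac{\nu' \cdot 0 \cdot 0 \cdot a}{\nu''} - \tfrac{\nu' \cdot 0^2}{a} = 0 .
\end{align*}
```

The equation holds identically, consistent with this case admitting a backward model; a nonzero f'' or ν''' instead turns (4) into a genuine constraint that confines ξ to a 3-dimensional solution space, as stated in the theorem.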
A simple corollary is that if both the marginal density p_x(x) and the noise density p_n(y − f(x)) are Gaussian, then the existence of a backward model implies linearity of f:

Corollary 1 Assume that ν''' = ξ''' = 0 everywhere. If a backward model exists, then f is linear.

Finally, we note that if f is linear, then the only model for which there is a backward model is the jointly Gaussian case. This is a stronger result than what was previously known and utilized [7], because it rules out the existence of a nonlinear backward model.

Corollary 2 Let p(x, y) have forward and backward models of the form (2) and (3) with linear, non-constant f. Then g is linear and p(x, y) is bivariate Gaussian.

All proofs are provided in the appendix. Although these results strictly only concern the two-variable case, there are strong reasons to believe that the general argument also holds for larger models. In this brief contribution we do not pursue any further theoretical results; rather, we show empirically that the estimation principle can be extended to networks involving more than two variables.

4 Model estimation

Section 3 established for the two-variable case that, given knowledge of the exact densities, the true model is (in the generic case) identifiable. We now consider practical estimation methods which infer the generating graph from sample data.

Again, we begin by considering the case of two observed scalar variables x and y. Our basic method is straightforward: first, test whether x and y are statistically independent. If they are not, we continue in the following manner: we test whether a model y := f(x) + n is consistent with the data, simply by doing a nonlinear regression of y on x (to get an estimate f̂ of f), calculating the corresponding residuals n̂ = y − f̂(x), and testing whether n̂ is independent of x. If so, we accept the model y := f(x) + n; if not, we reject it. We then similarly test whether the reverse model x := g(y) + n fits the data.

The above procedure will result in one of several possible scenarios. First, if x and y are deemed mutually independent, we infer that there is no causal relationship between the two, and no further analysis is performed. On the other hand, if they are dependent but both directional models are accepted, we conclude that either model may be correct but we cannot infer it from the data. A more positive result is when we are able to reject one of the directions and (tentatively) accept the other. Finally, it may be the case that neither direction is consistent with the data, in which case we conclude that the generating mechanism is more complex and cannot be described using this model.

This procedure can be generalized to an arbitrary number N of observed variables in the following way: for each DAG G_i over the observed variables, test whether it is consistent with the data by constructing a nonlinear regression of each variable on its parents, and subsequently testing whether the resulting residuals are mutually independent. If any independence test is rejected, G_i is rejected. On the other hand, if none of the independence tests are rejected, G_i is consistent with the data. The above procedure is obviously feasible only for very small networks (roughly N ≤ 7 or so) and also suffers from the problem of multiple hypothesis testing; an important future improvement would be to take this properly into account.
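The following Python sketch illustrates the pairwise procedure. It is our illustration, not the authors' implementation (which, as described in Section 5, uses GPML for the regression and an HSIC test with a gamma approximation to the null distribution); the scikit-learn regressor, the permutation-based p-value, and the helper names hsic_pvalue, fit_and_test, and infer_direction are assumptions of this sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def _gram(v, bw):
    # Gaussian-kernel Gram matrix for a 1-D sample.
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bw ** 2))

def hsic_pvalue(a, b, n_perm=200, seed=0):
    """Biased HSIC statistic with median-heuristic Gaussian kernels and a
    permutation p-value (a simple stand-in for the gamma approximation used in the paper)."""
    rng = np.random.default_rng(seed)
    n = len(a)
    bw_a = np.median(np.abs(a[:, None] - a[None, :])) + 1e-12
    bw_b = np.median(np.abs(b[:, None] - b[None, :])) + 1e-12
    K, L = _gram(a, bw_a), _gram(b, bw_b)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    stat = np.sum(Kc * L) / n ** 2            # equals trace(K H L H) / n^2
    null = np.array([np.sum(Kc * L[np.ix_(p, p)]) / n ** 2
                     for p in (rng.permutation(n) for _ in range(n_perm))])
    return float(np.mean(null >= stat))

def fit_and_test(cause, effect):
    """Nonlinearly regress effect on cause; return a p-value for independence
    of the residuals and the candidate cause."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(cause.reshape(-1, 1), effect)
    residuals = effect - gp.predict(cause.reshape(-1, 1))
    return hsic_pvalue(cause, residuals)

def infer_direction(x, y, alpha=0.02):
    """Pairwise version of the procedure: accept a direction iff its residuals
    are deemed independent of the candidate cause."""
    if hsic_pvalue(x, y) > alpha:
        return "x and y independent: no causal relationship inferred"
    accept_xy = fit_and_test(x, y) > alpha
    accept_yx = fit_and_test(y, x) > alpha
    if accept_xy and not accept_yx:
        return "accept x -> y"
    if accept_yx and not accept_xy:
        return "accept y -> x"
    if accept_xy and accept_yx:
        return "both directions accepted: cannot decide"
    return "neither direction accepted: model class too restrictive"
```

For example, on samples drawn from the sketch in Section 2, infer_direction(x, y) should typically return "accept x -> y".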
Furthermore, the N-variable algorithm described above returns all DAGs consistent with the data, including all those for which consistent subgraphs exist. Our current implementation removes any such unnecessarily complex graphs from the output.

The selection of the nonlinear regressor and of the particular independence tests is not constrained. Any prior information on the types of functional relationships or the distributions of the noise should optimally be utilized here. In our implementation, we perform the regression using Gaussian Processes [12] and the independence tests using kernel methods [13]. Note that one must take care to avoid overfitting, as overfitting may lead one to falsely accept models which should be rejected.

5 Experiments

To show the ability of our method to find the correct model when all the assumptions hold, we have applied our implementation to a variety of simulated and real data. For the regression, we used the GPML code from [14] corresponding to [12], using a Gaussian kernel and independent Gaussian noise, optimizing the hyperparameters for each regression individually. (The assumption of Gaussian noise is somewhat inconsistent with our general setting, where the residuals are allowed to have any distribution; we even prefer the noise to be non-Gaussian. In practice, however, the regression yields acceptable results as long as the noise is sufficiently similar to Gaussian noise. In case of significant outliers, other regression methods may yield better results.) In principle, any regression method can be used; we have verified that our results do not depend significantly on the choice of the regression method by comparing with ν-SVR [15] and with thin-plate spline kernel regression [16]. For the independence test, we implemented the HSIC [13] with a Gaussian kernel, where we used the gamma distribution as an approximation for the distribution of the HSIC under the null hypothesis of independence in order to calculate the p-value of the test result.

Simulations. The main results for the two-variable case are shown in Figure 2. We simulated data using the model y = x + bx³ + n; the random variables x and n were sampled from a Gaussian distribution and their absolute values were raised to the power p while keeping the original sign. The parameter b controls the strength of the nonlinearity of the function, b = 0 corresponding to the linear case. The parameter p controls the non-Gaussianity of the noise: p = 1 gives a Gaussian, while p > 1 and p < 1 produce super-Gaussian and sub-Gaussian distributions, respectively. We used 300 (x, y) samples for each trial and a significance level of 2% for rejecting the null hypothesis of independence of residuals and cause. For each b value (or p value) we repeated the experiment 100 times in order to estimate the acceptance probabilities.

Figure 2: Results of simulations (see main text for details): (a) the proportion of times the forward and the reverse model were accepted, p_accept, as a function of the non-Gaussianity parameter p (for b = 0), and (b) as a function of the nonlinearity parameter b (for p = 1).

Panel (a) shows that our method can solve the well-known linear but non-Gaussian special case [7]. By plotting the acceptance probability of the correct and the reverse model as a function of non-Gaussianity, we can see that when the distributions are sufficiently non-Gaussian the method is able to infer the correct causal direction.
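A minimal sketch of the simulation's data-generating process (ours; any sampling detail beyond those stated above, such as the seed, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(b, p, n_samples=300):
    """One simulation trial: y = x + b*x**3 + n, where x and n are Gaussian draws
    whose absolute values are raised to the power p while keeping the sign
    (p = 1: Gaussian; p > 1: super-Gaussian; p < 1: sub-Gaussian)."""
    def sign_power(v):
        return np.sign(v) * np.abs(v) ** p
    x = sign_power(rng.normal(size=n_samples))
    n = sign_power(rng.normal(size=n_samples))
    y = x + b * x ** 3 + n
    return x, y
```

Feeding 100 such draws per parameter value into the pairwise procedure sketched in Section 4 would estimate acceptance probabilities of the kind plotted in Figure 2.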
Then, in panel (b) we similarly demonstrate that we can identify the correct direction for the Gaussian marginal and Gaussian noise model when the functional relationship is sufficiently nonlinear. The results are consistent with Corollaries 2 and 1, respectively. Note in particular that the model is identifiable also for positive b, in which case the function is invertible, indicating that non-invertibility is not a necessary condition for identification.

We also did experiments for four variables w, x, y and z with a diamond-like causal structure (edges w → x, w → y, x → z and y → z). We took w ∼ U(−3, 3), x = w² + n_x with n_x ∼ U(−1, 1), y = 4√|w| + n_y with n_y ∼ U(−1, 1), and z = 2 sin x + 2 sin y + n_z with n_z ∼ U(−1, 1). We sampled 500 (w, x, y, z) tuples from the model and applied the algorithm described in Section 4 in order to reconstruct the DAG structure. The simplest DAG that was consistent with the data (with significance level 0.05 for each test) turned out to be precisely the true causal structure. All five other DAGs for which the true DAG is a subgraph were also consistent with the data.
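A minimal data-generating sketch for this four-variable experiment (ours; the seed and the helper name simulate_diamond are the only assumptions beyond the equations above):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_diamond(n_samples=500):
    """Sample from the diamond-structured model w -> x, w -> y, x -> z, y -> z."""
    u = lambda: rng.uniform(-1.0, 1.0, n_samples)   # U(-1, 1) noise terms
    w = rng.uniform(-3.0, 3.0, n_samples)
    x = w ** 2 + u()
    y = 4.0 * np.sqrt(np.abs(w)) + u()
    z = 2.0 * np.sin(x) + 2.0 * np.sin(y) + u()
    return np.column_stack([w, x, y, z])
```

Running the DAG-enumeration procedure of Section 4 on such a sample then amounts to nonlinearly regressing each variable on its candidate parents and testing whether the resulting residuals are mutually independent.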
Real-world data. Of particular empirical interest is how well the proposed method performs on real-world datasets, for which the assumptions of our method might hold only approximately. Due to space constraints we only discuss three real-world datasets here.

The first dataset, the "Old Faithful" dataset [17], contains data about the duration of an eruption and the time interval between subsequent eruptions of the Old Faithful geyser in Yellowstone National Park, USA. Our method obtains a p-value of 0.5 for the (forward) model "current duration causes next interval length" and a p-value of 4.4 × 10⁻⁹ for the (backward) model "next interval length causes current duration". Thus, we accept the model where the time interval between the current and the next eruption is a function of the duration of the current eruption, but reject the reverse model. This is in line with the chronological ordering of these events. Figure 3 illustrates the data, the forward and backward fit and the residuals for both fits. Note that for the forward model, the residuals seem to be independent of the duration, whereas for the backward model, the residuals are clearly dependent on the interval length. Time-shifting the data by one time step, we obtain for the (forward) model "current interval length causes next duration" a p-value smaller than 10⁻¹⁵ and for the (backward) model "next duration causes current interval length" a p-value of 1.8 × 10⁻⁸. Hence, our simple nonlinear model with independent additive noise is not consistent with the data in either direction.

Figure 3: The Old Faithful Geyser data: (a) forward fit corresponding to "current duration causes next interval length"; (b) residuals for forward fit; (c) backward fit corresponding to "next interval length causes current duration"; (d) residuals for backward fit.

The second dataset, the "Abalone" dataset from the UCI ML repository [18], contains measurements of the number of rings in the shell of abalone (a group of shellfish), which indicate their age, and the length of the shell. Figure 4 shows the results for a subsample of 500 datapoints. The correct model "age causes length" leads to a p-value of 0.19, while the reverse model "length causes age" comes with p < 10⁻¹⁵. This is in accordance with our intuition. Note that our method favors the correct direction although the assumption of independent additive noise is only approximately correct here; indeed, the variance of the length is dependent on age.

Figure 4: Abalone data: (a) forward fit corresponding to "age (rings) causes length"; (b) residuals for forward fit; (c) backward fit corresponding to "length causes age (rings)"; (d) residuals for backward fit.

Finally, we assay the method on a simple example involving two observed variables: the altitude above sea level (in meters) and the local yearly average outdoor temperature in centigrade, for 349 weather stations in Germany, collected over the time period 1961-1990 [19]. The correct model "altitude causes temperature" leads to p = 0.017, while "temperature causes altitude" can clearly be rejected (p = 8 × 10⁻¹⁵), in agreement with common understanding of causality in this case. The results are shown in Figure 5.

Figure 5: Altitude-temperature data: (a) forward fit corresponding to "altitude causes temperature"; (b) residuals for forward fit; (c) backward fit corresponding to "temperature causes altitude"; (d) residuals for backward fit.

6 Conclusions

In this paper, we have shown that the linear non-Gaussian causal discovery framework can be generalized to admit nonlinear functional dependencies as long as the noise on the variables remains additive. In this approach nonlinear relationships are in fact helpful rather than a hindrance, as they tend to break the symmetry between the variables and allow the correct causal directions to be identified. Although there exist special cases which admit reverse models, we have shown that in the generic case the model is identifiable. We have illustrated our method on both simulated and real-world datasets.

Acknowledgments

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. P.O.H. was supported by the Academy of Finland and by University of Helsinki Research Funds.

A Proof of Theorem 1

Set

π(x, y) := log p(x, y) = ν(y − f(x)) + ξ(x) ,   (5)

and ν̃ := log p_ñ, η := log p_y. If eq. (3) holds, then π(x, y) = ν̃(x − g(y)) + η(y), implying

∂²π/(∂x∂y) = −ν̃''(x − g(y)) g'(y)   and   ∂²π/∂x² = ν̃''(x − g(y)) .

We conclude

∂/∂x [ (∂²π/∂x²) / (∂²π/(∂x∂y)) ] = 0 .   (6)

Using eq. (5) we obtain

∂²π/(∂x∂y) = −ν''(y − f(x)) f'(x) ,   (7)

and

∂²π/∂x² = ∂/∂x ( −ν'(y − f(x)) f'(x) + ξ'(x) ) = ν''(f')² − ν'f'' + ξ'' ,   (8)

where we have dropped the arguments for convenience. Combining eqs. (7) and (8) yields

∂/∂x [ (∂²π/∂x²) / (∂²π/(∂x∂y)) ] = −2f'' + ν'f'''/(ν''f') + ν'ν'''f''/(ν'')² − ξ'''/(ν''f') − ν'(f'')²/(ν''(f')²) − ξ''ν'''/(ν'')² + ξ''f''/(ν''(f')²) .

Due to eq. (6) this expression must vanish, and we obtain DE (4) by term reordering. Given f, ν, we obtain for every fixed y a linear inhomogeneous DE for ξ:

ξ'''(x) = ξ''(x) G(x, y) + H(x, y) ,   (9)

where G and H are defined by

G := −ν'''f'/ν'' + f''/f'   and   H := −2ν''f''f' + ν'f''' + ν'ν'''f''f'/ν'' − ν'(f'')²/f' .

Setting z := ξ'' we have z'(x) = z(x) G(x, y) + H(x, y). Given that such a function z exists, it is given by

z(x) = z(x₀) exp(∫_{x₀}^{x} G(x̃, y) dx̃) + ∫_{x₀}^{x} exp(∫_{x̂}^{x} G(x̃, y) dx̃) H(x̂, y) dx̂ .   (10)
Let y be fixed such that ν''(y − f(x)) f'(x) ≠ 0 holds for almost all x. Then z is determined by z(x₀), since we can extend eq. (10) to the remaining points. The set of all functions ξ satisfying the linear inhomogeneous DE (9) is a 3-dimensional affine space: once we have fixed ξ(x₀), ξ'(x₀), ξ''(x₀) for some arbitrary point x₀, ξ is completely determined. Given fixed f and ν, the set of all ξ admitting a backward model is contained in this subspace.

B Proof of Corollary 1

Substituting ν''' = 0 into the definitions of G and H (see Appendix A), we obtain

G = f''/f'   and   H = −2ν''f''f' + ν'f''' − ν'(f'')²/f' .

Substituting this and ξ''' = 0 into (9), we obtain the equation

0 = ξ''f''/f' − 2ν''f''f' + ν'f''' − ν'(f'')²/f' .

Now, since ν''' ≡ 0, ν' is a linear function. Without loss of generality, we can assume ν' to have a root α, that is, ν'(α) = 0. Then, restricting ourselves to the submanifold {(x, y) ∈ R² : y = f(x) + α}, on which ν' = 0, we have 0 = ξ''f''/f' − 2ν''f''f', so either f'' = 0 or (f')² = ξ''/(2ν''), which means that f' is constant. In either case, f is linear.

C Proof of Corollary 2

Since f is linear, we know that f'' = f''' = 0. Plugging this into (4) yields, for all (x, y) such that ν''(y − f(x)) f'(x) ≠ 0:

ξ'''(x) = −ξ''(x) ν'''(y − f(x)) f'(x) / ν''(y − f(x)) .   (11)

We will show that ν''' = c₁ν'' for some constant c₁. Assume this is not the case. Then there are a, b such that ν''(a), ν''(b) ≠ 0 and ν'''(a)/ν''(a) ≠ ν'''(b)/ν''(b). Setting y = f(x) + a and y = f(x) + b, eq. (11) would then imply ξ''(x) = 0 for all x. This is a contradiction, because exp(c₂x + c₃) can never be proportional to a density. Therefore we have ν''' = c₁ν''. Note that, for the same reason as above, ν'' cannot be constantly zero. Only two solutions to this equation are left: the non-trivial one, ν''(z) = c₄ exp(c₁z), is not possible, because exp(c₄/c₁² · exp(c₁z) + c₅z + c₆) cannot be a density. Thus the trivial solution c₁ = 0 must be valid and we conclude ν''' ≡ 0. This implies ξ''' ≡ 0 and means that both n and x are normally distributed. It further follows that y is normally distributed as well. Because x is the sum of two independent random variables (g(y) and ñ), Cramér's theorem [20] tells us that g(y) and ñ are normally distributed, too. But because of Corollary 1, the Gaussian distributions of y and ñ imply that g has to be linear.

References

[1] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[2] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. (2nd ed. MIT Press, 2000).
[3] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proc. of the 10th Annual Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.
[4] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. In C. Glymour and G. F. Cooper, editors, Computation, Causation, and Discovery, pages 141–166. MIT Press, 1999.
[5] T. Richardson and P. Spirtes. Automated discovery of linear feedback models. In C. Glymour and G. F. Cooper, editors, Computation, Causation, and Discovery, pages 253–304. MIT Press, 1999.
[6] R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.
[7] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
[8] X. Sun, D. Janzing, and B. Schölkopf. Distinguishing between cause and effect via kernel-based complexity measures for conditional probability densities. Neurocomputing, pages 1248–1256, 2008.
[9] K. A. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.
[10] N. Friedman and I. Nachman. Gaussian process networks. In Proc. of the 16th Annual Conference on Uncertainty in Artificial Intelligence, pages 211–219, 2000.
[11] X. Sun, D. Janzing, and B. Schölkopf. Causal inference by choosing graphs with most plausible Markov kernels. In Proceedings of the 9th Int. Symp. on Artificial Intelligence and Mathematics, Fort Lauderdale, Florida, 2006.
[12] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[13] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
[14] GPML code. http://www.gaussianprocess.org/gpml/code.
[15] B. Schölkopf, A. J. Smola, and R. Williamson. Shrinking the tube: A new support vector regression algorithm. In Advances in Neural Information Processing Systems 11 (Proc. NIPS*1998). MIT Press, 1999.
[16] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.
[17] A. Azzalini and A. W. Bowman. A look at some data on the Old Faithful Geyser. Applied Statistics, 39(3):357–365, 1990.
[18] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[19] Climate data collected by the Deutscher Wetterdienst. http://www.dwd.de/.
[20] H. Cramér. Über eine Eigenschaft der normalen Verteilungsfunktion. Mathematische Zeitschrift, 41(1):405–414, December 1936.