Telling cause from effect based on high-dimensional observations

Dominik Janzing (dominik.janzing@tuebingen.mpg.de)
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

Patrik O. Hoyer (patrik.hoyer@helsinki.fi)
University of Helsinki, Gustaf Hällströmin katu 2b, FIN-00560 Helsinki, Finland

Bernhard Schölkopf (bernhard.schoelkopf@tuebingen.mpg.de)
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

Abstract

We describe a method for inferring linear causal relations among multi-dimensional variables. The idea is to use an asymmetry between the distributions of cause and effect that occurs if the covariance matrix of the cause and the structure matrix mapping the cause to the effect are chosen independently. The method applies to both stochastic and deterministic causal relations, provided that the dimensionality is sufficiently high (in some experiments, 5 was enough). It is applicable to Gaussian as well as non-Gaussian data.

1. Motivation

Inferring the causal relations that have generated statistical dependencies among a set of observed random variables is challenging if no controlled randomized studies can be made. Here, causal relations are represented as arrows connecting the variables, and the structure to be inferred is a directed acyclic graph (DAG).
There are several well-known approaches to this task, of which perhaps the most established is the independence-based approach (Pearl, 2000; Spirtes et al., 1993), built on the causal Markov condition and an assumption of faithfulness: the guiding principle is to accept only those causal DAGs that explain all of the observed dependencies in the data and, furthermore, explain only those dependencies, i.e., all inferred (marginal and conditional) independencies in the data are assumed to derive from the structure of the DAG.

Recently, several authors have proposed an entirely different route to causal discovery, based on ICA or (more generally) additive-noise models. These methods assume that the effects are given by some (possibly nonlinear) functions of the cause up to an additive noise term that is statistically independent of the cause (Kano & Shimizu, 2003; Shimizu et al., 2006; Hoyer et al., 2009). A recent proposal generalizes this model class by further allowing nonlinear transformations of the effect (Zhang & Hyvärinen, 2009).

Both families of methods have their relative advantages but also their limitations. Approaches based solely on conditional independencies cannot distinguish between causally distinct models that impose the same set of independences; in particular, they cannot infer whether X causes Y or Y causes X for just two observed variables X and Y (Mooij & Janzing, 2010). Methods based on additive-noise models fail for linear relationships with Gaussian noise. Finally, neither of the above approaches can deal with deterministic relationships between the observed variables.

In the present paper, we describe a method based on a recently proposed third principle. The idea is to reject the causal hypothesis X → Y whenever there are dependences between P(X) and P(Y|X) suggesting that P(X) and P(Y|X) were not generated by "independent mechanisms" of nature. Janzing & Schölkopf (2008) show examples illustrating how an independent choice of P(X) and P(Y|X) typically leads to joint distributions where P(Y) and P(X|Y) satisfy non-generic dependences, indicating that Y → X is not a plausible model. Based on an idea of Lemeire & Dirkx (2006), Janzing & Schölkopf (2008) express these dependences in terms of algorithmic information theory. Unfortunately, this leads to a criterion that cannot be used for practical purposes due to the uncomputability of Kolmogorov complexity.

In this contribution we provide an easily computable criterion for detecting dependences between P(X) and P(Y|X) for the case of two high-dimensional variables X and Y coupled by a linear causal mechanism. We show that the principle works even for multivariate Gaussian models, and also if the relation is deterministic, provided that the joint covariance matrix of X and Y is sufficiently anisotropic.

Before proceeding to describe our method, we should mention connections to Bayesian approaches to causal discovery (Heckerman et al., 1999). Such methods can, in principle and depending on the priors chosen, use any of the information relied upon by the above-mentioned three approaches. However, to date most Bayesian causal discovery methods have focused on conditional independence information. Furthermore, the fact that deterministic relationships exist in real-world settings shows that priors that are densities on the parameters of the Bayesian networks, as is usually assumed, are problematic, and the construction of good priors is difficult. Thus, rather than defining a prior explicitly, we will assume that it satisfies some symmetry constraints and show how this already leads to our inference rule (Theorem 1).

We start with two motivating examples. First, assume that X is a multivariate Gaussian variable with values in $\mathbb{R}^n$ and isotropic covariance matrix $\Sigma_{XX} = I$. Let Y be another $\mathbb{R}^n$-valued variable that is deterministically influenced by X via the linear relation Y = AX for some $n \times n$ matrix A. This induces the covariance matrix $\Sigma_{YY} = A \Sigma_{XX} A^T = A A^T$. The converse causal hypothesis Y → X becomes implausible because P(Y) (which is determined by the covariance matrix $A A^T$) and P(X|Y) (which is described by $X = A^{-1} Y$) are related in a suspicious way: the mechanism from Y to X seems to be adjusted to the distribution P(Y) because it exactly stretches the directions with small variance and shrinks the ones with large variance "in order" to obtain an isotropic output P(X). This can, indeed, be the result of a "designed" mechanism, but it is unlikely to be obtained by a simple process in nature having no feedback loops.[1]

[1] The argument that complex processes like evolution may be able to develop such intelligent system design is certainly correct, but this is a problem for all approaches to causal inference from non-experimental data.

The "atypical" relation between $\Sigma_{YY}$ and $A^{-1}$ can also be phrased in terms of symmetries: $A^{-1} \Sigma_{YY} A^{-T}$ (here we use the short notation $A^{-T} := (A^{-1})^T$) is surprisingly isotropic compared to the matrices obtained by applying $A^{-1}$ to $U \Sigma_{YY} U^T$ for a generic orthogonal transformation $U \in O(n)$.
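The following numpy sketch (our own illustration, not the authors' code package) makes this first example concrete: for an isotropic cause and a randomly drawn A, the backward mechanism $A^{-1}$ maps the actual output covariance $\Sigma_{YY}$ to a perfectly isotropic matrix, whereas it does not do so for a generically rotated copy $U \Sigma_{YY} U^T$. The isotropy score used here (ratio of geometric to arithmetic mean of the eigenvalues) is our own choice of summary statistic.

```python
import numpy as np

def isotropy(M):
    # Ratio of geometric to arithmetic mean of the eigenvalues; equals 1 iff M is a multiple of I.
    ev = np.linalg.eigvalsh(M)
    return np.exp(np.mean(np.log(ev))) / np.mean(ev)

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))              # structure matrix, drawn independently of Sigma_XX
Sigma_XX = np.eye(n)                         # isotropic cause
Sigma_YY = A @ Sigma_XX @ A.T                # covariance induced by Y = A X
A_inv = np.linalg.inv(A)

# Backward mechanism applied to the actual output covariance: perfectly isotropic result.
print(isotropy(A_inv @ Sigma_YY @ A_inv.T))              # 1.0 up to rounding

# Applied to a generically rotated copy of Sigma_YY, the output is far from isotropic.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))         # approximately Haar-distributed rotation
print(isotropy(A_inv @ U @ Sigma_YY @ U.T @ A_inv.T))    # typically much smaller than 1
```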
We will show below that this remains true with high probability (in high dimensions) if we start with an arbitrary covariance matrix $\Sigma_{XX}$ and apply a random linear transformation A chosen independently of $\Sigma_{XX}$.

To understand in what sense independent choices of $\Sigma_{XX}$ and A typically induce atypical relations between $A^{-1}$ and $\Sigma_{YY}$, we also discuss a second example where $\Sigma_{XX}$ and A are simultaneously diagonal, with $c_j$ and $a_j$ ($j = 1, \ldots, n$) as the corresponding diagonal entries. Then $\Sigma_{YY}$ is also diagonal and its diagonal entries (which equal its eigenvalues) are $a_j^2 c_j$. We now assume that "nature has chosen" the values $c_j$ independently from some distribution and the $a_j$ independently from some other distribution. We can then interpret the values $c_j$ as instances of n-fold sampling of a random variable c with expectation E(c), and likewise for the $a_j$. If we assume that a and c are independent, we have

  E(a^2 c) = E(a^2) \, E(c).    (1)

Due to the law of large numbers, this equation will for large n approximately be satisfied by the empirical averages, i.e.,

  \frac{1}{n} \sum_{j=1}^n a_j^2 c_j \;\approx\; \Big( \frac{1}{n} \sum_{j=1}^n a_j^2 \Big) \Big( \frac{1}{n} \sum_{j=1}^n c_j \Big).    (2)

For the backward direction Y → X we observe that the diagonal entries $\tilde{c}_j := a_j^2 c_j$ of $\Sigma_{YY}$ and the diagonal entries $\tilde{a}_j := a_j^{-1}$ of $\tilde{A} := A^{-1}$ have not been chosen independently, because $E(\tilde{a}^2 \tilde{c}) = E(a^{-2} a^2 c) = E(c)$, whereas

  E(\tilde{a}^2) \, E(\tilde{c}) = E(a^{-2}) \, E(a^2 c) = E(a^{-2}) \, E(a^2) \, E(c) > E(c).

The last inequality holds because the random variables $a^2$ and $a^{-2}$ are always negatively correlated: from the Cauchy-Schwarz inequality we obtain $E(a^2) E(a^{-2}) \geq 1$, with equality only in the trivial case where a is constant. We thus observe a systematic violation of (1) in the backward direction. The proof for non-diagonal matrices in Section 2 uses spectral theory but is based on the same intuition.
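A short simulation (our own sketch; the uniform distributions for a and c are arbitrary choices made for illustration) shows the effect numerically: the empirical version of eq. (1) holds in the forward direction but is systematically violated for the backward quantities $\tilde{a} = 1/a$ and $\tilde{c} = a^2 c$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100000
a = rng.uniform(0.5, 2.0, size=n)   # diagonal entries of A
c = rng.uniform(0.5, 2.0, size=n)   # diagonal entries of Sigma_XX

# Forward direction: empirical version of eq. (1)/(2) holds approximately.
print(np.mean(a**2 * c), np.mean(a**2) * np.mean(c))

# Backward direction: diagonal entries of Sigma_YY and of A^{-1}.
c_tilde = a**2 * c
a_tilde = 1.0 / a
print(np.mean(a_tilde**2 * c_tilde), np.mean(a_tilde**2) * np.mean(c_tilde))
# The product of the averages systematically exceeds the average of the product,
# reflecting E(a^{-2}) E(a^2) > 1 for non-constant a.
```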
The paper is structured as follows. In Section 2, we show that the above-mentioned atypical relations between covariance and structure matrices can be detected via a simple trace formula. In Section 3 we describe an algorithm that is based on this result, and in Section 4 we discuss experiments with simulated and real data. Section 5 proposes possible generalizations.

2. Identifiability results

Given a hypothetical linear causal model Y = AX + E (where X and Y are n- and m-dimensional, respectively), we want to check whether the pair $(\Sigma_{XX}, A)$ satisfies some relation that typical pairs $(U \Sigma_{XX} U^T, A)$ only satisfy with low probability if $U \in O(n)$ is randomly chosen. To this end, we introduce the renormalized trace $\tau_n(\cdot) := \mathrm{tr}(\cdot)/n$ for dimension n and compare the values

  \tau_m(A \Sigma_{XX} A^T) \quad \text{and} \quad \tau_n(\Sigma_{XX}) \, \tau_m(A A^T).    (3)

One easily shows that the expectations of both values coincide if $\Sigma_{XX}$ is randomly drawn from a distribution that is invariant under transformations $\Sigma_{XX} \mapsto U \Sigma_{XX} U^T$. This is because averaging the matrices $U \Sigma_{XX} U^T$ over all $U \in O(n)$ projects onto $\tau_n(\Sigma_{XX}) I$, since the average commutes with all matrices and is therefore a multiple of the identity. For our purposes, it is decisive that the typical case is close to this average, i.e., that the two expressions in (3) almost coincide. To show this, we need the following result (Ledoux, 2001):

Lemma 1 (Lévy's Lemma). Let $g: S^n \to \mathbb{R}$ be a Lipschitz continuous function on the n-dimensional sphere with Lipschitz constant

  L := \max_{\sigma \neq \sigma'} \frac{|g(\sigma) - g(\sigma')|}{\|\sigma - \sigma'\|}.

If a point $\sigma$ on $S^n$ is randomly chosen according to an O(n)-invariant prior, it satisfies

  |g(\sigma) - \bar{g}| \leq \epsilon

with probability at least $1 - \exp(-(n-1) \kappa \epsilon^2 / L^2)$ for some constant $\kappa$, where $\bar{g}$ can be interpreted as the median or the average of $g(\sigma)$.

Given the above Lemma, we can prove the following Theorem:

Theorem 1 (multiplicativity of traces). Let $\Sigma$ be a symmetric, positive definite $n \times n$ matrix and A an arbitrary $m \times n$ matrix. Let U be randomly chosen from O(n) according to the unique O(n)-invariant distribution (i.e., the Haar measure). Introducing the operator norm $\|B\| := \max_{\|x\|=1} \|Bx\|$, we have

  |\tau_m(A U \Sigma U^T A^T) - \tau_n(\Sigma) \, \tau_m(A A^T)| \leq 2 \epsilon \|\Sigma\| \, \|A A^T\|

with probability at least $q := 1 - \exp(-(n-1) \kappa \epsilon^2)$ for some constant $\kappa$ (independent of $\Sigma$, A, n, m, $\epsilon$).

Proof: For an arbitrary orthonormal system $(\psi_j)_{j=1,\ldots,m}$ we have

  \tau_m(A U \Sigma U^T A^T) = \frac{1}{m} \sum_{j=1}^m \langle \psi_j, A U \Sigma U^T A^T \psi_j \rangle.

We define the unit vectors $\sigma_j := U^T A^T \psi_j / \|A^T \psi_j\|$. Dropping the index j, we introduce the function $f(\sigma) := \langle \sigma, \Sigma \sigma \rangle$. For a randomly chosen $U \in O(n)$, $\sigma$ is a randomly chosen unit vector according to a uniform prior on the n-dimensional sphere $S^n$. The average of f is given by $\bar{f} = \tau_n(\Sigma)$, and the Lipschitz constant of f is given by twice the operator norm of $\Sigma$, i.e., $L = 2\|\Sigma\|$. An arbitrarily chosen $\sigma_j$ therefore satisfies

  |\langle \sigma_j, \Sigma \sigma_j \rangle - \tau_n(\Sigma)| \leq 2 \epsilon \|\Sigma\|

with probability $1 - \exp(-(n-1) \kappa \epsilon^2)$; this follows from Lemma 1 after replacing $\epsilon$ with $\epsilon L$. Multiplying by $\|A^T \psi_j\|^2 = \langle \psi_j, A A^T \psi_j \rangle$ hence yields

  |\langle \psi_j, A U \Sigma U^T A^T \psi_j \rangle - \tau_n(\Sigma) \langle \psi_j, A A^T \psi_j \rangle| \leq 2 \epsilon \|\Sigma\| \, \langle \psi_j, A A^T \psi_j \rangle \leq 2 \epsilon \|\Sigma\| \, \|A A^T\|.

Due to

  \tau_m(A U \Sigma U^T A^T) = \frac{1}{m} \sum_{j=1}^m \langle \psi_j, A U \Sigma U^T A^T \psi_j \rangle
  \quad \text{and} \quad
  \tau_m(A A^T) = \frac{1}{m} \sum_{j=1}^m \langle \psi_j, A A^T \psi_j \rangle,

we thus have

  |\tau_m(A U \Sigma U^T A^T) - \tau_m(A A^T) \, \tau_n(\Sigma)| \leq 2 \epsilon \|\Sigma\| \, \|A A^T\|.
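The concentration stated in Theorem 1 is easy to check numerically. In the following sketch (ours; rotations are drawn via QR decomposition of a Gaussian matrix, which is approximately Haar-distributed), the values $\tau_m(A U \Sigma U^T A^T)$ scatter tightly around the product $\tau_n(\Sigma) \, \tau_m(A A^T)$.

```python
import numpy as np

def tau(M):
    # Renormalized trace tr(M)/dim.
    return np.trace(M) / M.shape[0]

rng = np.random.default_rng(2)
n = m = 200
B = rng.standard_normal((n, n))
Sigma = B @ B.T                              # generic positive definite covariance
A = rng.standard_normal((m, n))

samples = []
for _ in range(200):
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))    # approximately Haar-distributed rotation
    samples.append(tau(A @ U @ Sigma @ U.T @ A.T))

print(np.mean(samples), np.std(samples))     # concentrates ...
print(tau(Sigma) * tau(A @ A.T))             # ... around the product of the traces
```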
It is convenient to introduce

  \Delta(\Sigma, A) := \log \tau_m(A \Sigma A^T) - \log \tau_n(\Sigma) - \log \tau_m(A A^T)

as a scale-invariant measure of the strength of the violation of equality between the two expressions in (3). We will also write $\Delta_{X \to Y}$ if $\Sigma$ is the covariance matrix of X and A the structure matrix defining the linear model from X to Y. Note that $\Delta$ vanishes for dimension one; our method is certainly not able to distinguish between cause and effect for just two one-dimensional variables. We will assume that $\Delta$ can be used to detect the causal direction because we expect $\Delta \approx 0$ for the correct one (due to Theorem 1). Certainly, $\Delta$ can also be close to zero for the wrong direction, but our theory and experiments suggest that this rarely happens.

First, we restrict attention to deterministic models Y = AX and the case where $n \leq m$ and A has rank n. This ensures that the backward model is also deterministic, i.e., $X = A^{-1} Y$, with $(\cdot)^{-1}$ denoting the pseudoinverse. The following theorem shows that $\Delta_{X \to Y} = 0$ then implies $\Delta_{Y \to X} \leq 0$:

Theorem 2 (violation of trace formula). Let n, m with $n \leq m$ denote the dimensions of X and Y, respectively. If Y = AX and $X = A^{-1} Y$, the covariance matrices satisfy

  \Delta_{X \to Y} + \Delta_{Y \to X} = -\log\big(1 - \mathrm{Cov}(Z, 1/Z)\big) + \log \frac{n}{m},    (4)

where Z is a real-valued random variable whose distribution is the empirical distribution of eigenvalues of $A A^T$, i.e., $\tau_m((A A^T)^k) = E(Z^k)$ for all $k \in \mathbb{Z}$.

Proof: We have

  \exp(\Delta_{Y \to X}) = \frac{\tau_n(\Sigma)}{\tau_m(A \Sigma A^T) \, \tau_n(A^{-1} A^{-T})}
  = \frac{1}{\tau_m(A A^T) \, \tau_n(A^{-1} A^{-T})} \cdot \frac{\tau_n(\Sigma) \, \tau_m(A A^T)}{\tau_m(A \Sigma A^T)}.    (5)

Using

  \tau_n(A^{-1} A^{-T}) = \frac{1}{n} \mathrm{tr}(A^{-T} A^{-1}) = \frac{1}{n} \mathrm{tr}\big((A A^T)^{-1}\big) = \frac{m}{n} \tau_m\big((A A^T)^{-1}\big)

and taking the logarithm, we obtain

  \Delta_{Y \to X} = \log \frac{1}{E(Z) \, E(1/Z)} + \log \frac{n}{m} - \Delta(\Sigma, A).

The statement then follows from $\mathrm{Cov}(Z, 1/Z) = 1 - E(Z) E(1/Z)$.

If $|\Delta_{X \to Y}|$ is smaller than the absolute value of the right-hand side of eq. (4), we obtain a non-trivial lower bound on $|\Delta_{Y \to X}|$. To show that this bound need not be small in high dimensions, consider the case n = m and a sequence of $n \times n$ random matrices $(A_n)$ whose eigenvalue distributions $Z_n$ converge to some distribution on $\mathbb{R}$ with non-zero variance. Then $-\log(1 - \mathrm{Cov}(Z, 1/Z))$ converges to some negative value.

We should, however, mention a problem that occurs for m > n in the noise-free case discussed here: since $\Sigma_{YY}$ has only rank n, we could equally well replace $A^{-1}$ with some other matrix $\hat{A}$ that coincides with $A^{-1}$ on all of the observed y-values. For those matrices $\hat{A}$, the value $\Delta$ can get closer to zero, because the term $\log(n/m)$ expresses the fact that the image of $\Sigma_{YY}$ is orthogonal to the kernel of $A^{-1}$, which is already an atypical relation.

It turns out that the observed violation of the multiplicativity of traces can be interpreted in terms of relative entropy distances. To show this, we need the following result:

Lemma 2 (anisotropy and relative entropy). Let $\Sigma$ be the covariance matrix of a centralized non-degenerate multivariate Gaussian distribution $P_\Sigma$ in n dimensions. Let the anisotropy of $\Sigma$ be defined as the relative entropy distance to the closest isotropic Gaussian,

  D(\Sigma) := \min_{Q \text{ isotropic}} D(P_\Sigma \| Q).

Then

  D(\Sigma) = \frac{1}{2} \big( n \log \tau_n(\Sigma) - \log \det(\Sigma) \big).    (6)

Proof: The relative entropy distance between two centralized Gaussians with covariance matrices $\Sigma$, $\Sigma_0$ in n dimensions is given by

  D(P_\Sigma \| P_{\Sigma_0}) = \frac{1}{2} \Big( \log \frac{\det \Sigma_0}{\det \Sigma} + \mathrm{tr}(\Sigma_0^{-1} \Sigma) - n \Big).

Setting $\Sigma_0 = \lambda I$, the distance is minimized for $\lambda = \tau_n(\Sigma)$, which yields eq. (6).

Straightforward computations show:

Theorem 3 (multiplicativity and relative entropy). Let $\Sigma$ and A be $n \times n$ matrices with $\Sigma$ positive definite. Then

  D(A \Sigma A^T) = D(\Sigma) + D(A A^T) + \frac{n}{2} \Delta(\Sigma, A).

Hence, for independently chosen A and $\Sigma$, the anisotropy of the output covariance matrix $A \Sigma A^T$ is approximately given by the anisotropy of $\Sigma$ plus the anisotropy of $A A^T$, the latter being the anisotropy of the output that A induces on an isotropic input. For the backward direction, the anisotropy is smaller than this typical value.

We now discuss an example with a stochastic relation between X and Y. Consider the general linear model

  Y = A X + E,

where A is an $m \times n$ matrix and E is a noise term (statistically independent of X) with covariance matrix $\Sigma_{EE}$. We obtain $\Sigma_{YY} = A \Sigma_{XX} A^T + \Sigma_{EE}$. The corresponding backward model[2] reads

  X = \tilde{A} Y + \tilde{E}, \quad \text{with} \quad \tilde{A} := \Sigma_{XY} \Sigma_{YY}^{-1}.

[2] For non-Gaussian X or E, this induces a joint distribution P(X, Y) that does not admit a linear backward model with an independent noise $\tilde{E}$; we can then only obtain uncorrelated noise. We could in principle already use this fact for causal inference (Kano & Shimizu, 2003). However, our method also works in the Gaussian case, and also if the dimension is too high for testing higher-order statistical dependences reliably.

Now we focus on the special case where A is an orthogonal transformation and E is isotropic, i.e., $\Sigma_{EE} = \lambda I$ with $\lambda > 0$. Then $\Delta_{Y \to X}$ is positive:

Lemma 3 (noisy case). Let Y = AX + E with $A \in O(n)$ and the covariance matrix of E given by $\Sigma_{EE} = \lambda I$. Then

  \Delta_{Y \to X} > 0.

Proof: We have $\Sigma_{YY} = A \Sigma A^T + \lambda I$ and $\Sigma_{YX} = A \Sigma$, with $\Sigma := \Sigma_{XX}$. Therefore,

  \tilde{A} = \Sigma A^T (A \Sigma A^T + \lambda I)^{-1} = \Sigma (\Sigma + \lambda I)^{-1} A^T.

One checks easily that the orthogonal transformation A is irrelevant for the traces, and we thus have

  \Delta_{Y \to X} = \log \frac{\tau\big(\Sigma^2 (\Sigma + \lambda I)^{-1}\big)}{\tau(\Sigma + \lambda I) \; \tau\big(\Sigma^2 (\Sigma + \lambda I)^{-2}\big)}
  = \log \frac{E\big(Z^2/(Z + \lambda)\big)}{E(Z + \lambda) \; E\big(Z^2/(Z + \lambda)^2\big)},

where Z is a random variable whose distribution reflects the distribution of eigenvalues of $\Sigma$. The function $z \mapsto z/(z + \lambda)$ is monotonically increasing for positive $\lambda$ and z, and thus so is $z \mapsto z^2/(z + \lambda)^2$. Hence $Z + \lambda$ and $Z^2/(Z + \lambda)^2$ are positively correlated, i.e.,

  E\big(Z^2/(Z + \lambda)\big) = E\big((Z + \lambda) \cdot Z^2/(Z + \lambda)^2\big) > E(Z + \lambda) \, E\big(Z^2/(Z + \lambda)^2\big)

for all distributions of Z with non-zero variance. Hence the logarithm is positive and thus $\Delta_{Y \to X} > 0$.

Theorem 2 and Lemma 3 show that independent choices of A and $\Sigma_{XX}$ can induce positive or negative values of $\Delta$ in the wrong direction. We therefore propose to prefer the causal direction for which $\Delta$ is closer to zero.

3. Inference algorithm

Motivated by the above theoretical results, we propose to infer the causal direction using Algorithm 1.[3]

[3] A code package implementing this algorithm (and reproducing all experiments reported in this paper) is available at http://www.kyb.tuebingen.mpg.de/causality/

Algorithm 1  Identifying linear causal relations via traces
 1: Input: $(x_1, y_1), \ldots, (x_k, y_k)$, threshold $\epsilon$
 2: Compute the estimators $\hat{\Sigma}_{XX}, \hat{\Sigma}_{XY}, \hat{\Sigma}_{YX}, \hat{\Sigma}_{YY}$
 3: Compute $A := \hat{\Sigma}_{YX} \hat{\Sigma}_{XX}^{-1}$
 4: Compute $\tilde{A} := \hat{\Sigma}_{XY} \hat{\Sigma}_{YY}^{-1}$
 5: if $|\Delta_{Y \to X}| > (1 + \epsilon) |\Delta_{X \to Y}|$ then
 6:   write "X is the cause"
 7: else
 8:   if $|\Delta_{X \to Y}| > (1 + \epsilon) |\Delta_{Y \to X}|$ then
 9:     write "Y is the cause"
10:   else
11:     write "cause cannot be identified"
12:   end if
13: end if

Here $\Delta_{X \to Y} := \Delta(\hat{\Sigma}_{XX}, A)$ and $\Delta_{Y \to X} := \Delta(\hat{\Sigma}_{YY}, \tilde{A})$.
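The following is a minimal numpy rendering of Algorithm 1, not the authors' published code package; the parameter names, the ridge term added to keep near-singular covariance estimates invertible, and the use of maximum-likelihood covariance estimators are our assumptions.

```python
import numpy as np

def tau(M):
    return np.trace(M) / M.shape[0]

def delta(Sigma, A):
    # Delta(Sigma, A) = log tau(A Sigma A^T) - log tau(Sigma) - log tau(A A^T)
    return np.log(tau(A @ Sigma @ A.T)) - np.log(tau(Sigma)) - np.log(tau(A @ A.T))

def infer_direction(X, Y, eps=0.2, ridge=1e-10):
    # X: (N, n) samples of the candidate cause, Y: (N, m) samples of the candidate effect.
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    N = X.shape[0]
    Sxx = Xc.T @ Xc / N + ridge * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / N + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / N                       # estimate of Sigma_XY (n x m)
    A       = Sxy.T @ np.linalg.inv(Sxx)      # Sigma_YX Sigma_XX^{-1}, structure matrix for X -> Y
    A_tilde = Sxy   @ np.linalg.inv(Syy)      # Sigma_XY Sigma_YY^{-1}, structure matrix for Y -> X
    d_xy, d_yx = delta(Sxx, A), delta(Syy, A_tilde)
    if abs(d_yx) > (1 + eps) * abs(d_xy):
        return "X is the cause", d_xy, d_yx
    if abs(d_xy) > (1 + eps) * abs(d_yx):
        return "Y is the cause", d_xy, d_yx
    return "cause cannot be identified", d_xy, d_yx
```

With eps = 0 the rule always makes a decision; larger values make it more conservative, mirroring the role of the threshold in Algorithm 1.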
In light of the theoretical results, the following issues have to be clarified by experiments with simulated data:

1. Is the limit of dimension going to infinity already justified for moderate dimensions?
2. Is the multiplicativity of traces sufficiently violated for noisy models? (Note that Lemma 3 only covers a special case.)

Furthermore, the following issue has to be clarified by experiments with real data:

3. Is the behaviour of real causal structures qualitatively sufficiently close to our model with independent choices of A and $\Sigma_{XX}$ according to an isotropic prior? And how large must we choose the threshold $\epsilon$ in Algorithm 1 to obtain reliable results?

4. Experiments

4.1. Simulated data

We generated random models Y = AX + E as follows. We independently draw each element of the $m \times n$ structure matrix A from a standardized Gaussian distribution; this implies that the distribution of column vectors as well as the distribution of row vectors is isotropic. To generate a random covariance matrix $\Sigma_{XX}$, we similarly draw an $n \times n$ matrix B and set $\Sigma_{XX} := B B^T$. Due to the invariance of our decision rule with respect to the scaling of A and $\Sigma_{XX}$, the structure matrix and the covariance can have the same scale without loss of generality. The covariance $\Sigma_{EE}$ of the noise is generated in the same way, although with an adjustable parameter $\sigma$ governing the scaling of the noise with respect to the signal: $\sigma = 0$ yields the deterministic setting, while $\sigma = 1$ equates the power of the noise to that of the signal.

First, we demonstrate the performance of the method in the close-to-deterministic setting ($\sigma = 0.05$) as a function of the dimensionality n = m of the simulations, ranging from dimension 2 to 50. To show that the method is feasible even with a relatively small number of samples, we choose the number of samples N to scale with the dimension as N = 2n. (Note that we must have $N \geq \min(n, m)$ to obtain invertible estimates of the covariance matrices.) The resulting proportion of correct vs wrong decisions is given in Fig. 1a, with the corresponding values of $\Delta$ in Fig. 1b. As can be seen, even at as few as 5 dimensions and 10 samples, the method is able to reliably identify the direction of causality in these simulations.

Figure 1. Simulation results. (a) Performance of the method as a function of the input dimensionality n, when the output dimensionality m = n and the sample size is N = 2n. The green curve (circles) denotes the fraction of simulations in which the true causal direction was selected, while the red curve (squares) gives the fraction of wrong answers. (b) Mean values of $\Delta$ corresponding to the true direction (green) vs the wrong direction (red). (c) Performance as a function of the noise level $\sigma$, for dimensionality n = m = 10 and sample size N = 1000. For comparison, the dashed lines give the performance based on the exact covariance matrices rather than on the samples. (d) Mean values of $\Delta$ corresponding to the true direction (green) vs the wrong direction (red). See main text for discussion.

To illustrate the degree to which identifiability is hampered by noise, the solid line in Fig. 1c gives the performance of the method for a fixed dimension (n = m = 10) and fixed sample size (N = 1000) as a function of the noise level $\sigma$. As can be seen, the performance drops markedly as $\sigma$ is increased. As soon as there is significantly more noise than signal (say, $\sigma > 2$), the number of samples is not sufficient to reliably estimate the required covariance matrices and hence the direction of causality. This is clear from the much better performance of the method when based on the exact, true covariance matrices, given by the dashed lines. In Fig. 1d we show the corresponding values of $\Delta$, from which it is clear that the estimate based on the samples is quite biased for the forward direction.
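As a sketch of the simulation protocol (our own reconstruction of Section 4.1; the exact convention for scaling the noise covariance by $\sigma$ is an assumption), the following uses the exact covariance matrices, corresponding to the dashed curves in Fig. 1c, and counts how often the forward direction yields the smaller $|\Delta|$.

```python
import numpy as np

def tau(M):
    return np.trace(M) / M.shape[0]

def delta(S, A):
    return np.log(tau(A @ S @ A.T)) - np.log(tau(S)) - np.log(tau(A @ A.T))

rng = np.random.default_rng(3)
n = m = 10
sigma = 0.05                                     # close-to-deterministic noise level
correct = 0
trials = 500
for _ in range(trials):
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, n)); Sxx = B @ B.T
    C = rng.standard_normal((m, m)); See = sigma**2 * (C @ C.T)   # noise scaling: our reading of Sec. 4.1
    Syy = A @ Sxx @ A.T + See
    A_tilde = (Sxx @ A.T) @ np.linalg.inv(Syy)   # backward structure matrix Sigma_XY Sigma_YY^{-1}
    d_xy, d_yx = delta(Sxx, A), delta(Syy, A_tilde)
    correct += abs(d_xy) < abs(d_yx)             # prefer the direction whose Delta is closer to zero
print(correct / trials)                          # fraction of runs preferring the true direction X -> Y
```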
The simulations point out at least one important issue for future work: the construction of unbiased estimators for the trace values or for $\Delta$. The systematic deviation of the sample-based experiments from the covariance-matrix-based experiments in Fig. 1c-d suggests that this could be a major improvement.

4.2. Handwritten digits

As experiments with real data with known ground truth, we chose $16 \times 16$ pixel images (so n = 256) of handwritten digits (LeCun et al., 1990). As the linear map A we used both random local translation-invariant linear filters and standard blurring of the images. (We added a small amount of noise to both the original and the processed images, to avoid problems with very close-to-singular covariances.) The task is then: given a sample of pairs $(x_j, y_j)$, each consisting of a picture $x_j$ and its processed counterpart $y_j$, infer which of the two sets of pictures, x or y, contains the originals ('causes'). By partitioning the image set by digit class (0-9), and by testing a variety of random filters (and the standard blur), we obtained a number of test cases to run our algorithm on. Out of the total of 100 tested cases, the method was able to correctly identify the set of original images 94 times, with 4 unknowns (i.e., only two falsely classified cases).
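The following sketch shows how a blurring operation can be written as a linear map A on vectorized $16 \times 16$ images; it is our own stand-in for the filters used in the paper, and the digit data (LeCun et al., 1990) is replaced by placeholder arrays, so it only illustrates the pipeline, not the reported results.

```python
import numpy as np

def blur_matrix(size=16, radius=1):
    # Linear operator on vectorized size x size images that replaces each pixel by the
    # mean of its neighbours within `radius` (a simple stand-in for the blurring filter).
    n = size * size
    A = np.zeros((n, n))
    for i in range(size):
        for j in range(size):
            neigh = [(i + di, j + dj)
                     for di in range(-radius, radius + 1)
                     for dj in range(-radius, radius + 1)
                     if 0 <= i + di < size and 0 <= j + dj < size]
            for k, l in neigh:
                A[i * size + j, k * size + l] = 1.0 / len(neigh)
    return A

rng = np.random.default_rng(4)
digits = rng.random((500, 256))         # placeholder only; the real experiment uses digit images
A = blur_matrix()
blurred = digits @ A.T + 0.01 * rng.standard_normal(digits.shape)   # small additive noise
# verdict, d_xy, d_yx = infer_direction(digits, blurred)   # using the sketch given after Algorithm 1
```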
4.3. Geographic position and precipitation

In the first experiment we took precipitation data from 4748 different locations in Germany.[4] The cause X was 3-dimensional and consisted of

  X = (altitude, longitude, latitude).

The effect Y was 12-dimensional and consisted of the average precipitation per month:

  Y = (prec. in Jan., ..., prec. in Dec.).

[4] http://www.dwd.de

Here we are faced with the following problem: the scale of $X_1$ is not comparable to those of $X_2$ and $X_3$, since the quantities are measured in different units. We have therefore renormalized all $X_j$ to unit variance. One may debate whether one should then also renormalize Y (even though all $Y_j$ refer to the same unit) in order to have a method that treats X and Y in the same way. We have done two runs, one with only X renormalized (top row of the table below) and one with both X and Y renormalized (bottom row):

  renormalizing X only:     $\Delta_{X \to Y} = -0.240$;  $\Delta_{Y \to X} = -2.25$
  renormalizing X and Y:    $\Delta_{X \to Y} = -0.278$;  $\Delta_{Y \to X} = -2.11$

The results qualitatively coincide for both versions, and our method prefers the correct causal direction X → Y. We will henceforth renormalize whenever a vector contains quantities of different units, even if some of the units coincide. We also tried the same experiment with a 3-dimensional precipitation vector Y containing only the months j to j+2, for all j = 1, ..., 10. In all these 10 experiments we also obtained correct results.

4.4. Weather and air pollution

In another experiment we used data on the relation between weather conditions and air pollution in Chemnitz, Germany, measured on 1440 days.[5] We defined a 3-dimensional vector of weather conditions

  X = (sin(wind direction), cos(wind direction), T),

where T is the air temperature, and a 6-dimensional vector of air pollution

  Y = (ozone, sulfur dioxide, dust, CO, NO2, NOx).

[5] http://www.mathe.tu-freiberg.de/Stoyan/Umweltdaten/chemnitz.txt

Clearly, the wind can blow the pollutants away from or towards the location of measurement, depending on its direction. Moreover, the production of ozone is known to be a problem of days with strong solar radiation. We therefore assume the ground truth to be X → Y. Here, we must obviously renormalize X and Y. We obtained the result

  $\Delta_{X \to Y} = -0.0504$;  $\Delta_{Y \to X} = -1.44$,

and thus prefer the correct causal direction. To exclude a systematic preference of the lower dimension as the cause, we also defined 3-dimensional vectors $(Y_j, Y_{j+1}, Y_{j+2})$ and obtained the correct result for all j = 1, ..., 4. The dataset also contained the variable wind strength, but it was discretized into 3 values; we have therefore dropped it because our method refers to continuous variables only. However, it is noteworthy that our method actually assumes that the data points are independently drawn from P(X, Y), rather than being part of a time series, as is the case here and in the following experiment.

4.5. Stock returns

We also ran an experiment on the relations between the daily stock returns in different countries, using the Yahoo finance database for 2394 days (from January 04, 2000 to March 10, 2009). We defined

  X := (SH, HSI, TWI, N225),

where the abbreviations stand for the stock returns of Shanghai (China), Hang Seng (Hong Kong), Taiwan, and Nikkei (Japan). Moreover, we combined the European indices

  Y := (FTSE, DAX, CAC),

corresponding to England, Germany, and France, respectively. Due to the different time zones, the European returns of the same day refer to a later time than the Asian ones. We therefore assume that the ground truth is X → Y. The results (without renormalizing the covariance matrices) are

  $\Delta_{X \to Y} = 0.408$;  $\Delta_{Y \to X} = 0.363$.

Here we infer the wrong direction, but the difference between the values is small; when applying a conservative decision rule (by setting the threshold appropriately), we would not make a decision here.
To get vectors of higher dimensions, we combined the returns of Europe and the USA (DJ and Nas), because both refer to a later time than the Asian ones:

  Y := (FTSE, DAX, CAC, DJ, Nas).

We obtain

  $\Delta_{X \to Y} = 0.312$;  $\Delta_{Y \to X} = 0.147$,

which also infers the wrong direction (with higher significance than before). The problem might be the following: the assumption that "the market chooses" $\Sigma_{XX}$ and A independently is questionable, since $\Sigma_{XX}$ is the result of complex market phenomena that also depend on what happened the day before (which was in turn influenced by A). A further exploration of the robustness of our method with respect to violations of the model assumptions must be left to future work.

Since the threshold $\epsilon$ in Algorithm 1 is hard to interpret, one may prefer a decision rule that is closer to standard statistical hypothesis testing: generate a large number of random orthogonal transformations U, apply them to X (while keeping A), and consider the distribution of the resulting values $\Delta_{UX \to Y}$. The observed value then defines a p-value, and we can infer directions by simply comparing p-values. We have done preliminary studies for some of the above examples and obtained qualitatively the same results.
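One reading of this resampling rule, sketched in numpy (our own; the two-sided comparison of $|\Delta|$ values and the QR-based rotations are our choices): the p-value is the fraction of rotated input covariances $U \Sigma_{XX} U^T$ whose $|\Delta|$ is at least as large as the observed one.

```python
import numpy as np

def tau(M):
    return np.trace(M) / M.shape[0]

def delta(S, A):
    return np.log(tau(A @ S @ A.T)) - np.log(tau(S)) - np.log(tau(A @ A.T))

def rotation_p_value(Sxx, A, num_rotations=1000, rng=None):
    # Fraction of Haar-random rotations U for which |Delta(U Sxx U^T, A)| is at least as large
    # as the observed |Delta(Sxx, A)|; a small value speaks against the hypothesis X -> Y.
    if rng is None:
        rng = np.random.default_rng()
    n = Sxx.shape[0]
    observed = abs(delta(Sxx, A))
    count = 0
    for _ in range(num_rotations):
        U, _ = np.linalg.qr(rng.standard_normal((n, n)))
        count += abs(delta(U @ Sxx @ U.T, A)) >= observed
    return count / num_rotations
```

Comparing rotation_p_value(Sxx, A) with rotation_p_value(Syy, A_tilde) then mirrors the comparison of the two $\Delta$ values in Algorithm 1.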
5. Outlook: generalizations of the method

In this section, we want to place our theoretical results in a more general context and state:

Postulate 1 (typicality for group orbits). Let X and Y be random variables with joint distribution P(X, Y), and let G be a group of transformations on the range of X. Each $g \in G$ defines a modified distribution of Y via

  P^{(g)}(Y) := \sum_x P(Y \,|\, g(x)) \, P(x).

Let $K(\cdot)$ be some real-valued function on the probability distributions of Y. The causal hypothesis X → Y is unlikely if K(P(Y)) is smaller or greater than for the large majority of the distributions $(P^{(g)}(Y))_{g \in G}$.

Our prior knowledge about the structure of the data set determines the appropriate choice of G. The idea is that G expresses a set of transformations that generate input distributions $P_g(X)$ that we consider equally likely. We have chosen $K(P(Y)) := \tau(\Sigma_{YY})$ and G = O(n), but the permutation of components of X also defines an interesting transformation group. For time series, the translation group would be the most natural choice. Interpreting this approach in a Bayesian way, we thus use symmetry properties of priors without the need to define the priors themselves explicitly.

6. Discussion

Our experiments with simulated data suggest that the method performs quite well already for moderate dimensions, provided that the noise level is not too high. Certainly, the model of drawing $\Sigma_{XX}$ according to a distribution that is invariant under $\Sigma_{XX} \mapsto U \Sigma_{XX} U^T$ may be inappropriate for many practical applications. Even though the example with diagonal matrices in Section 1 shows that the statement $\Delta \approx 0$ holds for a much broader class of models, it remains to be shown that most cause-effect pairs in the real world indeed satisfy $\Delta \approx 0$ for the true causal direction. Our studies with empirical data are still preliminary in this respect. It is possible that the method presented here only shows a future direction for the development of more mature causal inference algorithms. It would be interesting to know whether our method could serve as a sanity check for complex causal networks: consider, e.g., a causal DAG G connecting 2n real-valued variables. If $X_1, \ldots, X_{2n}$ is an ordering that is consistent with G, we define $Y := (X_1, \ldots, X_n)$ and $W := (X_{n+1}, \ldots, X_{2n})$ and test whether $\Delta_{Y \to W} \approx 0$.

References

Heckerman, D., Meek, C., and Cooper, G. A Bayesian approach to causal discovery. In Glymour, C. and Cooper, G. (eds.), Computation, Causation, and Discovery, pp. 141-165. MIT Press, Cambridge, MA, 1999.

Hoyer, P., Janzing, D., Mooij, J., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NIPS 2008), 2009.

Janzing, D. and Schölkopf, B. Causal inference using the algorithmic Markov condition. To appear in IEEE Transactions on Information Theory.

Kano, Y. and Shimizu, S. Causal inference using nonnormality. In Proceedings of the International Symposium on Science of Modeling, the 30th Anniversary of the Information Criterion, pp. 261-270, Tokyo, Japan, 2003.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396-404. Morgan Kaufmann, 1990.

Ledoux, M. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.

Lemeire, J. and Dirkx, E. Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/jan/, 2006.

Mooij, J. and Janzing, D. Distinguishing between cause and effect. JMLR Workshop and Conference Proceedings, 6:147-156, 2010.

Pearl, J. Causality. Cambridge University Press, 2000.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. J. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003-2030, 2006.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. Springer, New York, 1993.

Zhang, K. and Hyvärinen, A. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.