Causal discovery of linear acyclic models with arbitrary distributions

Patrik O. Hoyer, Aapo Hyvärinen
Helsinki Institute for Information Technology & Department of Computer Science, University of Helsinki, Finland

Richard Scheines, Peter Spirtes, Joseph Ramsey
Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, USA

Gustavo Lacerda
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA

Shohei Shimizu
Osaka University, Japan

Abstract

An important task in data analysis is the discovery of causal relationships between observed variables. For continuous-valued data, linear acyclic causal models are commonly used to model the data-generating process, and the inference of such models is a well-studied problem. However, existing methods have significant limitations. Methods based on conditional independencies (Spirtes et al. 1993; Pearl 2000) cannot distinguish between independence-equivalent models, whereas approaches based purely on Independent Component Analysis (Shimizu et al. 2006) are inapplicable to data which is partially Gaussian. In this paper, we generalize and combine the two approaches, yielding a method able to learn the model structure in many cases for which the previous methods provide answers that are either incorrect or not as informative as possible. We give exact graphical conditions for when two distinct models represent the same family of distributions, and empirically demonstrate the power of our method through thorough simulations.

1 INTRODUCTION

In much of science, the primary focus is on the discovery of causal relationships between quantities of interest. The randomized controlled experiment is geared specifically to inferring such relationships. Unfortunately, in many studies it is unethical, technically extremely difficult, or simply too expensive to conduct such experiments. In such cases causal discovery must be based on uncontrolled, purely observational data combined with prior information and reasonable assumptions. In cases in which the observed data is continuous-valued, linear acyclic models (also known as recursive Structural Equation Models) have been widely used in a variety of fields such as econometrics, psychology, sociology, and biology; for some examples, see (Bollen 1989).

In much of this work, the structure of the models has been assumed to be known or, at most, only a few different models have been compared. During the past 20 years, however, a number of methods have been developed to learn the model structure in an unsupervised way (Spirtes et al. 1993; Pearl 2000; Geiger and Heckerman 1994; Shimizu et al. 2006). Nevertheless, all approaches presented so far have either required distributional assumptions or have been overly restricted in the amount of structure they can infer from the data. In this contribution we show how to combine the strengths of existing approaches, yielding a method capable of inferring the model structure in many cases where previous methods give incorrect or uninformative answers.

The paper is structured as follows: Section 2 precisely defines the models under study, and Section 3 discusses existing methods for causal discovery of such models. In Section 4 we formalize the discovery problem and give exact theoretical results on identifiability. Then, in Section 5 we introduce and analyze a method termed PClingam, which combines the strengths of existing methods, overcomes some of their weaknesses, and is, in the limit, able to estimate all identifiable aspects of the underlying model. Section 6 provides empirical demonstrations of the power of our method. Finally, Section 7 maps out future work and Section 8 provides a summary of the main points of the paper.

2 LINEAR MODELS

[Figure: four example models, panels (a)–(d), over variables x, y, and z, with edge coefficients including 3 and −2.]

In this paper, we assume that the observed data has been generated by the following process:

1.
The observed variables x_i, i = 1, …, n, can be arranged in a causal order, such that no later variable causes any earlier variable. We denote such a causal order by k(i). That is, the generating process is recursive (Bollen 1989), meaning it can be represented graphically by a directed acyclic graph (DAG) (Pearl 2000; Spirtes et al. 1993).

2. The value assigned to each variable x_i is a linear function of the values already assigned to the earlier variables, plus a 'disturbance' (noise) term e_i, and plus an optional constant term c_i; that is,

    x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + c_i,    (1)
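As a concrete illustration of the generating process in equation (1), the following sketch samples data from a hypothetical three-variable model in the causal order x1 → x2 → x3. The coefficients b_ij, constants c_i, and disturbance distributions are illustrative choices, not taken from the paper; one disturbance is Gaussian and two are uniform, echoing the "partially Gaussian" setting the paper addresses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10000

# Disturbances e_i: one Gaussian and two non-Gaussian (uniform), so the
# data is only partially Gaussian. All distributional choices are
# illustrative assumptions for this sketch.
e1 = rng.normal(0.0, 1.0, n_samples)      # Gaussian disturbance
e2 = rng.uniform(-1.0, 1.0, n_samples)    # non-Gaussian disturbance
e3 = rng.uniform(-1.0, 1.0, n_samples)    # non-Gaussian disturbance

# Recursive generation in the causal order k: x1 before x2 before x3.
# Each x_i is a linear function of earlier variables plus e_i (+ c_i),
# exactly as in equation (1).
x1 = e1 + 0.5                              # no parents, c_1 = 0.5
x2 = 3.0 * x1 + e2                         # b_21 = 3
x3 = -2.0 * x2 + 1.0 * x1 + e3             # b_32 = -2, b_31 = 1

X = np.column_stack([x1, x2, x3])
print(X.shape)  # → (10000, 3)
```

Because the generation respects the causal order, the model is acyclic by construction: reordering the three assignment lines arbitrarily would not be possible without referencing a variable before it is assigned.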