On Sparse Nonparametric Conditional Covariance Selection

Mladen Kolar (mladenk@cs.cmu.edu)
Ankur P. Parikh (apparikh@cs.cmu.edu)
Eric P. Xing (epxing@cs.cmu.edu)
Machine Learning Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

Abstract

We develop a penalized kernel smoothing method for the problem of selecting non-zero elements of the conditional precision matrix, known as conditional covariance selection. This problem plays a key role in many modern applications, such as finance and computational biology; however, it has not yet been properly addressed. Our estimator is derived under minimal assumptions on the underlying probability distribution and works well in the high-dimensional setting. The efficiency of the algorithm is demonstrated both on simulation studies and on an analysis of the stock market.

1. Introduction

In recent years, with the advancement of large-scale data acquisition technology in various engineering, scientific, and socio-economic domains, the problem of estimating latent dependency structures underlying high-dimensional attributes has become a problem of remarkable algorithmic and theoretical interest in the machine learning and statistics community.
Although a vast and rich body of work on the so-called covariance selection problem can be found in the recent literature (Meinshausen & Bühlmann, 2006; Friedman et al., 2008b; Ravikumar et al., 2008; Banerjee et al., 2008), the current focus and progress seem to be restricted to simple scenarios in which the data-likelihood function belongs to a well-known parametric family, such as discrete Markov random fields or Gaussian graphical models, and the precision matrix is analyzed in isolation, without considering the effects of other covariates that represent environmental factors. In this paper, we investigate the problem of covariance selection under situations where such assumptions are violated.

Consider the problem of gene network inference in systems biology, which is of increasing importance in drug development and disease treatment. A gene network is commonly represented as a fixed network, with edge weights denoting the strength of associations between genes. Realistically, the strength of associations between genes can depend on many covariates, such as blood pressure, sugar levels, and other body indicators; however, biologists have very little knowledge of how various factors affect the strength of associations. Ignoring the influence of different factors leads to estimation procedures that overlook important subtleties of the regulatory networks.

Consider another problem in quantitative finance, where one wants to understand how different stocks are associated and how these associations vary with respect to external factors, in order to help investors construct a diversified portfolio. The rule of diversification, formalized by Modern Portfolio Theory (Markowitz, 1952), dictates that risk can be reduced by constructing a portfolio out of uncorrelated assets. However, it also assumes that the associations between assets are fixed (which is highly unrealistic); a more robust approach to modeling assets would take into account how their associations change with respect to economic indicators, such as gross domestic product (GDP), oil price, or inflation rate. Unfortunately, there is very little domain knowledge on the exact relationship between economic indicators and associations between assets, which motivates the problem of conditional covariance selection that we investigate in this paper.

Let X ∈ ℝ^p denote a p-dimensional random vector representing genes or stock values, and let Z ∈ ℝ denote an index random variable representing some body factor or economic indicator of interest. Both of the above problems in biology and finance can be modeled as inferring the non-zero partial correlations between different components of the random vector X conditioned on a particular value of the index variable Z = z. We assume that the values of the partial correlations change with z; however, the set of non-zero partial correlations is constant with respect to z. Let Σ(z) := Cov(X | Z = z) denote the conditional covariance of X given Z, which we assume to be positive definite, and let Ω(z) := Σ(z)^{-1} denote the conditional precision matrix. The structure of the non-zero components of the matrix Ω(z) tells us a lot about associations between different components of the vector X, since the elements of Ω(z) correspond to partial correlation coefficients. One of the challenges we address in this paper is how to select the non-zero components of Ω(z) from noisy samples.

Usually, very little is known about the relationship between the index variable Z and the associations between components of the random variable X; so, in this paper, we develop a nonparametric method for estimating the non-zero elements of Ω(z). Specifically, we develop a new method based on ℓ1/ℓ2-penalized kernel smoothing that is able to estimate the functional relationship between the index Z and the components of Ω(z) with minimal assumptions on the distribution of (X, Z) and only a smoothness assumption on z ↦ Ω(z). In addition to developing an estimation procedure that works with minimal assumptions, we also focus on statistical properties of the estimator in the high-dimensional setting, where the number of dimensions p is comparable to or even larger than the sample size. The ubiquity of high-dimensionality in real-world data forces us to carefully analyze statistical properties of the estimator that could otherwise be taken for granted in a low-dimensional setting.

Our problem setting, as stated above, should be distinguished from the classical problem of covariance selection, introduced in the seminal paper of Dempster (1972). In the classical setting, the main goal is to select the non-zero elements of the precision matrix; however, the precision matrix does not vary with respect to index variables. As mentioned before, the non-zero elements of the precision matrix correspond to partial correlation coefficients, which encode associations among sets of random variables. Due to its importance, the problem of covariance selection has drawn much attention from both the machine learning and statistics communities, which has led to remarkable progress on both computational (Friedman et al., 2008b; Banerjee et al., 2008; Duchi et al., 2008) and statistical (Yuan & Lin, 2007; Ravikumar et al., 2008; Peng et al., 2009; Rothman et al., 2008; Meinshausen & Bühlmann, 2006) issues. However, almost all of these works have been driven by the simplifying assumption of an invariant covariance structure.
Perhaps closest to the scenario we investigate in this paper are the recent works on estimating high-dimensional time-varying graphical models (Zhou et al., 2008; Kolar et al., 2009). While their work could fit into our framework, there are a few major differences. In both of those papers, the distribution of X is explicitly given, either as a multivariate Gaussian distribution or as a discrete distribution following an Ising model. Furthermore, time, which is considered to be the index variable in their work, is not random. Finally, the focus of Zhou et al. (2008) was on point-wise estimation of covariance and precision matrices, where the loss was measured in Frobenius norm, while the correct selection of the non-zero elements of the precision matrix was not investigated.

To the best of our knowledge, there are only a few references on nonparametric models for conditional covariance and precision matrices. Yin et al. (2008) develop a kernel estimator of the conditional covariance matrix based on the local-likelihood approach. Since their approach does not perform selection of the non-zero elements of the precision matrix, it is suitable only in low dimensions. Other related work includes nonparametric estimation of the conditional variance function in longitudinal studies (see Ruppert et al., 1997; Fan & Yao, 1998, and references therein). Our paper intends to fill this void in the literature.

In summary, here are the highlights of our paper. Our main contribution is a new nonparametric model for sparse conditional precision matrices, and the ℓ1/ℓ2-penalized kernel estimator for the proposed model. The estimation procedure is developed under minimal assumptions, with a focus on the high-dimensional setting, where the number of dimensions is potentially larger than the sample size. A modified Bayesian information criterion (BIC) is given that can be used to correctly identify the set of non-zero partial correlations. Finally, we demonstrate the performance of the algorithm on synthetic data and analyze the associations between stocks in the S&P 500 as a function of oil price.

2. The Model

Let X = (X_1, ..., X_p)^T ∈ ℝ^p be a p-dimensional random vector (representing gene expressions or stock values) and let the random variable Z ∈ [0, 1] be an associated univariate index (representing a body factor or an economic index). We will estimate associations between different components of X conditionally on Z. For simplicity of presentation, we assume that the index variable can be scaled into the interval [0, 1] and, furthermore, that it is a scalar variable. The kernel smoothing method, to be introduced, can easily be extended to multivariate Z. However, such an extension may only be practical in limited cases, due to the curse of dimensionality (Li & Liang, 2008).

Throughout the paper, we assume that E[X | Z = z] = 0 for all z ∈ [0, 1]. In practice, one can easily estimate the conditional mean of X given Z using local polynomial fitting (Fan, 1993) and subtract it from X. We denote the conditional covariance matrix of X given Z as Σ(z) := Cov(X | Z = z) = (σ_uv(z))_{u,v∈[p]}, where we use [p] to denote the set {1, ..., p}. Assuming that Σ(z) is positive definite for all z ∈ [0, 1], the conditional precision matrix is given by Ω(z) := Σ(z)^{-1} = (ω_uv(z))_{u,v∈[p]}. The elements ω_uv(z), u, v ∈ [p], are smooth but unknown functions of z.
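To make the model concrete, the following sketch (hypothetical code, not part of the original paper; the names toy_precision, partial_correlations, and sample_conditional are our own) builds a smoothly varying precision matrix Ω(z), reads off the implied partial correlations, and draws pairs (x^i, z^i) with X | Z = z Gaussian. The Gaussian sampler is used only for illustration; the estimator developed below does not assume Gaussianity.

```python
import numpy as np

def toy_precision(z, p=5):
    """Hypothetical smoothly varying precision matrix Omega(z), z in [0, 1]."""
    omega = np.zeros((p, p))
    omega[0, 1] = omega[1, 0] = 1.0                        # constant association
    omega[2, 3] = omega[3, 2] = 2.0 * z - 1.0              # linearly varying association
    omega[0, 4] = omega[4, 0] = np.sin(2.0 * np.pi * z)    # smooth periodic association
    np.fill_diagonal(omega, np.abs(omega).sum(axis=1) + 1.0)  # diagonal dominance => positive definite
    return omega

def partial_correlations(omega):
    """Partial correlations rho_uv = -omega_uv / sqrt(omega_uu * omega_vv); cf. Eq. (1) below."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def sample_conditional(n, p=5, seed=0):
    """Draw pairs (x^i, z^i) with Z ~ Uniform[0, 1] and X | Z = z ~ N(0, Omega(z)^{-1})."""
    rng = np.random.default_rng(seed)
    zs = rng.uniform(0.0, 1.0, size=n)
    xs = np.stack([rng.multivariate_normal(np.zeros(p), np.linalg.inv(toy_precision(z, p)))
                   for z in zs])
    return xs, zs
```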
With the notation introduced above, the problem of conditional covariance selection, e.g., recovering the strength of association between stocks as a function of oil price, or between gene expressions as a function of blood pressure, can be formulated as estimating the non-zero elements of the conditional precision matrix Ω(z). As mentioned before, the association between different components of X can be expressed using the partial correlation coefficients, which are directly related to the elements of the precision matrix as follows: the partial correlation ρ_uv(z) between X_u and X_v (u, v ∈ [p]) given Z = z can be computed as

\rho_{uv}(z) = -\frac{\omega_{uv}(z)}{\sqrt{\omega_{uu}(z)\,\omega_{vv}(z)}}.    (1)

The above equation confirms that the non-zero partial correlation coefficients can be selected by estimating the non-zero elements of the precision matrix. Let S := {(u, v) : \int_{[0,1]} \omega_{uv}(z)^2 \, dz > 0, u ≠ v} denote the set of non-zero partial correlation coefficients, which we assume to be constant with respect to z, i.e., we assume that the associations are fixed, but their strength can vary with respect to the index z. Furthermore, we assume that the number of non-zero partial correlation coefficients, s := |S|, is small. This is a reasonable assumption for many problems; e.g., in biological systems a gene usually interacts with only a handful of other genes.

In the following paragraphs, we relate the partial correlation coefficients to a regression problem and present a computationally efficient method for estimating the non-zero elements of the precision matrix based on this insight. For each component X_u (u ∈ [p]) we set up a regression model in which X_u is the response variable and all the other components are the covariates. Let X_{\u} := {X_v : v ≠ u, v ∈ [p]}. It is a well-known result (e.g., Lauritzen, 1996) that the partial correlation coefficients can be related to a regression model as follows:

X_u = \sum_{v \neq u} X_v\, b_{uv}(z) + \epsilon_u(z), \qquad u \in [p],    (2)

with ε_u(z) uncorrelated with X_{\u} if and only if

b_{uv}(z) = -\frac{\omega_{uv}(z)}{\omega_{uu}(z)} = \rho_{uv}(z)\sqrt{\frac{\omega_{vv}(z)}{\omega_{uu}(z)}}.    (3)

Observe that ρ_uv(z) = sign(b_uv(z)) \sqrt{b_{uv}(z)\,b_{vu}(z)}, which relates selection of the non-zero partial correlations to selection of covariates in the regression model (2). The relationship given in Eq. (3) has been used for estimation of high-dimensional Gaussian graphical models in Meinshausen & Bühlmann (2006); however, Eq. (3) holds for any distribution of (X, Z) and can be used for estimation of the non-zero elements of the conditional precision matrix.

Based on the above discussion, we propose a locally weighted kernel estimator of the non-zero partial correlations. Let D_n = {(x^i, z^i)}_{i∈[n]} be an independent sample of n realizations of (X, Z). For each u ∈ [p], we define the loss function

L_u(B_u; D_n) := \sum_{z \in \{z^j\}_{j \in [n]}} \sum_{i \in [n]} \Big( x^i_u - \sum_{v \neq u} x^i_v\, b_{uv}(z) \Big)^2 K_h(z - z^i) + 2\lambda \sum_{v \neq u} \|b_{uv}(\cdot)\|_2,    (4)

where B_u = (b_u(z^1), ..., b_u(z^n)) with b_u(z^j) ∈ ℝ^{p-1}, K_h(z - z^i) = K(|z - z^i| / h) is a symmetric density function with bounded support that defines the local weights, h denotes the bandwidth, λ is the penalty parameter, and \|b_{uv}(\cdot)\|_2 := \sqrt{\sum_{z \in \{z^j\}_{j \in [n]}} b_{uv}(z)^2}. Define B̂_u as a minimizer of the loss,

\hat{B}_u := \operatorname{argmin}_{B \in \mathbb{R}^{(p-1) \times n}} L_u(B; D_n).    (5)

Combining {B̂_u}_{u∈[p]} gives an estimator

\hat{S} := \{(u, v) : \max\{\|\hat{b}_{uv}(\cdot)\|_2, \|\hat{b}_{vu}(\cdot)\|_2\} > 0\}    (6)

of the non-zero elements of the precision matrix. In Eq. (4), the ℓ1/ℓ2 norm is used to penalize the model parameters. This norm is commonly used in the group lasso (Yuan & Lin, 2006).
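The following sketch (our own illustrative code, not the authors' implementation) evaluates the objective of Eq. (4) for a candidate coefficient matrix B_u; a box kernel stands in for whichever compactly supported symmetric kernel one prefers, and the factor 2λ follows the convention used above.

```python
import numpy as np

def box_kernel(t):
    """A symmetric density with bounded support (the box kernel on [-1, 1])."""
    return 0.5 * (np.abs(t) <= 1.0)

def loss_u(B_u, u, xs, zs, h, lam, kernel=box_kernel):
    """Objective of Eq. (4) for node u.

    B_u has shape (p - 1, n); column j is b_u(z^j), the coefficients of the
    regression of X_u on the remaining components at the j-th observed index value.
    """
    n, p = xs.shape
    others = [v for v in range(p) if v != u]
    K = kernel((zs[:, None] - zs[None, :]) / h)         # K[i, j] = K_h(z^j - z^i)
    resid = xs[:, [u]] - xs[:, others] @ B_u            # resid[i, j] = x_u^i - sum_v x_v^i b_uv(z^j)
    fit = np.sum(resid ** 2 * K)                        # double sum over grid points z^j and samples i
    penalty = 2.0 * lam * np.sum(np.linalg.norm(B_u, axis=1))  # l1/l2 penalty over the rows b_uv(.)
    return fit + penalty
```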
In our case, since we assume the set S of non-zero elements of the precision matrix to be fixed with respect to z, the ℓ2 norm is a natural way to shrink the whole group of coefficients {b_uv(z^i)}_{i∈[n]} to zero. Note that the group consists of the same element, say (u, v), of the precision matrix evaluated at different values of z. Our approach should be contrasted with the approach in Zhou et al. (2008) and Kolar et al. (2009), where the set of non-zero elements of the precision matrix changes with respect to time, in which case one cannot use the ℓ2 norm, which would make the set of non-zero elements constant over time; instead, the use of the ℓ1 norm is necessary. Under our assumptions, the use of the ℓ2 norm results in a more efficient procedure.

The kernel smoothing procedure described above is well justified under the assumption that the elements of Ω(z) are smooth but unknown functions of z. The loss function in Eq. (4), without the penalty term, is common in varying-coefficient regression models; there, however, relevant coefficients are selected using a generalized likelihood ratio test (Li & Liang, 2008).

We point out that the functions {b_uv(z)} are estimated only at the observed points {z^i}_{i∈[n]} and not on the whole support supp(Z) = [0, 1]. This is justified under the assumption that {z^i}_{i∈[n]} are sufficiently dense in [0, 1]. For example, if Z has a density f_Z(z) that is continuous and bounded away from zero on [0, 1], then the maximal distance between any two neighboring observations is O_P(log n / n) (Janson, 1987). Together with the assumption that {b_uv(z)} are smooth functions, this implies that the approximation error incurred when estimating the whole curve b_uv(z), z ∈ [0, 1], by the points {b̂_uv(z) : z ∈ {z^i}_{i∈[n]}} is of smaller order than the optimal nonparametric rate of convergence, O_P(n^{-2/5}). Finally, we note that the statistical efficiency of our procedure could be improved by exploiting the fact that the precision matrix is symmetric and jointly optimizing over {B_u}_{u∈[p]} (see Peng et al., 2009, for specific details in the estimation of multivariate Gaussian graphical models).

3. Optimization algorithm

In this section, we detail an efficient optimization algorithm that can be used to solve the problem given in Eq. (5). Since the optimization problem is convex, a variety of techniques can be used to solve it. A particularly efficient optimization algorithm has been devised for ℓ1/ℓ2-penalized problems; it is based on group-coordinate descent and is referred to as the active-shooting algorithm (Peng et al., 2009; Friedman et al., 2010).
A modification of the procedure, suitable for our objective, is outlined in Algorithm 1, which we now explain. We point out that the group coordinate descent will converge to an optimum, since the loss function is smooth and the penalty term in Eq. (4) decomposes across different rows of the matrix B_u (Friedman et al., 2010).

Algorithm 1: Procedure for solving Eq. (5)
Input: data D_n = {(x^i, z^i)}_{i∈[n]}, initial solution B̃_u^(0)
Output: solution B̂_u to Eq. (5)
1: A := {v ∈ [p] \ u : ||b̃_uv^(0)(·)||_2 > 0}, t = 0
2: repeat
3:   repeat {iterate over v ∈ A}
4:     compute {r_uv^i(z^j)}_{i,j∈[n]} using Eq. (8)
5:     if condition (9) is satisfied then
6:       b̃_uv(·) ← 0
7:     else
8:       b̃_uv(·) ← argmin L_u^v(b_uv(·); D_n)
9:     end if
10:  until convergence on A
11:  for all v ∈ [p] \ u, compute lines 4 through 9 once
12:  A := {v ∈ [p] \ u : ||b̃_uv(·)||_2 > 0}
13: until A did not change
14: B̂_u ← {b̃_uv(·)}_{v∈[p]\u}

Now, we derive the update for row v while keeping all other rows of B_u fixed. Let {b̃_uv(z^j)}_{j∈[n]} be a minimizer of

L_u^v(\{b_{uv}(z^j)\}_{j \in [n]}; D_n) := \sum_{z \in \{z^j\}_{j \in [n]}} \sum_{i \in [n]} \big( r^i_{uv}(z) - x^i_v\, b_{uv}(z) \big)^2 K_h(z - z^i) + 2\lambda \|b_{uv}(\cdot)\|_2,    (7)

where

r^i_{uv}(z) = x^i_u - \sum_{v' \neq u, v} x^i_{v'}\, \tilde{b}_{uv'}(z)    (8)

and {b̃_{uv'}(z)} denotes the current solution for all the other variables. Solving Eq. (7) iteratively, by cycling through the rows v ∈ [p] \ u, leads to an optimal solution B̂_u of Eq. (5). By analyzing the Karush-Kuhn-Tucker conditions of the optimization problem in Eq. (7), we can conclude that the necessary and sufficient condition for {b̃_uv(z^j)}_{j∈[n]} ≡ 0 is

\frac{1}{\lambda^2} \sum_{z \in \{z^j\}_{j \in [n]}} \Big( \sum_{i \in [n]} x^i_v\, r^i_{uv}(z)\, K_h(z - z^i) \Big)^2 \leq 1.    (9)

Eq. (9) gives a fast way to explicitly check whether row v of a solution is identically zero. Only if the condition in Eq. (9) is not satisfied do we need to find a minimizer of Eq. (7), which can be done by gradient descent, since the objective is differentiable when {b_uv(z^j)}_{j∈[n]} ≠ 0.

In practice, one needs to find a solution to Eq. (5) for a large number of penalty parameters λ. Computing solutions across a large set of possible values can be implemented effectively using the warm-start technique (Friedman et al., 2008a). In this technique, Eq. (5) is solved for a decreasing sequence of penalty parameters λ_1 > ... > λ_N, and the initial value B̃_u^(0) provided to Algorithm 1 for λ_i is the final solution B̂_u obtained for λ_{i-1}. Experimentally, this results in faster convergence and a more stable algorithm.
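As a rough illustration of the inner loop of Algorithm 1, the sketch below (illustrative code under our own naming; the plain fixed-step gradient loop stands in for whatever solver of Eq. (7) one prefers) forms the partial residuals of Eq. (8), applies the zero-check of Eq. (9), and otherwise takes gradient steps on the smooth-plus-ℓ2 objective.

```python
import numpy as np

def zero_check(r_uv, x_v, K, lam):
    """Condition (9): row v can be set to zero iff the kernel-weighted correlation
    between the partial residuals and X_v is small in l2 norm over the grid points."""
    g = (x_v[:, None] * r_uv * K).sum(axis=0)   # g[j] = sum_i x_v^i r_uv^i(z^j) K_h(z^j - z^i)
    return np.sum(g ** 2) / lam ** 2 <= 1.0

def update_row(B_u, v_idx, u, others, xs, K, lam, n_steps=200, lr=1e-3):
    """One pass of lines 4 through 9 of Algorithm 1 for a single row of B_u."""
    x_v = xs[:, others[v_idx]]
    rest = np.delete(np.arange(len(others)), v_idx)
    # Partial residuals of Eq. (8), holding every other row at its current value.
    r_uv = xs[:, [u]] - xs[:, [others[k] for k in rest]] @ B_u[rest]
    if zero_check(r_uv, x_v, K, lam):
        B_u[v_idx] = 0.0
        return B_u
    b = B_u[v_idx].copy()
    for _ in range(n_steps):                    # plain gradient descent on Eq. (7)
        resid = r_uv - x_v[:, None] * b[None, :]
        grad = -2.0 * (x_v[:, None] * resid * K).sum(axis=0)
        norm_b = np.linalg.norm(b)
        if norm_b > 0:
            grad = grad + 2.0 * lam * b / norm_b
        b = b - lr * grad
    B_u[v_idx] = b
    return B_u
```

A full pass of Algorithm 1 would cycle these row updates over the active set, then sweep all rows once, and repeat until the active set stabilizes; warm starts across a decreasing λ sequence simply reuse the previous solution as the initial value.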
4. Theoretical properties

In this section, we give some theoretical properties of the estimation procedure given in Section 2. These results are given for completeness and are presented without proofs, which will be reported elsewhere. In particular, we provide conditions under which there exists a set Ŝ = Ŝ(λ) of selected non-zero partial correlations that consistently estimates S, the true set of non-zero partial correlations. Observe that Ŝ depends on the penalty parameter λ, so it is of practical importance to correctly select the parameter λ for which Ŝ consistently recovers S. We give conditions under which the modified BIC criterion is able to identify the correct penalty parameter λ.

We start with general regularity conditions, which are standard in the literature (Fan & Huang, 2005; Wang & Xia, 2008):
(A1) there is an s > 2 such that E[||X||_2^{2s}] < ∞;
(A2) the density function f(z) of the random variable Z is bounded away from 0 on [0, 1] and has a bounded second-order derivative;
(A3) the matrix Σ(z) is positive definite for all z ∈ [0, 1] and its elements σ_uv(z) have bounded second derivatives;
(A4) the function E[||X||_2^4 | Z = z] is bounded;
(A5) the kernel K(·) is a symmetric density with compact support.
In addition to the standard regularity conditions, we need the following identifiability condition, which allows us to correctly identify the true model:
(A6) sup_{z∈[0,1]} max_{u≠v} |ρ_uv(z)| ≤ O(1/d), where d := max_{u∈[p]} |{v : (u, v) ∈ S}|.

Theorem 1. Assume that the regularity conditions (A1)-(A6) are satisfied. Furthermore, assume that E[exp(tX) | Z = z] ≤ exp(σ²t²/2) for all z ∈ [0, 1], t ∈ ℝ, and some σ ∈ (0, ∞). Let h = O(n^{-1/5}), λ = O(n^{7/10} \sqrt{\log p}) and λ n^{-9/5} → 0. If

\frac{n^{11/10}}{\sqrt{\log p}} \min_{(u,v) \in S} \|b_{uv}(\cdot)\|_2 \to \infty,

then P[Ŝ = S] → 1.

The assumption in Theorem 1 that X is a sub-Gaussian random variable is made for technical reasons. It is needed to establish exponential inequalities for the probability that each solution B̂_u of Eq. (5) correctly identifies the set of non-zero rows of B_u. Consistency of Ŝ can then be established by applying the union bound over the events that the estimators {B̂_u}_{u∈[p]} consistently estimate the non-zero rows of {B_u}_{u∈[p]}. For the last claim to hold when the dimension p is large, e.g., p = O(exp(n^c)) for some c > 0, we need good tail behavior of the distribution of X. The statement of the theorem still holds true without establishing exponential inequalities, but only for smaller dimensions. Another commonly used regularity condition on X is to assume that it is bounded with probability 1, which would again allow us to establish the exponential inequalities needed in the proof. Finally, we need to assume that, for (u, v) ∈ S, ||b_uv(·)||_2 does not decay to zero too quickly; otherwise, the corresponding element of the precision matrix would be too hard to distinguish from 0.

Next, we show that the correct penalty parameter λ can be chosen using the modified BIC criterion of Chen & Chen (2008). Denote by B̂_{u,λ} the solution of Eq. (5) obtained for the penalty parameter λ. We define the residual sum of squares as

RSS_u(\lambda) := n^{-2} \sum_{z \in \{z^j\}_{j \in [n]}} \sum_{i \in [n]} \Big( x^i_u - \sum_{v \neq u} x^i_v\, \hat{b}_{uv,\lambda}(z) \Big)^2 K_h(z - z^i)

and the BIC-type criterion

BIC_u(\lambda) = \log(RSS_u(\lambda)) + \frac{df_{u,\lambda}\,(\log(nh) + 2 \log p)}{nh},

where df_{u,λ} denotes the number of non-zero rows of B̂_{u,λ}. We use the modified version of the BIC criterion because the ordinary BIC criterion tends to include many spurious variables when the complexity of the model space is large (Chen & Chen, 2008). Now, λ is chosen by the minimization

\hat{\lambda} = \operatorname{argmin}_{\lambda} \sum_{u \in [p]} BIC_u(\lambda),    (10)

and the final estimator Ŝ = Ŝ(λ̂) of the non-zero components of the precision matrix is obtained by combining {B̂_{u,λ̂}}_{u∈[p]}. We have the following theorem.

Theorem 2. Assume that the conditions of Theorem 1 are satisfied. Then the tuning parameter λ̂ obtained by minimizing criterion (10) asymptotically identifies the correct model, i.e., P[Ŝ(λ̂) = S] → 1.
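In code, the criterion can be scored as in the sketch below (illustrative code with our own names; fits is assumed to map each candidate λ to the fitted matrices B̂_{1,λ}, ..., B̂_{p,λ}, and kernel is the same kernel used in the loss).

```python
import numpy as np

def bic_u(B_u_lam, u, xs, zs, h, kernel):
    """Modified BIC score for node u at one penalty level (RSS_u and BIC_u above)."""
    n, p = xs.shape
    others = [v for v in range(p) if v != u]
    K = kernel((zs[:, None] - zs[None, :]) / h)
    resid = xs[:, [u]] - xs[:, others] @ B_u_lam
    rss = np.sum(resid ** 2 * K) / n ** 2
    df = np.count_nonzero(np.linalg.norm(B_u_lam, axis=1))  # number of non-zero rows of B_u
    return np.log(rss) + df * (np.log(n * h) + 2.0 * np.log(p)) / (n * h)

def select_lambda(fits, xs, zs, h, kernel):
    """Eq. (10): pick the penalty minimizing the criterion summed over nodes.
    `fits` maps each candidate lambda to its fitted list [B_1, ..., B_p]."""
    scores = {lam: sum(bic_u(B_u, u, xs, zs, h, kernel) for u, B_u in enumerate(Bs))
              for lam, Bs in fits.items()}
    return min(scores, key=scores.get)
```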
5. Simulation results

5.1. Toy example

We first consider a small toy example in order to demonstrate our algorithm's performance. We draw n samples from the joint distribution of (X, Z), where the conditional distribution of X given Z = z is a 5-dimensional multivariate Gaussian with mean 0 and precision matrix Ω(z), and Z is uniformly distributed on [0, 1]. The set S = {(1, 2), (3, 4), (2, 4), (1, 5), (3, 5)} denotes the non-zero elements of Ω(z). We set the elements ω_uv(z) = ω_vu(z) = f_uv(z) for all (u, v) ∈ S, where the functions {f_uv(z)} are defined as follows: (1) f_{1,2} ≡ 1 (constant); (2) f_{3,4} ≡ 1 (constant); (3) f_{2,4}(z) = 1 if z ≤ 0.5 and -1 for z > 0.5 (piecewise constant); (4) f_{1,5}(z) = 2z - 1 (linear); (5) f_{3,5}(z) = sin(2πz) (sinusoid). The diagonal elements ω_uu(z), z ∈ [0, 1], are set to a constant such that Ω(z) is diagonally dominant, and hence positive definite.

We compared our method against the approach of Meinshausen & Bühlmann (2006) (referred to as MB), which assumes an invariant covariance matrix and ignores z, and against a simpler variant of our algorithm (called "kernel, ℓ1 penalty"), which replaces the group ℓ1/ℓ2 penalty in Eq. (4) with the ℓ1 penalty. Recall that the ℓ1 penalty does not encourage the set of non-zero elements in the precision matrix to remain fixed for all z ∈ [0, 1]. Our algorithm, developed in Section 2, is referred to as "kernel, group penalty".

We average our results over 100 random trials. For each trial, n = 300 samples are randomly generated using the procedure described above. We counted the number of times each of the (5 choose 2) = 10 possible off-diagonal elements of the precision matrix was selected as non-zero. Figure 1 displays the results as histograms. Bars 1-5 correspond to the true non-zero elements in S, as enumerated above, while bars 6-10 correspond to the elements that should be set to zero. Thus, in the ideal case, bars 1-5 should be estimated as non-zero for all 100 trials, while bars 6-10 should never be selected.

[Figure 1. Toy example results; panels: (a) ideal, (b) MB (static), (c) kernel + ℓ1 penalty, (d) kernel + group penalty. Each bar represents the number of times the corresponding precision matrix element was included in Ŝ. Performance of the ideal algorithm is shown in Figure 1(a). Our algorithm (Figure 1(d)) gets close to this, and far outperforms both the other methods.]

As we can see, all algorithms select the constant elements ω_12(·) (bar 1) and ω_34(·) (bar 2). However, the MB approach fails to recover the three varying precision matrix elements and also recovers many false elements. Just using the kernel + ℓ1 penalty, described above, performs better, but still selects many elements not in S. Our algorithm, on the other hand, selects all the elements in S almost all of the time, and also excludes the elements not in S the vast majority of the time. This higher precision is the result of our group penalty, and gives superior performance to just using an ℓ1 penalty (assuming that the set of non-zero partial correlation coefficients is fixed with respect to z).

5.2. Large simulations

We next tested our algorithm on a larger problem where X ∈ ℝ^64. The components of X were arranged into an 8x8 grid, so that only adjacent components in the grid have non-zero partial correlations. For all adjacent (u, v), ω_uv(z) = sin(2πz + c_uv), where c_uv ~ Unif([0, 1]) is a random offset. We measure how well the algorithm recovers the true set of non-zero precision matrix elements. Both MB and "kernel + ℓ1" perform much worse than our estimator, so we do not display their performance. Performance of the "kernel + group penalty" estimator is shown in Figure 2. Even though the problem is significantly harder, after 800 samples our algorithm achieves an F1 score above 0.9.

[Figure 2. Simulation results for the 8x8 grid: recall, precision, and F1 score as a function of the total number of samples (200-800). See Section 5.2 for details.]
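For completeness, the precision, recall, and F1 numbers reported in Figures 1 and 2 can be computed from the estimated edge set in the usual way; the helper below is our own, not part of the paper.

```python
def support_metrics(S_hat, S_true):
    """Precision, recall, and F1 of an estimated edge set against the true one.
    Both arguments are sets of unordered pairs (u, v) with u < v."""
    tp = len(S_hat & S_true)
    precision = tp / len(S_hat) if S_hat else 1.0
    recall = tp / len(S_true) if S_true else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```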
6. Analyzing the stock market

We next apply our method to analyzing relationships among stocks in the S&P 500. Such an analysis would be useful to an economist studying the effect of various indicators on the market, or to an investor who seeks to minimize risk by constructing a diverse portfolio according to Modern Portfolio Theory (Markowitz, 1952). Rather than assuming static associations among stocks, we believe it is more realistic to model them as a function of an economic indicator, such as oil price.

We acquired closing stock prices for all stocks in the S&P 500 (from http://www.finance.yahoo.com) and oil prices (from http://tonto.eia.doe.gov/) for all the days that the market was open from Jan 1, 2003 through Dec 31, 2005. This gave us 750 samples of 469 stocks (we only considered stocks that remained in the S&P 500 during the entire time period). Instead of considering the raw prices, which are often a reflection of other factors, such as the number of shares, we used the logarithm of the ratio of the price at time t to the price at time t - 1, and then subtracted the mean value and divided by the standard deviation for each stock. Our data consists of pairs {x^i, z^i}, the vector of standardized stock prices and the oil price, respectively, obtained over a period of time.
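A minimal sketch of this preprocessing step (our own illustrative code), assuming the closing prices are stored as a dense (days x stocks) array:

```python
import numpy as np

def standardized_log_returns(prices):
    """Convert a (days x stocks) array of closing prices into standardized log returns:
    r_t = log(p_t / p_{t-1}), then center and scale each stock separately."""
    r = np.log(prices[1:] / prices[:-1])
    return (r - r.mean(axis=0)) / r.std(axis=0)
```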
We analyze the data to recover the strength of associations between different stocks as a function of the oil price. Our belief is that each stock is associated with a small number of other stocks and that the set of associations is fixed over the time period of interest, although the strengths may change. We believe this is justified since we are looking for long-term trends among stocks and want to ignore transient effects.

Figure 3 illustrates the estimated network, where an edge between two nodes corresponds to a non-zero element in the precision matrix. Note that the presented network is not a representation of an undirected probabilistic graphical model. Clusters of related stocks are circled in Figure 3, and these largely confirm our intuition. Here are some of the stocks in a few of the clusters: (1) Technology/semiconductors: Hewlett-Packard, Intel, Teradyne, Analog Devices, etc.; (2) Oil/drilling/energy: Diamond Offshore Drilling, Baker Hughes, Halliburton, etc.; (3) Manufacturing: Alcoa, PPG Industries (coating products), International Paper Co., etc.; (4) Financial: American Express, Wells Fargo, Franklin Resources, etc. It is also interesting that there exist coherent subgroups inside these clusters. For example, the "Retail stores" sector could be further divided into companies that specialize in clothes, like Gap and Limited, and those that are more general-purpose department stores, like Wal-Mart and Target.

[Figure 3. Overall stock market network recovered by the algorithm. Edges in the graph correspond to non-zero elements in the precision matrix. The recovered network contains many clusters of related stocks. The green (and enlarged) hubs are described in the text.]

Another point of interest are two hubs (enlarged and highlighted in green in Figure 3) that connect a set of diverse stocks which do not easily categorize into an industrial sector. They correspond to JPMorgan Chase and Citigroup (two prominent financial institutions). It is possible that these stocks are good indicators of the status of the market or have certain investment portfolios that contribute to their central positions in the network.

Finally, we explore the evolving nature of our edge weights as a function of oil price to demonstrate the advantages over simply assuming static partial correlations. Recall that the edge weights vary with oil price and are proportional to the estimated partial correlation coefficients. Consider the two stocks Analog Devices (ADI), which makes signal processing solutions, and NVIDIA (NVDA), which makes graphics processing units. Ignoring the effect of the oil price, both of these companies are highly related since they belong to the semiconductor sector. However, if one analyzes the edge weights as a function of oil price, as shown in Figure 4 (a) and (b), both behave quite differently. This changing relationship is reflected by the varying strength of the edge weight between NVIDIA and Analog Devices, shown in Figure 4 (c). Note that when oil prices are low, the edge weight is high since Analog Devices and NVIDIA are both rising as a function of oil price. However, as oil prices increase, Analog Devices stabilizes while NVIDIA is more erratic (although it is mostly rising), so the edge weight sharply decreases. Thus, an investor aiming for diversification to reduce risk may be wary of investing in both of these stocks together when oil prices are low, since they are highly associated, but might consider it if oil prices are high and the stocks are less associated.

[Figure 4. Panels: (a) ADI versus oil price, (b) NVDA versus oil price, (c) the ADI-NVDA edge weight versus oil price. The changing edge weight between Analog Devices and NVIDIA (panel (c)) corroborates the fact that Analog Devices and NVIDIA behave quite differently as a function of oil price (panels (a) and (b)). In (a) and (b), the y-axis is the ratio of the stock price to its price on January 1, 2003.]

7. Discussion

We develop a new nonparametric estimator for the problem of high-dimensional conditional covariance selection. Elements of the precision matrix are related to the partial correlation coefficients, whose non-zero structure tells a lot about the associations between different components of the vector X. Our work is motivated by problems arising in biology and finance, where the associations between different variables change with respect to environmental factors. We believe that our method will help in extracting subtle information from noisy data that otherwise could not be found using standard methods for covariance selection, which analyze data in isolation, without considering various environmental factors.

Acknowledgements

This paper is based on work supported by ONR N000140910758, NSF IIS-0713379, NSF Career DBI-0546594, NIH 1 R01 GM078622-01, and an Alfred P. Sloan Research Fellowship to EPX.

References

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485-516, 2008.
Chen, J. and Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771, 2008.
Dempster, A. P. Covariance selection. Biometrics, 28(1):157-175, 1972.
Duchi, J., Gould, S., and Koller, D. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), 2008.
Fan, J. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, 21(1):196-216, 1993.
Fan, J. and Huang, T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli, 11(6):1031-1057, 2005.
Fan, J. and Yao, Q. Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3):645-660, 1998.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Technical report, Department of Statistics, Stanford University, 2008a.
Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008b.
Friedman, J., Hastie, T., and Tibshirani, R. A note on the group lasso and a sparse group lasso. Preprint, 2010.
Janson, S. Maximal spacings in several dimensions. The Annals of Probability, 15(1):274-280, 1987.
Kolar, M., Song, L., Ahmed, A., and Xing, E. P. Estimating time-varying networks. Annals of Applied Statistics (to appear), 2009.
Lauritzen, S. L. Graphical Models (Oxford Statistical Science Series). Oxford University Press, 1996.
Li, R. and Liang, H. Variable selection in semiparametric regression modeling. The Annals of Statistics, 36(1):261-286, 2008.
Markowitz, H. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.
Meinshausen, N. and Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.
Peng, J., Wang, P., Zhou, N., and Zhu, J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, 2009.
Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. 2008.
Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494, 2008.
Ruppert, D., Wand, M. P., and Holst, U. Local polynomial variance-function estimation. Technometrics, 39(3):262-273, 1997.
Wang, H. and Xia, Y. Shrinkage estimation of the varying coefficient model. Manuscript, 2008.
Yin, J., Geng, Z., Li, R., and Wang, H. Nonparametric covariance model. Statistica Sinica, forthcoming, 2008.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67, 2006.
Yuan, M. and Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
Zhou, S., Lafferty, J., and Wasserman, L. Time varying undirected graphs. In Servedio, R. A. and Zhang, T. (eds.), COLT, pp. 455-466. Omnipress, 2008.