Time Varying Undirected Graphs

Shuheng Zhou, John Lafferty and Larry Wasserman
Carnegie Mellon University
{szhou, lafferty}@cs.cmu.edu, larry@stat.cmu.edu

Abstract

Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using $\ell_1$ penalization methods. However, current methods assume that the data are independent and identically distributed. If the distribution, and hence the graph, evolves over time, then the data are no longer identically distributed. In this paper, we show how to estimate the sequence of graphs for non-identically distributed data, where the distribution evolves over time.

1 Introduction

Let $Z = (Z_1, \ldots, Z_p)^T$ be a random vector with distribution $P$. The distribution can be represented by an undirected graph $G = (V, F)$. The vertex set $V$ has one vertex for each component of the vector $Z$. The edge set $F$ consists of pairs $(j,k)$ that are joined by an edge. If $Z_j$ is independent of $Z_k$ given the other variables, then $(j,k)$ is not in $F$. When $Z$ is Gaussian, missing edges correspond to zeroes in the inverse covariance matrix $\Sigma^{-1}$.

Suppose we have independent, identically distributed data $D = (Z^1, \ldots, Z^t, \ldots, Z^n)$ from $P$. When $p$ is small, the graph may be estimated from $D$ by testing which partial correlations are not significantly different from zero [DP04]. When $p$ is large, estimating $G$ is much more difficult. However, if the graph is sparse and the data are Gaussian, then several methods can successfully estimate $G$; see [MB06, BGd08, FHT07, LF07, BL08, RBLZ07].

All these methods assume that the graphical structure is stable over time. But it is easy to imagine cases where such stability would fail. For example, $Z^t$ could represent a large vector of stock prices at time $t$. The conditional independence structure between stocks could easily change over time. Another example is gene expression levels. As a cell moves through its metabolic cycle, the conditional independence relations between proteins could change.

In this paper we develop a nonparametric method for estimating time varying graphical structure for multivariate Gaussian distributions using an $\ell_1$ regularization method. We show that, as long as the covariances change smoothly over time, we can estimate the covariance matrix well (in predictive risk) even when $p$ is large. We make the following theoretical contributions: (i) nonparametric predictive risk consistency and rate of convergence of the covariance matrices, (ii) consistency and rate of convergence in Frobenius norm of the inverse covariance matrix, (iii) large deviation results for covariance matrices for non-identically distributed observations, and (iv) conditions that guarantee smoothness of the covariances. In addition, we provide simulation evidence that we can recover graphical structure. We believe these are the first such results on time varying undirected graphs.

Acknowledgments. This research was supported in part by NSF grant CCF-0625879. SZ thanks Alan Frieze and Giovanni Leoni for helpful discussions on sparsity and smoothness of functions. We thank J. Friedman, T. Hastie and R. Tibshirani for making GLASSO publicly available, and anonymous reviewers for their constructive comments.

2 The Model and Method

Let $Z^t \sim N(0, \Sigma(t))$ be independent. It will be useful to index time as $t = 0, 1/n, 2/n, \ldots, 1$, and thus the data are $D_n = (Z^t : t = 0, 1/n, \ldots, 1)$. Associated with each $Z^t$ is its undirected graph $G(t)$.
Under the assumption that the law $\mathcal{L}(Z^t)$ of $Z^t$ changes smoothly, we estimate the sequence of graphs $G(t)$, $t = 0, 1/n, \ldots, 1$. The graph $G(t)$ is determined by the zeroes of $\Sigma(t)^{-1}$. This method can be used to investigate a simple time series model of the form
$$W^0 \sim N(0, \Sigma(0)), \qquad W^t = W^{t-1} + Z^t, \quad \text{where } Z^t \sim N(0, \Sigma(t)).$$
Ultimately, we are interested in the general time series model where the $Z^t$'s are dependent and the graphs change over time. For simplicity, however, we assume independence but allow the graphs to change. Indeed, it is the changing graph, rather than the dependence, that is the biggest hurdle to deal with.

In the iid case, recent work [BGd08, FHT07] has considered the $\ell_1$-penalized maximum likelihood estimator
$$\widehat{\Sigma}_n = \arg\min_{\Sigma \succ 0} \left\{ \mathrm{tr}(\Sigma^{-1} S_n) + \log|\Sigma| + \lambda |\Sigma^{-1}|_1 \right\}, \qquad (1)$$
where the minimization is over the entire set of positive definite matrices and $S_n$ is the sample covariance matrix. In the non-iid case our approach is to estimate $\Sigma(t)$ at time $t$ by
$$\widehat{\Sigma}_n(t) = \arg\min_{\Sigma \succ 0} \left\{ \mathrm{tr}(\Sigma^{-1} S_n(t)) + \log|\Sigma| + \lambda |\Sigma^{-1}|_1 \right\},$$
where
$$S_n(t) = \sum_s w_{st} Z_s Z_s^T \qquad (2)$$
is a weighted covariance matrix, with weights $w_{st} = K\!\left(\tfrac{|s-t|}{h_n}\right) \big/ \sum_{s'} K\!\left(\tfrac{|s'-t|}{h_n}\right)$ given by a symmetric nonnegative kernel function $K$ over time; in other words, $S_n(t)$ is just the kernel estimator of the covariance at time $t$. An attraction of this approach is that it can use existing software for covariance estimation in the iid setting. Note that one can choose to penalize only the off-diagonal elements of $\Sigma^{-1}$, as in [RBLZ07], if desired.

2.1 Notation

We use the following notation throughout the rest of the paper. For any matrix $W = (w_{ij})$, let $|W|$ denote the determinant of $W$ and $\mathrm{tr}(W)$ the trace of $W$. Let $\varphi_{\max}(W)$ and $\varphi_{\min}(W)$ be the largest and smallest eigenvalues, respectively. We write $W^+ = \mathrm{diag}(W)$ for the diagonal matrix with the same diagonal as $W$, and $W^- = W - W^+$. The matrix Frobenius norm is $\|W\|_F = \sqrt{\sum_i \sum_j w_{ij}^2}$, and the operator norm $\|W\|_2^2$ is given by $\varphi_{\max}(W W^T)$. We write $|\cdot|_1$ for the $\ell_1$ norm of a matrix vectorized, i.e., for a matrix $|W|_1 = \|\mathrm{vec}\, W\|_1 = \sum_i \sum_j |w_{ij}|$, and write $\|W\|_0$ for the number of nonzero entries in the matrix. We use $\Theta(t) = \Sigma^{-1}(t)$.
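To make the estimator concrete, the following sketch (ours, not the authors' implementation) computes the kernel-weighted covariance $S_n(t)$ of (2) and passes it to an off-the-shelf graphical lasso solver. It assumes scikit-learn's `graphical_lasso`, which solves the $\ell_1$-penalized Gaussian likelihood with a given covariance matrix (penalizing off-diagonal entries, as in the [RBLZ07]-style variant noted above); the Epanechnikov kernel, bandwidth `h`, and penalty `lam` are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.covariance import graphical_lasso  # assumed API: graphical_lasso(emp_cov, alpha)

def smoothed_covariance(Z, t, h):
    """Kernel-weighted covariance S_n(t) as in (2).

    Z : (n, p) array; row s is the observation at time s/(n-1) on [0, 1].
    t : time in [0, 1] at which to estimate the covariance.
    h : kernel bandwidth.
    """
    n, p = Z.shape
    times = np.linspace(0.0, 1.0, n)
    u = (times - t) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov kernel on [-1, 1]
    w = w / w.sum()                                             # normalized weights w_{st}
    return (Z * w[:, None]).T @ Z                               # sum_s w_{st} Z_s Z_s^T

def estimate_graph(Z, t, h=0.2, lam=0.1):
    """Estimate Sigma(t)^{-1} and the edge set of G(t) via l1-penalized likelihood."""
    S_t = smoothed_covariance(Z, t, h)
    cov, prec = graphical_lasso(S_t, alpha=lam)
    p = prec.shape[0]
    edges = {(i, j) for i in range(p) for j in range(i + 1, p) if abs(prec[i, j]) > 1e-8}
    return prec, edges
```

Any solver for the penalized problem (1) could be substituted for `graphical_lasso`; only the smoothed covariance $S_n(t)$ changes relative to the iid setting.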
3 Risk Consistency

In this section we define the loss and risk. Consider estimates $\widehat\Sigma_n(t)$ and $\widehat G_n(t) = (V, \widehat F_n)$. The first risk function is
$$U(G(t), \widehat G_n(t)) = \mathbb{E}\, L(G(t), \widehat G_n(t)), \qquad (3)$$
where $L(G(t), \widehat G_n(t)) = |F(t) \,\triangle\, \widehat F_n(t)|$, that is, the size of the symmetric difference between the two edge sets. We say that $\widehat G_n(t)$ is sparsistent if $U(G(t), \widehat G_n(t)) \to 0$ as $n \to \infty$.

The second risk is defined as follows. Let $Z \sim N(0, \Sigma_0)$ and let $\Sigma$ be a positive definite matrix. Let
$$R(\Sigma) = \mathrm{tr}(\Sigma^{-1}\Sigma_0) + \log|\Sigma|. \qquad (4)$$
Note that, up to an additive constant, $R(\Sigma) = -2\,\mathbb{E}_0(\log f_\Sigma(Z))$, where $f_\Sigma$ is the density of $N(0,\Sigma)$. We say that $\widehat\Sigma_n(t)$ is persistent [GR04] with respect to a class of positive definite matrices $\mathcal{S}_n$ if $R(\widehat\Sigma_n) - \min_{\Sigma \in \mathcal{S}_n} R(\Sigma) \xrightarrow{P} 0$. In the iid case, $\ell_1$ regularization yields a persistent estimator, as we now show.

The maximum likelihood estimate minimizes
$$R_n(\Sigma) = \mathrm{tr}(\Sigma^{-1}S_n) + \log|\Sigma|,$$
where $S_n$ is the sample covariance matrix. Minimizing $R_n(\Sigma)$ without constraints gives $\widehat\Sigma_n = S_n$. We would like to minimize $R_n(\Sigma)$ subject to $\|\Sigma^{-1}\|_0 \le L$. This would give the "best" sparse graph $G$, but it is not a convex optimization problem. Hence we estimate $\Sigma_n$ by solving the convex relaxation (1) instead. Algorithms for carrying out this optimization are given by [BGd08, FHT07]. Given $L_n$, $n \ge 1$, let
$$\mathcal{S}_n = \{\Sigma : \Sigma \succ 0,\ |\Sigma^{-1}|_1 \le L_n\}. \qquad (5)$$
We define the oracle estimator
$$\Sigma_*(n) = \arg\min_{\Sigma \in \mathcal{S}_n} R(\Sigma), \qquad (6)$$
and write (1) as
$$\widehat\Sigma_n = \arg\min_{\Sigma \in \mathcal{S}_n} R_n(\Sigma). \qquad (7)$$
We have the following result, whose proof appears in Section 3.2.

Theorem 1 Suppose that $p_n \le n^\xi$ for some $\xi \ge 0$ and $L_n = o\big((n/\log p_n)^{1/2}\big)$ in (5). Then for the sequence of empirical estimators defined in (7) and $\Sigma_*(n)$ as in (6),
$$R(\widehat\Sigma_n) - R(\Sigma_*(n)) \xrightarrow{P} 0.$$

3.1 Risk Consistency for the Non-identical Case

In the non-iid case we estimate $\Sigma(t)$ at time $t \in [0,1]$. Given $\Sigma(t)$, let
$$R_n(\Sigma(t)) = \mathrm{tr}(\Sigma(t)^{-1} S_n(t)) + \log|\Sigma(t)|,$$
where $S_n(t)$ is given in (2), with $K(\cdot)$ a symmetric nonnegative function with compact support. For a given $\ell_1$ bound $L_n$, we define $\widehat\Sigma_n(t)$ as the minimizer of $R_n(\Sigma)$ subject to $\Sigma \in \mathcal{S}_n$:
$$\widehat\Sigma_n(t) = \arg\min_{\Sigma \in \mathcal{S}_n} \left\{ \mathrm{tr}(\Sigma^{-1}S_n(t)) + \log|\Sigma| \right\}. \qquad (8)$$

Lemma 2 Let $\Sigma(t) = [\sigma_{jk}(t)]$. Suppose the following conditions hold:

A 1 The kernel function $K$ has bounded support $[-1,1]$.
1. There exist constants $C_0, C > 0$ such that $\max_{j,k}\sup_t |\sigma'_{jk}(t)| \le C_0$ and $\max_{j,k}\sup_t |\sigma''_{jk}(t)| \le C$.
2. $p_n \le n^\xi$ for some $\xi \ge 0$.
3. $h_n \asymp n^{-1/3}$.

Then $\max_{j,k} |S_n(t,j,k) - \sigma(t,j,k)| = O_P\!\big(\sqrt{\log n}/n^{1/3}\big)$ for all $t > 0$.

Proof: By the triangle inequality,
$$|S_n(t,j,k) - \sigma(t,j,k)| \le |S_n(t,j,k) - \mathbb{E} S_n(t,j,k)| + |\mathbb{E} S_n(t,j,k) - \sigma(t,j,k)|.$$
In Lemma 14 we show that $\max_{j,k}\sup_t |\mathbb{E} S_n(t,j,k) - \sigma(t,j,k)| = O(C_0 h_n)$. In Lemma 15, we show that
$$P\big(|S_n(t,j,k) - \mathbb{E} S_n(t,j,k)| > \epsilon\big) \le \exp\big(-c_1 n h_n \epsilon^2\big)$$
for some $c_1 > 0$. Hence
$$P\Big(\max_{j,k} |S_n(t,j,k) - \mathbb{E} S_n(t,j,k)| > \epsilon\Big) \le \exp\big(-n h_n(c_1\epsilon^2 - 2\xi\log n/(n h_n))\big), \qquad (9)$$
and $\max_{j,k}|S_n(t,j,k) - \mathbb{E} S_n(t,j,k)| = O_P\big(\sqrt{\log n/(n h_n)}\big)$. Hence the result holds for $h_n \asymp n^{-1/3}$.

With the use of Lemma 2, the proof of the following follows the same lines as that of Theorem 1.

Theorem 3 Suppose all conditions in Lemma 2 and the following hold:
$$L_n = o\big(n^{1/3}/\sqrt{\log n}\big). \qquad (10)$$
Then, for all $t > 0$, for the sequence of estimators as in (8),
$$R(\widehat\Sigma_n(t)) - R(\Sigma_*(t)) \xrightarrow{P} 0.$$

Remark 4 If a local linear smoother is substituted for the kernel smoother, the rate can be improved from $n^{1/3}$ to $n^{2/5}$, as the bias will be bounded as $O(h^2)$ in Section 3.1.

Remark 5 Suppose that for all $i,j$ with $\theta_{ij} \ne 0$ we have $\theta_{ij} = \Omega(1)$. Then Condition (10) allows $|\Theta|_1 \asymp L_n$; hence if $p = n^\xi$ with $\xi < 1/3$, we can have $\|\Theta\|_0 = \Omega(p)$. Hence the family of graphs for which we can guarantee persistency, although sparse, is likely to include connected graphs, for example when $\Omega(p)$ edges are formed randomly among $p$ nodes.

The smoothness condition in Lemma 2 is expressed in terms of the elements $\sigma_{ij}(t)$ of $\Sigma(t)$. It might be more natural to impose smoothness on $\Theta(t) = \Sigma(t)^{-1}$ instead. In fact, smoothness of $\Theta_t$ implies smoothness of $\Sigma_t$, as the next result shows. Let us first specify two assumptions. We use $\sigma_i^2(x)$ as a shorthand for $\sigma_{ii}(x)$.

Definition 6 For a function $u : [0,1] \to \mathbb{R}$, let $\|u\|_\infty = \sup_{x\in[0,1]} |u(x)|$.

A 2 There exists some constant $S_0 < \infty$ such that
$$\max_{i=1,\ldots,p}\ \sup_{t\in[0,1]} \sigma_i^2(t) \le S_0 < \infty, \qquad (11)$$
hence
$$\max_{i=1,\ldots,p} \|\sigma_i^2\|_\infty \le S_0. \qquad (12)$$

A 3 Let $\theta_{ij}(t)$, for all $i,j$, be twice differentiable functions such that $\theta'_{ij}(t) < \infty$ and $\theta''_{ij}(t) < \infty$ for all $t \in [0,1]$. In addition, there exist constants $S_1, S_2 < \infty$ such that
$$\sup_{t\in[0,1]} \sum_{k=1}^p\sum_{l=1}^p\sum_{i=1}^p\sum_{j=1}^p |\theta'_{ki}(t)\theta'_{lj}(t)| \le S_1, \qquad (13)$$
$$\sup_{t\in[0,1]} \sum_{k=1}^p\sum_{l=1}^p |\theta''_{kl}(t)| \le S_2, \qquad (14)$$
where the first inequality guarantees that $\sup_{t\in[0,1]} \sum_{k=1}^p\sum_{l=1}^p |\theta'_{kl}(t)| \le \sqrt{S_1} < \infty$.

Lemma 7 Denote the elements of $\Theta(t) = \Sigma(t)^{-1}$ by $\theta_{jk}(t)$. Under A 2 and A 3, the smoothness condition in Lemma 2 holds.

The proof is in Section 6. In Section 7, we show some preliminary results on achieving upper bounds on quantities that appear in Condition 1 of Lemma 2 through the sparsity level of the inverse covariance matrix, i.e., $\|\Theta_t\|_0$, $t \in [0,1]$.
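As an aside, the two risk functionals above are straightforward to evaluate numerically (for example, to produce predictive-risk curves like those in Section 8). A minimal sketch, with function names of our choosing:

```python
import numpy as np

def predictive_risk(Sigma, Sigma0):
    """Oracle risk R(Sigma) = tr(Sigma^{-1} Sigma0) + log|Sigma|, as in (4)."""
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    return np.trace(np.linalg.solve(Sigma, Sigma0)) + logdet

def empirical_risk(Sigma, S_t):
    """Empirical risk R_n(Sigma): the smoothed covariance S_n(t) replaces Sigma0."""
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    return np.trace(np.linalg.solve(Sigma, S_t)) + logdet
```

Comparing `predictive_risk` of an estimator against that of an oracle with the same $\ell_1$ norm is exactly the comparison plotted in Figure 1.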
3.2 Proof of Theorem 1

Note that for all $\Sigma \in \mathcal{S}_n$,
$$\sup_{\Sigma\in\mathcal{S}_n} |R(\Sigma) - R_n(\Sigma)| \le |\Sigma^{-1}|_1 \max_{j,k} |S_n(j,k) - \Sigma_0(j,k)| \le L_n \delta_n,$$
where it follows from [RBLZ07] that
$$\delta_n = \max_{j,k} |S_n(j,k) - \Sigma_0(j,k)| = O_P\big(\sqrt{\log p/n}\big).$$
Hence, minimizing over $\mathcal{S}_n$ with $L_n = o\big((n/\log p_n)^{1/2}\big)$, $\sup_{\Sigma\in\mathcal{S}_n}|R(\Sigma) - R_n(\Sigma)| = o_P(1)$.

By the definitions of $\Sigma_*(n) \in \mathcal{S}_n$ and $\widehat\Sigma_n \in \mathcal{S}_n$, we immediately have $R(\Sigma_*(n)) \le R(\widehat\Sigma_n)$ and $R_n(\widehat\Sigma_n) \le R_n(\Sigma_*(n))$; thus
$$0 \le R(\widehat\Sigma_n) - R(\Sigma_*(n)) = R(\widehat\Sigma_n) - R_n(\widehat\Sigma_n) + R_n(\widehat\Sigma_n) - R(\Sigma_*(n)) \le R(\widehat\Sigma_n) - R_n(\widehat\Sigma_n) + R_n(\Sigma_*(n)) - R(\Sigma_*(n)).$$
Using the triangle inequality and $\widehat\Sigma_n, \Sigma_*(n) \in \mathcal{S}_n$,
$$|R(\widehat\Sigma_n) - R(\Sigma_*(n))| \le |R(\widehat\Sigma_n) - R_n(\widehat\Sigma_n)| + |R_n(\Sigma_*(n)) - R(\Sigma_*(n))| \le 2\sup_{\Sigma\in\mathcal{S}_n} |R(\Sigma) - R_n(\Sigma)|.$$
Thus for all $\epsilon > 0$, the event $\{R(\widehat\Sigma_n) - R(\Sigma_*(n)) > \epsilon\}$ is contained in the event $\{\sup_{\Sigma\in\mathcal{S}_n}|R(\Sigma) - R_n(\Sigma)| > \epsilon/2\}$. Hence, for $L_n = o((n/\log n)^{1/2})$ and every $\epsilon > 0$, as $n \to \infty$,
$$P\big(R(\widehat\Sigma_n) - R(\Sigma_*(n)) > \epsilon\big) \le P\Big(\sup_{\Sigma\in\mathcal{S}_n}|R(\Sigma) - R_n(\Sigma)| > \epsilon/2\Big) \to 0.$$

4 Frobenius Norm Consistency

In this section, we show an explicit convergence rate in the Frobenius norm for estimating $\Theta(t)$, for all $t$, where $p$ and $|F|$ grow with $n$, so long as the covariances change smoothly over $t$. Note that certain smoothness assumptions on a matrix $W$ guarantee the corresponding smoothness conditions on its inverse $W^{-1}$, so long as $W$ is non-singular, as we show in Section 6.

We first write our time-varying estimator $\widehat\Theta_n(t)$ for $\Theta(t) = \Sigma^{-1}(t)$ at time $t \in [0,1]$ as the minimizer of the $\ell_1$-regularized negative smoothed log-likelihood over the entire set of positive definite matrices,
$$\widehat\Theta_n(t) = \arg\min_{\Theta \succ 0} \left\{ \mathrm{tr}(\Theta S_n(t)) - \log|\Theta| + \lambda_n |\Theta|_1 \right\}, \qquad (15)$$
where $\lambda_n$ is a non-negative regularization parameter, and $S_n(t)$ is the smoothed sample covariance matrix using a kernel function as defined in (2).

Now fix a point of interest $t_0$. In the following, we use $\Sigma_0 = (\sigma_{ij}(t_0))$ to denote the true covariance matrix at this time. Let $\Theta_0 = \Sigma_0^{-1}$ be its inverse matrix. Define the set $S = \{(i,j): \theta_{ij}(t_0) \ne 0,\ i \ne j\}$. Then $|S| = s$. Note that $|S|$ is twice the number of edges in the graph $G(t_0)$. We make the following assumption.

A 4 Let $p + s = o\big(n^{2/3}/\log n\big)$ and $\varphi_{\min}(\Sigma_0) \ge \underline{k} > 0$, hence $\varphi_{\max}(\Theta_0) \le 1/\underline{k}$. For some sufficiently large constant $M$, let $\varphi_{\min}(\Theta_0) \ge 2M\sqrt{(p+s)\log n/n^{2/3}}$.

Theorem 8 Let $\widehat\Theta_n(t)$ be the minimizer defined by (15). Suppose all conditions in Lemma 2 and A 4 hold, and $\lambda_n \asymp \sqrt{\log n/n^{2/3}}$. Then
$$\|\widehat\Theta_n(t) - \Theta_0\|_F = O_P\!\left(M\sqrt{\frac{(p+s)\log n}{n^{2/3}}}\right). \qquad (16)$$

The proof draws upon techniques from [RBLZ07], with modifications necessary to handle the fact that we penalize $|\Theta|_1$ rather than $|\Theta^-|_1$ as in their case.

Proof: Let
$$Q(\Theta) = \mathrm{tr}(\Theta S_n(t_0)) - \log|\Theta| + \lambda|\Theta|_1 - \big(\mathrm{tr}(\Theta_0 S_n(t_0)) - \log|\Theta_0| + \lambda|\Theta_0|_1\big)$$
$$= \mathrm{tr}\big((\Theta - \Theta_0)(S_n(t_0) - \Sigma_0)\big) - \big(\log|\Theta| - \log|\Theta_0|\big) + \mathrm{tr}\big((\Theta - \Theta_0)\Sigma_0\big) + \lambda\big(|\Theta|_1 - |\Theta_0|_1\big). \qquad (17)$$
$\widehat\Theta_n(t_0)$ minimizes $Q(\Theta)$, or equivalently $\widehat\Delta_n = \widehat\Theta_n(t_0) - \Theta_0$ minimizes $G(\Delta) \equiv Q(\Theta_0 + \Delta)$. Hence $G(0) = 0$ and $G(\widehat\Delta_n) \le G(0) = 0$ by definition. Define, for some constant $C_1$ and some $0 < \epsilon < 1$,
$$\lambda_n = \frac{C_1}{\epsilon}\sqrt{\frac{\log n}{n^{2/3}}}. \qquad (18)$$
Consider now the set
$$T_n = \{\Delta : \Delta = B - \Theta_0,\ B, \Theta_0 \succ 0,\ \|\Delta\|_F = M r_n\}, \quad \text{where } r_n = \sqrt{\frac{(p+s)\log n}{n^{2/3}}} \to 0. \qquad (19)$$

Claim 9 Under A 4, for all $\Delta \in T_n$, with $r_n = o(1)$ as in (19), $\Theta_0 + v\Delta \succ 0$ for all $v$ in an open interval $I \supset [0,1]$.

Proof: It is sufficient to show that $\Theta_0 + (1+\varepsilon)\Delta \succ 0$ and $\Theta_0 - \varepsilon\Delta \succ 0$ for some $1 > \varepsilon > 0$. Indeed, $\varphi_{\min}(\Theta_0 + (1+\varepsilon)\Delta) \ge \varphi_{\min}(\Theta_0) - (1+\varepsilon)\|\Delta\|_2 > 0$ for $\varepsilon < 1$, given that $\varphi_{\min}(\Theta_0) \ge 2M r_n$ and $\|\Delta\|_2 \le \|\Delta\|_F = M r_n$. Similarly, $\varphi_{\min}(\Theta_0 - \varepsilon\Delta) \ge \varphi_{\min}(\Theta_0) - \varepsilon\|\Delta\|_2 > 0$ for $\varepsilon < 1$.

Thus $\log\det(\Theta_0 + v\Delta)$ is infinitely differentiable on the open interval $I \supset [0,1]$ of $v$. This allows us to use Taylor's formula with integral remainder to obtain the following lemma.

Lemma 10 With probability $1 - 1/n^c$ for some $c \ge 2$, $G(\Delta) > 0$ for all $\Delta \in T_n$.

Proof: Let us use $A$ as shorthand for
$$\mathrm{vec}(\Delta)^T \left[\int_0^1 (1-v)(\Theta_0 + v\Delta)^{-1} \otimes (\Theta_0 + v\Delta)^{-1}\, dv\right] \mathrm{vec}(\Delta),$$
where $\otimes$ is the Kronecker product (if $W = (w_{ij})_{m\times n}$ and $P = (b_{kl})_{p\times q}$, then $W \otimes P = (w_{ij}P)_{mp\times nq}$), and $\mathrm{vec}(\Delta) \in \mathbb{R}^{p^2}$ is $\Delta_{p\times p}$ vectorized.
Now, the Taylor expansion gives
$$\log|\Theta_0 + \Delta| - \log|\Theta_0| = \frac{d}{dv}\log|\Theta_0 + v\Delta|\Big|_{v=0} + \int_0^1 (1-v)\frac{d^2}{dv^2}\log\det(\Theta_0 + v\Delta)\,dv = \mathrm{tr}(\Sigma_0\Delta) - A,$$
where, by symmetry, $\mathrm{tr}(\Sigma_0\Delta) = \mathrm{tr}((\Theta - \Theta_0)\Sigma_0)$. Hence
$$G(\Delta) = A + \mathrm{tr}\big(\Delta(S_n(t_0) - \Sigma_0)\big) + \lambda_n\big(|\Theta_0 + \Delta|_1 - |\Theta_0|_1\big). \qquad (20)$$
For an index set $S$ and a matrix $W = [w_{ij}]$, write $W_S \equiv (w_{ij} I((i,j) \in S))$, where $I(\cdot)$ is an indicator function. Recall $S = \{(i,j): \theta_{0,ij} \ne 0,\ i \ne j\}$ and let $S^c = \{(i,j): \theta_{0,ij} = 0,\ i \ne j\}$. Hence $\Delta = \Delta^+ + \Delta_S + \Delta_{S^c}$ in our notation. Note that $\Theta^-_{0,S^c} = 0$, hence
$$|\Theta^-_{0,S} + \Delta_S|_1 \ge |\Theta^-_{0,S}|_1 - |\Delta_S|_1, \quad |\Theta^-_{0,S^c} + \Delta_{S^c}|_1 = |\Delta_{S^c}|_1, \quad |\Theta^+_0 + \Delta^+|_1 \ge |\Theta^+_0|_1 - |\Delta^+|_1,$$
where the last two steps follow from the triangle inequality. Therefore
$$|\Theta_0 + \Delta|_1 - |\Theta_0|_1 = \big(|\Theta^+_0 + \Delta^+|_1 - |\Theta^+_0|_1\big) + \big(|\Theta^-_0 + \Delta^-|_1 - |\Theta^-_0|_1\big) \ge |\Delta_{S^c}|_1 - |\Delta_S|_1 - |\Delta^+|_1. \qquad (21)$$
From $|\mathrm{tr}(\Delta(S_n - \Sigma_0))| \le \delta_n|\Delta|_1$, hence by (21),
$$\mathrm{tr}\big(\Delta(S_n - \Sigma_0)\big) + \lambda_n\big(|\Theta_0 + \Delta|_1 - |\Theta_0|_1\big) \ge -\delta_n\big(|\Delta^+|_1 + |\Delta_S|_1 + |\Delta_{S^c}|_1\big) + \lambda_n\big(|\Delta_{S^c}|_1 - |\Delta_S|_1 - |\Delta^+|_1\big)$$
$$= -(\lambda_n + \delta_n)\big(|\Delta^+|_1 + |\Delta_S|_1\big) + (\lambda_n - \delta_n)|\Delta_{S^c}|_1 \ge -(\lambda_n + \delta_n)\big(|\Delta^+|_1 + |\Delta_S|_1\big). \qquad (22)$$
Now, from Lemma 2, $\max_{j,k}|S_n(t,j,k) - \sigma(t,j,k)| = O_P\big(\sqrt{\log n}/n^{1/3}\big) \triangleq O_P(\delta_n)$. By (9), with probability $1 - 1/n^2$, $\delta_n \le \epsilon\lambda_n$, so that
$$(\lambda_n + \delta_n)\big(|\Delta^+|_1 + |\Delta_S|_1\big) \le (\lambda_n + \delta_n)\big(\sqrt{p}\,\|\Delta^+\|_F + \sqrt{s}\,\|\Delta_S\|_F\big) \le (\lambda_n + \delta_n)\max\{\sqrt{p},\sqrt{s}\}\cdot 2\|\Delta\|_F \le \lambda_n(1+\epsilon)\sqrt{p+s}\cdot 2\|\Delta\|_F. \qquad (23)$$
Combining (20), (22), and (23), we have with probability $1 - 1/n^c$, for all $\Delta \in T_n$,
$$G(\Delta) \ge A - (\lambda_n + \delta_n)\big(|\Delta^+|_1 + |\Delta_S|_1\big) \ge \frac{\underline{k}^2}{2(1+\tau)^2}\|\Delta\|_F^2 - 2\lambda_n(1+\epsilon)\sqrt{p+s}\,\|\Delta\|_F$$
$$= \|\Delta\|_F^2\left(\frac{\underline{k}^2}{2(1+\tau)^2} - \frac{2\lambda_n(1+\epsilon)\sqrt{p+s}}{\|\Delta\|_F}\right) = \|\Delta\|_F^2\left(\frac{\underline{k}^2}{2(1+\tau)^2} - \frac{2\lambda_n(1+\epsilon)\sqrt{p+s}}{M r_n}\right) > 0$$
for $M$ sufficiently large, where the bound on $A$ comes from Lemma 11 of [RBLZ07].

Lemma 11 ([RBLZ07]) For some $\tau = o(1)$, under A 4,
$$\mathrm{vec}(\Delta)^T\left[\int_0^1 (1-v)(\Theta_0 + v\Delta)^{-1} \otimes (\Theta_0 + v\Delta)^{-1}\,dv\right]\mathrm{vec}(\Delta) \ge \frac{\underline{k}^2}{2(1+\tau)^2}\|\Delta\|_F^2, \quad \text{for all } \Delta \in T_n.$$

By Claim 12 and the fact that $G(\widehat\Delta_n) \le G(0) = 0$, we have the following: if $G(\Delta) > 0$ for all $\Delta \in T_n$, then $\widehat\Delta_n \notin (T_n \cup V_n)$, that is, $\|\widehat\Delta_n\|_F < M r_n$, given that $\widehat\Delta_n = \widehat\Theta_n - \Theta_0$, where $\widehat\Theta_n, \Theta_0 \succ 0$. Therefore
$$P\big(\|\widehat\Delta_n\|_F \ge M r_n\big) = 1 - P\big(\|\widehat\Delta_n\|_F < M r_n\big) \le 1 - P\big(G(\Delta) > 0\ \forall\Delta \in T_n\big) = P\big(G(\Delta) \le 0 \text{ for some } \Delta \in T_n\big) < \frac{1}{n^c}.$$
We thus establish that $\|\widehat\Delta_n\|_F = O_P(M r_n)$.

Claim 12 If $G(\Delta) > 0$ for all $\Delta \in T_n$, then $G(\Delta) > 0$ for all $\Delta$ in
$$V_n = \{\Delta : \Delta = D - \Theta_0,\ D \succ 0,\ \|\Delta\|_F > M r_n\}, \quad \text{for } r_n \text{ as in (19)}.$$
Hence if $G(\Delta) > 0$ for all $\Delta \in T_n$, then $G(\Delta) > 0$ for all $\Delta \in T_n \cup V_n$.

Proof: By contradiction, suppose $G(\Delta') \le 0$ for some $\Delta' \in V_n$. Let $\theta_0 = M r_n/\|\Delta'\|_F$ and $\Delta_0 = \theta_0\Delta' = (1-\theta_0)\cdot 0 + \theta_0\Delta'$, where $0 < \theta_0 < 1$ by the definition of $V_n$. Then $\|\Delta_0\|_F = M r_n$, and $\Theta_0 + \Delta_0 \succ 0$ by Claim 13, so $\Delta_0 \in T_n$. Hence by convexity of $G(\Delta)$, we have $G(\Delta_0) \le (1-\theta_0)G(0) + \theta_0 G(\Delta') \le 0$, contradicting $G(\Delta_0) > 0$ for $\Delta_0 \in T_n$.

Claim 13 Let $B$ be a $p\times p$ matrix. If $B \succ 0$ and $B + D \succ 0$, then $B + vD \succ 0$ for all $v \in [0,1]$.

Proof: We only need to check $v \in (0,1)$, where $1 - v > 0$. For all $x \in \mathbb{R}^p$, $x \ne 0$, $B \succ 0$ and $B + D \succ 0$ give $x^T B x > 0$ and $x^T(B + D)x > 0$; hence $x^T D x > -x^T B x$. Thus $x^T(B + vD)x = x^T B x + v x^T D x > (1-v)x^T B x > 0$.

5 Large Deviation Inequalities

Before we go on, we explain the notation that we follow throughout this section. We switch notation from $t$ to $x$ and form a regression problem for non-iid data. Given the interval $[0,1]$, the point of interest is $x_0 = 1$. We form a design matrix by sampling a set of $n$ $p$-dimensional Gaussian random vectors $Z^t$ at $t = 0, 1/n, 2/n, \ldots, 1$, where the $Z^t \sim N(0, \Sigma(t))$ are independently distributed. In this section, we index the random vectors $Z$ with $k = 0, 1, \ldots, n$ such that $Z_k = Z^t$ for $k = nt$, with corresponding covariance matrix denoted by $\Sigma_k$. Hence
$$Z_k = (Z_{k1}, \ldots, Z_{kp})^T \sim N(0, \Sigma_k), \quad \forall k. \qquad (24)$$
These are independent but not identically distributed. We will need to generalize the usual inequalities. In Section A, via a boxcar kernel function, we use moment generating functions to show that for $\widetilde\Sigma = \frac{1}{n}\sum_{k=1}^n Z_k Z_k^T$ and $\epsilon$ sufficiently small,
$$P^n\big(|\widetilde\Sigma_{ij} - \sigma_{ij}(x_0)| > \epsilon\big) < \exp(-cn\epsilon^2), \qquad (25)$$
where $P^n = P_1 \times \cdots \times P_n$ denotes the product measure.
We look across the $n$ time-varying Gaussian vectors and, roughly, compare $\widetilde\Sigma_{ij}$ with $\sigma_{ij}(x_0)$, where $\Sigma(x_0) = \Sigma_n$ is the covariance matrix at the end of the window, i.e., at $t_0$. Furthermore, we derive inequalities in Section 5.1 for a general kernel function.

5.1 Bounds For Kernel Smoothing

In this section, we derive large deviation inequalities for the covariance matrix based on kernel regression estimators. Recall that we assume the symmetric nonnegative kernel function $K$ has bounded support $[-1,1]$, as in A 1. This kernel has the properties
$$\int_{-1}^{1} K(v)\,dv = 1 \qquad (26) \qquad \text{and} \qquad \int_{-1}^{1} v^2 K(v)\,dv \le 1, \qquad (27)$$
due to the fact that $K(v)$ has compact support in $[-1,1]$ and $h \le 1$.

In order to estimate $\Sigma_{t_0}$, instead of taking an average of sample variances and covariances over the last $n$ samples, we use a weighting scheme such that data close to $t_0$ receive larger weights than data that are far away. Let $\Sigma(x) = (\sigma_{ij}(x))$. Let us define $x_0 = t_0 = 1$ and, for $i = 1, \ldots, n$, $x_i = t_0 - \frac{n-i}{n}$, and
$$\ell_i(x_0) = \frac{K\big(\frac{x_i - x_0}{h}\big)}{\sum_{j=1}^n K\big(\frac{x_j - x_0}{h}\big)} \approx \frac{2}{nh}K\Big(\frac{x_i - x_0}{h}\Big), \qquad (28)$$
where the approximation is due to replacing the sum with a Riemann integral:
$$\sum_{i=1}^n \frac{2}{nh}K\Big(\frac{x_i - x_0}{h}\Big) \approx 2\int_{-1}^{0} K(v)\,dv = 1.$$

Let $\Sigma_k = (\sigma_{ij}(x_k))$, $k = 1, \ldots, n$, where $\sigma_{ij}(x_k) = \mathrm{cov}(Z_{ki}, Z_{kj}) = \rho_{ij}(x_k)\sigma_i(x_k)\sigma_j(x_k)$ and $\rho_{ij}(x_k)$ is the correlation coefficient between $Z_i$ and $Z_j$ at time $x_k$. Recall that the products $Z_{ki}Z_{kj}$ are independent across $k = 1, \ldots, n$, with $\mathbb{E}(Z_{ki}Z_{kj}) = \sigma_{ij}(x_k)$. Let
$$\mu_1(i,j) = \frac{1}{n}\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_{ij}(x_k).$$
We thus decompose, for the point of interest $x_0$,
$$\sum_k \ell_k(x_0)Z_{ki}Z_{kj} - \sigma_{ij}(x_0) = \Big[\sum_k \ell_k(x_0)Z_{ki}Z_{kj} - \mathbb{E}\sum_k \ell_k(x_0)Z_{ki}Z_{kj}\Big] + \Big[\sum_k \ell_k(x_0)\sigma_{ij}(x_k) - \sigma_{ij}(x_0)\Big], \qquad (29)$$
where $\mathbb{E}\sum_k \ell_k(x_0)Z_{ki}Z_{kj} = \sum_k \ell_k(x_0)\sigma_{ij}(x_k) \approx \mu_1(i,j)$, and bound the two terms on the right-hand side separately.

Before we start our analysis of the large deviations, we first look at the bias term.

Lemma 14 Suppose there exists $C > 0$ such that $\max_{i,j}\sup_t |\sigma''_{ij}(t)| \le C$. Then for all $t \in [0,1]$,
$$\max_{i,j} |\mathbb{E} S_n(t,i,j) - \sigma_{ij}(t)| = O(h). \qquad (30)$$

Proof: W.l.o.g. let $t = t_0$, hence $\mathbb{E} S_n(t,i,j) = \mu_1(i,j)$. We use the Riemann integral to approximate the sum:
$$\mu_1(i,j) = \frac{1}{n}\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_{ij}(x_k) \approx \frac{2}{h}\int_0^{x_n} K\Big(\frac{u - x_0}{h}\Big)\sigma_{ij}(u)\,du = 2\int_{-1/h}^{0} K(v)\,\sigma_{ij}(x_0 + hv)\,dv.$$
We now use Taylor's formula to expand $\sigma_{ij}(x_0 + hv)$ and obtain
$$2\int_{-1/h}^{0} K(v)\,\sigma_{ij}(x_0 + hv)\,dv = 2\int_{-1}^{0} K(v)\Big[\sigma_{ij}(x_0) + hv\,\sigma'_{ij}(x_0) + \frac{\sigma''_{ij}(y(v))(hv)^2}{2}\Big]dv$$
$$= \sigma_{ij}(x_0) + 2\int_{-1}^{0} K(v)\Big[hv\,\sigma'_{ij}(x_0) + \frac{C(hv)^2}{2}\Big]dv \le \sigma_{ij}(x_0) + 2h\,\sigma'_{ij}(x_0)\int_{-1}^{0} vK(v)\,dv + Ch^2\int_{-1}^{0} v^2K(v)\,dv,$$
where $|y(v) - x_0| \le h|v|$. Thus $\mu_1(i,j) - \sigma_{ij}(x_0) = O(h)$.
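The following small simulation (ours, for illustration only) implements the one-sided kernel weights in (28) and checks numerically that the smoothed covariance entry at $x_0 = 1$ tracks $\sigma_{ij}(x_0)$, with an error that shrinks as the bandwidth $h$ decreases, consistent with Lemma 14. The covariance path $\sigma_{12}(x) = 0.3\sin(\pi x)$ is an arbitrary smooth choice, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma_at(x):
    """An arbitrary smooth 2x2 covariance path on [0, 1]."""
    rho = 0.3 * np.sin(np.pi * x)
    return np.array([[1.0, rho], [rho, 1.0]])

def mean_error(n, h, n_rep=200):
    """Average |S_n(x0; 1,2) - sigma_12(x0)| at x0 = 1 over n_rep replications."""
    xs = np.arange(1, n + 1) / n                              # x_i = i/n, x_0 = 1
    u = (xs - 1.0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov kernel
    w /= w.sum()                                              # weights l_i(x_0) as in (28)
    target = sigma_at(1.0)[0, 1]
    errs = []
    for _ in range(n_rep):
        Z = np.stack([rng.multivariate_normal(np.zeros(2), sigma_at(x)) for x in xs])
        S = (Z * w[:, None]).T @ Z                            # kernel-weighted covariance
        errs.append(abs(S[0, 1] - target))
    return np.mean(errs)

for h in (0.4, 0.2, 0.1):
    print(f"h = {h:.2f}  mean |S_n - sigma_12(x_0)| = {mean_error(2000, h):.4f}")
```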
We now move on to the large deviation bound for all entries of the smoothed empirical covariance matrix.

Lemma 15 For
$$\epsilon \le \frac{C_1\big(\sigma_i^2(x_0)\sigma_j^2(x_0) + \sigma_{ij}^2(x_0)\big)}{\max_{k=1,\ldots,n}\big\{2K\big(\frac{x_k - x_0}{h}\big)\sigma_i(x_k)\sigma_j(x_k)\big\}},$$
where $C_1$ is defined in Claim 18, there is some $C > 0$ such that
$$P\big(|S_n(t,i,j) - \mathbb{E} S_n(t,i,j)| > \epsilon\big) \le \exp\big(-C n h \epsilon^2\big).$$

Proof: Let us define $A_k = Z_{ki}Z_{kj} - \sigma_{ij}(x_k)$. Then
$$P\big(|S_n(t,i,j) - \mathbb{E} S_n(t,i,j)| > \epsilon\big) = P\Big(\Big|\sum_{k=1}^n \ell_k(x_0)Z_{ki}Z_{kj} - \sum_{k=1}^n \ell_k(x_0)\sigma_{ij}(x_k)\Big| > \epsilon\Big) = P\Big(\Big|\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k\Big| > n\epsilon\Big),$$
using $n\ell_k(x_0) \approx \frac{2}{h}K\big(\frac{x_k - x_0}{h}\big)$ as in (28). For every $t > 0$, Markov's inequality gives
$$P\Big(\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k > n\epsilon\Big) \le e^{-nt\epsilon}\,\mathbb{E}\exp\Big(t\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k\Big).$$
Before we continue, for a given $t$, let us first define the following quantities, where $i,j$ are omitted from $\mu_1(i,j), \ldots, \mu_4(i,j)$:
$$a_k = \frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)\big(\sigma_i(x_k)\sigma_j(x_k) + \sigma_{ij}(x_k)\big), \qquad b_k = \frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)\big(\sigma_i(x_k)\sigma_j(x_k) - \sigma_{ij}(x_k)\big),$$
$$\mu_1 = \frac{1}{n}\sum_{k=1}^n \frac{a_k - b_k}{2t}, \quad \mu_2 = \frac{1}{n}\sum_{k=1}^n \frac{a_k^2 + b_k^2}{4t^2}, \quad \mu_3 = \frac{1}{n}\sum_{k=1}^n \frac{a_k^3 - b_k^3}{6t^3}, \quad \mu_4 = \frac{1}{n}\sum_{k=1}^n \frac{a_k^4 + b_k^4}{8t^4},$$
$$M = \max_{k=1,\ldots,n} \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_i(x_k)\sigma_j(x_k).$$
We now establish some convenient comparisons; see Sections B.1 and B.2 for their proofs.

Claim 16 $\mu_3 \le \frac{4}{3}\mu_2 M$ and $\mu_4 \le 2\mu_2 M^2$, where both equalities are attained at $\rho_{ij}(x_k) = 1$, for all $k$.

Lemma 17 For $b_k \le a_k \le \frac{1}{2}$, for all $k$,
$$\frac{1}{2}\sum_{k=1}^n \ln\frac{1}{(1-a_k)(1+b_k)} \le nt\mu_1 + nt^2\mu_2 + nt^3\mu_3 + \frac{9}{5}nt^4\mu_4.$$

To show the following, we first replace the sum with a Riemann integral, and then use Taylor's formula to approximate $\sigma_i(x_k)$, $\sigma_j(x_k)$, and $\sigma_{ij}(x_k)$, $k = 1,\ldots,n$, by $\sigma_i$, $\sigma_j$, $\sigma_{ij}$ and their first derivatives at $x_0$, plus some remainder terms; see Section B.3 for details.

Claim 18 For $h = n^{-\alpha}$ for some $1 > \alpha > 0$, there exists some constant $C_1 > 0$ such that
$$\mu_2(i,j) \approx \frac{C_1\big(\sigma_i^2(x_0)\sigma_j^2(x_0) + \sigma_{ij}^2(x_0)\big)}{h}.$$

Lemma 19 computes the moment generating function of $\frac{2t}{h}K\big(\frac{x_k - x_0}{h}\big)Z_{ki}Z_{kj}$. The proof proceeds exactly as that of Lemma 21 after substituting $t$ with $\frac{2t}{h}K\big(\frac{x_k - x_0}{h}\big)$ everywhere.

Lemma 19 Let $\frac{2t}{h}K\big(\frac{x_k - x_0}{h}\big)\big(1 + \rho_{ij}(x_k)\big)\sigma_i(x_k)\sigma_j(x_k) < 1$ for all $k$, so that $b_k \le a_k < 1$. Then
$$\mathbb{E}\exp\Big(\frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)Z_{ki}Z_{kj}\Big) = \big((1-a_k)(1+b_k)\big)^{-1/2}.$$

Remark 20 Thus when we set $t = \frac{\epsilon}{4\mu_2}$, the bound on $\epsilon$ implies that $b_k \le a_k \le 1/2$ for all $k$, since
$$a_k \le \frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)\cdot 2\sigma_i(x_k)\sigma_j(x_k) \le 2tM \le \frac{1}{2}.$$

We can now finish showing the large deviation bound for $\max_{i,j}|S_{i,j} - \mathbb{E} S_{i,j}|$. Given that $A_1, \ldots, A_n$ are independent, we have
$$\mathbb{E}\exp\Big(t\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k\Big) = \prod_{k=1}^n \mathbb{E}\exp\Big(\frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k\Big) = \prod_{k=1}^n \exp\Big(-\frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_{ij}(x_k)\Big)\,\mathbb{E}\exp\Big(\frac{2t}{h}K\Big(\frac{x_k - x_0}{h}\Big)Z_{ki}Z_{kj}\Big). \qquad (31)$$
By (31) and Lemma 19,
$$e^{-nt\epsilon}\,\mathbb{E}\exp\Big(t\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k\Big) = \exp\Big(-nt\epsilon - nt\mu_1(i,j) + \frac{1}{2}\sum_{k=1}^n \ln\frac{1}{(1-a_k)(1+b_k)}\Big) \le \exp\Big(-nt\epsilon + nt^2\mu_2 + nt^3\mu_3 + \frac{9}{5}nt^4\mu_4\Big),$$
where the last step is due to Remark 20 and Lemma 17. Now let us consider taking $t$ that minimizes $-nt\epsilon + nt^2\mu_2 + nt^3\mu_3 + \frac{9}{5}nt^4\mu_4$; let $t = \frac{\epsilon}{4\mu_2}$. Given that $\frac{\epsilon}{\mu_2} \le \frac{1}{M}$, Claim 16 and Claim 18 yield
$$P\Big(\sum_{k=1}^n \frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)A_k > n\epsilon\Big) \le \exp\Big(-nt\epsilon + nt^2\mu_2 + nt^3\mu_3 + \frac{9}{5}nt^4\mu_4\Big)$$
$$\le \exp\Big(-\frac{n\epsilon^2}{4\mu_2} + \frac{n\epsilon^2}{16\mu_2} + \frac{n\epsilon^2}{48\mu_2} + \frac{9n\epsilon^2}{2560\mu_2}\Big) \le \exp\Big(-\frac{3n\epsilon^2}{20\mu_2}\Big) \le \exp\Big(-\frac{3nh\epsilon^2}{20C_1\big(\sigma_i^2(x_0)\sigma_j^2(x_0) + \sigma_{ij}^2(x_0)\big)}\Big).$$
Finally, let us check the requirement on $\epsilon$: by Claim 18, the assumption
$$\epsilon \le \frac{C_1\big(\sigma_i^2(x_0)\sigma_j^2(x_0) + \sigma_{ij}^2(x_0)\big)}{\max_{k=1,\ldots,n}\big\{2K\big(\frac{x_k - x_0}{h}\big)\sigma_i(x_k)\sigma_j(x_k)\big\}}$$
amounts to $\epsilon \le \mu_2/M$, which guarantees $a_k \le 2tM \le 1/2$, as required in Remark 20. The lower tail is bounded in the same way, which completes the proof.

For completeness, we compute the moment generating function of $Z_{k,i}Z_{k,j}$.

Lemma 21 Let $t(1 + \rho_{ij}(x_k))\sigma_i(x_k)\sigma_j(x_k) < 1$ for all $k$, so that $b_k \le a_k < 1$. Omitting $x_k$ everywhere,
$$\mathbb{E} e^{tZ_{k,i}Z_{k,j}} = \Big(\big(1 - t(\sigma_i\sigma_j + \sigma_{ij})\big)\big(1 + t(\sigma_i\sigma_j - \sigma_{ij})\big)\Big)^{-1/2}.$$

Proof: W.l.o.g., let $i = 1$ and $j = 2$.
Conditioning on $Z_2$, with $Z_1 \mid Z_2 \sim N\big(\rho_{12}\frac{\sigma_1}{\sigma_2}Z_2,\ \sigma_1^2(1-\rho_{12}^2)\big)$,
$$\mathbb{E} e^{tZ_1Z_2} = \mathbb{E}\big[\mathbb{E}\big(e^{tZ_1Z_2}\mid Z_2\big)\big] = \mathbb{E}\exp\Big(t\rho_{12}\frac{\sigma_1}{\sigma_2}Z_2^2 + \frac{t^2\sigma_1^2(1-\rho_{12}^2)}{2}Z_2^2\Big)$$
$$= \Big(1 - 2\sigma_2^2\Big(t\rho_{12}\frac{\sigma_1}{\sigma_2} + \frac{t^2\sigma_1^2(1-\rho_{12}^2)}{2}\Big)\Big)^{-1/2} = \Big(1 - \big(2t\rho_{12}\sigma_1\sigma_2 + t^2\sigma_1^2\sigma_2^2(1-\rho_{12}^2)\big)\Big)^{-1/2}$$
$$= \Big(\big(1 - t(1+\rho_{12})\sigma_1\sigma_2\big)\big(1 + t(1-\rho_{12})\sigma_1\sigma_2\big)\Big)^{-1/2},$$
where the Gaussian moment generating function requires $2t\rho_{12}\sigma_1\sigma_2 + t^2\sigma_1^2\sigma_2^2(1-\rho_{12}^2) < 1$. One can check that if we require $t(1+\rho_{12})\sigma_1\sigma_2 < 1$, then $t\sigma_1\sigma_2 < 1 - t\rho_{12}\sigma_1\sigma_2$ and hence $t^2\sigma_1^2\sigma_2^2 < (1 - t\rho_{12}\sigma_1\sigma_2)^2$, so this condition holds and the lemma follows.

6 Smoothness and Sparsity of $\Sigma_t$ via $\Theta_t = \Sigma_t^{-1}$

In this section we show that if the entries of $\Theta(x) = (\theta_{ij}(x))$ are smooth, twice differentiable functions of $x \in [0,1]$, i.e., $\theta'_{ij}(x) < \infty$ and $\theta''_{ij}(x) < \infty$ for all $x \in [0,1]$ and all $i,j$, and satisfy A 3, then the smoothness conditions of Lemma 2 are satisfied. The following is a standard result in matrix analysis.

Lemma 22 Let $\Sigma(t) \in \mathbb{R}^{p\times p}$ have entries that are differentiable functions of $t \in [0,1]$. Assuming that $\Sigma(t)$ is always non-singular, then
$$\frac{d}{dt}\big[\Sigma(t)^{-1}\big] = -\Sigma(t)^{-1}\frac{d}{dt}\big[\Sigma(t)\big]\Sigma(t)^{-1}.$$

Lemma 23 Suppose $\Sigma(t) \in \mathbb{R}^{p\times p}$ has entries that are twice differentiable functions of $t$. Assuming that $\Sigma(t)$ is always non-singular, then
$$\frac{d^2}{dt^2}\big[\Sigma(t)^{-1}\big] = \Sigma(t)^{-1}D(t)\Sigma(t)^{-1}, \quad \text{where} \quad D(t) = 2\frac{d}{dt}\big[\Sigma(t)\big]\Sigma(t)^{-1}\frac{d}{dt}\big[\Sigma(t)\big] - \frac{d^2}{dt^2}\big[\Sigma(t)\big].$$

Proof: The existence of the second order derivatives of the entries of $\Sigma(t)^{-1}$ is due to the fact that $\Sigma(t)^{-1}$ and $\frac{d}{dt}[\Sigma(t)]$ are both differentiable for $t \in [0,1]$; indeed, by Lemma 22,
$$\frac{d^2}{dt^2}\big[\Sigma(t)^{-1}\big] = -\frac{d}{dt}\Big(\Sigma(t)^{-1}\frac{d}{dt}\big[\Sigma(t)\big]\Sigma(t)^{-1}\Big)$$
$$= -\frac{d}{dt}\big[\Sigma(t)^{-1}\big]\frac{d}{dt}\big[\Sigma(t)\big]\Sigma(t)^{-1} - \Sigma(t)^{-1}\frac{d^2}{dt^2}\big[\Sigma(t)\big]\Sigma(t)^{-1} - \Sigma(t)^{-1}\frac{d}{dt}\big[\Sigma(t)\big]\frac{d}{dt}\big[\Sigma(t)^{-1}\big]$$
$$= \Sigma(t)^{-1}\Big(2\frac{d}{dt}\big[\Sigma(t)\big]\Sigma(t)^{-1}\frac{d}{dt}\big[\Sigma(t)\big] - \frac{d^2}{dt^2}\big[\Sigma(t)\big]\Big)\Sigma(t)^{-1},$$
hence the lemma holds by the definition of $D(t)$.

Let $\Sigma(x) = (\sigma_{ij}(x))$, $x \in [0,1]$, and write $\Sigma(x) = (\sigma_1(x), \sigma_2(x), \ldots, \sigma_p(x))$, where $\sigma_i(x) \in \mathbb{R}^p$ denotes a column vector. Applying Lemmas 22 and 23 to $\Theta(x)$ and $\Sigma(x) = \Theta(x)^{-1}$,
$$\sigma'_{ij}(x) = -\sigma_i^T(x)\,\Theta'(x)\,\sigma_j(x), \quad \forall i,j,\ \forall x \in [0,1], \qquad (32)$$
$$\sigma''_{ij}(x) = \sigma_i^T(x)\,D(x)\,\sigma_j(x), \quad \forall i,j,\ \forall x \in [0,1], \qquad (33)$$
where $D(x) = 2\Theta'(x)\Sigma(x)\Theta'(x) - \Theta''(x)$.

Lemma 24 Given A 2 and A 3, for all $x \in [0,1]$, $|\sigma'_{ij}(x)| \le S_0^2\sqrt{S_1} < \infty$.

Proof:
$$|\sigma'_{ij}(x)| = |\sigma_i^T(x)\Theta'(x)\sigma_j(x)| \le \max_{k,l}|\sigma_{ki}(x)\sigma_{lj}(x)|\sum_{k=1}^p\sum_{l=1}^p |\theta'_{kl}(x)| \le S_0^2\sqrt{S_1},$$
since $|\sigma_{kl}(x)| \le \sigma_k(x)\sigma_l(x) \le S_0$ by A 2, and $\sum_{k,l}|\theta'_{kl}(x)| \le \sqrt{S_1}$ by A 3.

Theorem 25 Given A 2 and A 3, for all $i,j$,
$$\sup_{x\in[0,1]} |\sigma''_{ij}(x)| \le 2S_0^3S_1 + S_0^2S_2 < \infty.$$

Proof: By (33) and the triangle inequality,
$$|\sigma''_{ij}(x)| = |\sigma_i^T(x)D(x)\sigma_j(x)| \le \max_{k,l}|\sigma_{ki}(x)\sigma_{lj}(x)|\sum_{k=1}^p\sum_{l=1}^p |D_{kl}(x)| \le S_0^2\sum_{k=1}^p\sum_{l=1}^p |D_{kl}(x)|,$$
and
$$\sum_{k=1}^p\sum_{l=1}^p |D_{kl}(x)| \le 2\sum_{k=1}^p\sum_{l=1}^p \big|\theta_k'^T(x)\Sigma(x)\theta_l'(x)\big| + \sum_{k=1}^p\sum_{l=1}^p |\theta''_{kl}(x)| \le 2\max_{i,j}|\sigma_{ij}(x)|\sum_{k,l,i,j}|\theta'_{ki}(x)\theta'_{lj}(x)| + S_2 \le 2S_0S_1 + S_2,$$
where by A 3, $\sum_{k,l}|\theta''_{kl}(x)| \le S_2$ and $\sum_{k,l,i,j}|\theta'_{ki}(x)\theta'_{lj}(x)| \le S_1$, and by A 2, $\max_{i,j}|\sigma_{ij}(x)| \le S_0$. Hence $|\sigma''_{ij}(x)| \le 2S_0^3S_1 + S_0^2S_2$.
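As a quick sanity check of Lemma 22 (ours, not part of the paper), the following sketch compares a finite-difference derivative of $\Sigma(t)^{-1}$ with the closed form $-\Sigma(t)^{-1}\Sigma'(t)\Sigma(t)^{-1}$ on an arbitrary smooth $2\times 2$ matrix path.

```python
import numpy as np

def Sigma(t):
    """An arbitrary smooth, positive definite matrix path on [0, 1]."""
    rho = 0.5 * np.sin(t)
    return np.array([[1.0 + t, rho], [rho, 2.0]])

def dSigma(t):
    """Entrywise derivative of Sigma(t)."""
    drho = 0.5 * np.cos(t)
    return np.array([[1.0, drho], [drho, 0.0]])

t, eps = 0.3, 1e-6
inv = np.linalg.inv
finite_diff = (inv(Sigma(t + eps)) - inv(Sigma(t - eps))) / (2 * eps)
closed_form = -inv(Sigma(t)) @ dSigma(t) @ inv(Sigma(t))   # Lemma 22
print(np.max(np.abs(finite_diff - closed_form)))           # near zero, up to O(eps^2) error
```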
7 Some Implications of a Very Sparse $\Theta$

We use $L^1$ to denote Lebesgue measure on $\mathbb{R}$. The aim of this section is to prove some bounds that correspond to A 3, but only for $L^1$-a.e. $x \in [0,1]$, based on a single sparsity assumption on $\Theta$ as in A 5. We let $E \subset [0,1]$ represent the "bad" set with $L^1(E) = 0$, and "$L^1$-a.e. $x \in [0,1]$" refers to points in the set $[0,1]\setminus E$, where $L^1([0,1]\setminus E) = 1$. When $\|\Theta(x)\|_0 \le s + p$ for all $x \in [0,1]$, we immediately obtain Theorem 26, whose proof appears in Section 7.1. We would like to point out that although we apply Theorem 26 to $\Theta$ and deduce smoothness of $\Sigma$, we could apply it the other way around. In particular, it might be interesting to apply it to the correlation coefficient matrix $(\rho_{ij})$, whose diagonal entries remain invariant. We use $\Theta'(x)$ and $\Theta''(x)$ to denote $(\theta'_{ij}(x))$ and $(\theta''_{ij}(x))$ respectively, for all $x$.

A 5 Assume that $\|\Theta(x)\|_0 \le s + p$ for all $x \in [0,1]$.

We state a theorem, whose proof is in Section 7.1, and a corollary.

Theorem 26 Under A 5, we have
$$\|\Theta''(x)\|_0 \le \|\Theta'(x)\|_0 \le \|\Theta(x)\|_0 \le s + p \quad \text{for } L^1\text{-a.e. } x \in [0,1].$$

A 6 There exist $S_4, S_5 < \infty$ such that
$$S_4 = \max_{ij}\|\theta'_{ij}\|_\infty^2 \quad \text{and} \quad S_5 = \max_{ij}\|\theta''_{ij}\|_\infty. \qquad (34)$$

Corollary 27 Given A 2, A 5 and A 6, for $L^1$-a.e. $x \in [0,1]$,
$$|\sigma'_{ij}(x)| \le S_0^2\sqrt{S_4}\,(s+p) < \infty. \qquad (35)$$

Proof: By the proof of Lemma 24, $|\sigma'_{ij}(x)| \le S_0^2\sum_{k=1}^p\sum_{l=1}^p|\theta'_{kl}(x)|$. Hence by Theorem 26, for $L^1$-a.e. $x \in [0,1]$,
$$|\sigma'_{ij}(x)| \le S_0^2\max_{k,l}\|\theta'_{kl}\|_\infty\,\|\Theta'(x)\|_0 \le S_0^2\sqrt{S_4}\,(s+p).$$

Lemma 28 Under A 2, A 5 and A 6, for $L^1$-a.e. $x \in [0,1]$,
$$\sum_{k,l,i,j}|\theta'_{ki}(x)\theta'_{lj}(x)| \le (s+p)^2\max_{ij}\|\theta'_{ij}\|_\infty^2 \le S_4(s+p)^2, \qquad \sum_{k,l}|\theta''_{kl}(x)| \le (s+p)\max_{ij}\|\theta''_{ij}\|_\infty \le S_5(s+p),$$
hence
$$\operatorname*{ess\,sup}_{x\in[0,1]} |\sigma''_{ij}(x)| \le 2S_0^3(s+p)^2S_4 + S_0^2(s+p)S_5.$$

Proof: By the triangle inequality, for $L^1$-a.e. $x \in [0,1]$,
$$|\sigma''_{ij}(x)| = |\sigma_i^T(x)D(x)\sigma_j(x)| \le S_0^2\sum_{k=1}^p\sum_{l=1}^p|D_{kl}(x)| \le S_0^2\Big(2\sum_{k,l,i,j}|\theta'_{ki}(x)\theta'_{lj}(x)|\max_{i,j}|\sigma_{ij}(x)| + \sum_{k,l}|\theta''_{kl}(x)|\Big) \le 2S_0^3(s+p)^2S_4 + S_0^2(s+p)S_5,$$
where for $L^1$-a.e. $x$, $\sum_{k,l,i,j}|\theta'_{ki}(x)\theta'_{lj}(x)| \le (s+p)^2S_4$ and $\sum_{k,l}|\theta''_{kl}(x)| \le (s+p)S_5$. The first inequality is due to the following observation: at most $(s+p)^2$ elements in the sum $\sum_{k,l,i,j}\theta'_{ki}(x)\theta'_{lj}(x)$ are nonzero for $L^1$-a.e. $x \in [0,1]$ (that is, except for $E$), due to the fact that for $x \in [0,1]\setminus N$, $\|\Theta'(x)\|_0 \le \|\Theta(x)\|_0 \le s+p$ as in Theorem 26. The second inequality is obtained similarly, using the fact that for $L^1$-a.e. $x$, $\|\Theta''(x)\|_0 \le \|\Theta(x)\|_0 \le s+p$.

By Lemma 28, using the Lebesgue integral, we can derive the following corollary.

Corollary 30 Under A 2, A 5, and A 6,
$$\int_0^1 |\sigma''_{ij}(x)|\,dx \le 2S_0^3S_4(s+p)^2 + S_0^2S_5(s+p) < \infty.$$

Remark 29 For the bad set $E \subset [0,1]$ with $L^1(E) = 0$, $\sigma'_{ij}(x)$ is well defined as shown in Lemma 22, but it can only be loosely bounded by $O(p^2)$, since $\|\Theta'(x)\|_0 = O(p^2)$ rather than $s+p$ for $x \in E$; similarly, $\sigma''_{ij}(x)$ can only be loosely bounded by $O(p^4)$.

7.1 Proof of Theorem 26

Let $\|\Theta(x)\|_0 \le s + p$ for all $x \in [0,1]$. We use the following lemma.

Lemma 31 Let $u : [0,1] \to \mathbb{R}$. Suppose $u$ has a derivative on $F$ (finite or not) with $L^1(u(F)) = 0$. Then $u'(x) = 0$ for $L^1$-a.e. $x \in F$.

Take $F = \{x \in [0,1] : \theta_{ij}(x) = 0\}$ and $u = \theta_{ij}$. For $L^1$-a.e. $x \in F$, that is, except for a set $N_{ij}$ with $L^1(N_{ij}) = 0$, $\theta'_{ij}(x) = 0$. Let $N = \bigcup_{ij} N_{ij}$. By Lemma 31,

Lemma 32 If $x \in [0,1]\setminus N$, where $L^1(N) = 0$, and $\theta_{ij}(x) = 0$, then $\theta'_{ij}(x) = 0$, for all $i,j$.

Now let $v_{ij} = \theta'_{ij}$ and take $F' = \{x \in [0,1] : v_{ij}(x) = 0\}$. For $L^1$-a.e. $x \in F'$, that is, except for a set $N^1_{ij}$ with $L^1(N^1_{ij}) = 0$, $v'_{ij}(x) = 0$. Let $N_1 = \bigcup_{ij} N^1_{ij}$. By Lemma 31,

Lemma 33 If $x \in [0,1]\setminus N_1$, where $L^1(N_1) = 0$, and $\theta'_{ij}(x) = 0$, then $\theta''_{ij}(x) = 0$, for all $i,j$.

This allows us to conclude:

Lemma 34 If $x \in [0,1]\setminus(N \cup N_1)$, where $L^1(N \cup N_1) = 0$, and $\theta_{ij}(x) = 0$, then $\theta'_{ij}(x) = 0$ and $\theta''_{ij}(x) = 0$, for all $i,j$.

Thus for all $x \in [0,1]\setminus(N \cup N_1)$, $\|\Theta''(x)\|_0 \le \|\Theta'(x)\|_0 \le \|\Theta(x)\|_0 \le s + p$.

8 Examples

In this section, we demonstrate the effectiveness of the method in a simulation. Starting at time $t = t_0$, the original graph is as shown at the top of Figure 1. The graph evolves according to a type of Erdős–Rényi random graph model. Initially we set $\Theta = 0.25 I_{p\times p}$, where $p = 50$. Then, we randomly select 50 edges and update $\Theta$ as follows: for each new edge $(i,j)$, a weight $a > 0$ is chosen uniformly at random from $[0.1, 0.3]$; we subtract $a$ from $\theta_{ij}$ and $\theta_{ji}$, and increase $\theta_{ii}$ and $\theta_{jj}$ by $a$. This keeps $\Theta$ positive definite. When we later delete an existing edge from the graph, we reverse the above procedure with its weight. Weights are assigned to the initial 50 edges, and then we change the graph structure periodically as follows: every 200 discrete time steps, five existing edges are deleted, and five new edges are added. However, for each of the five new edges, a target weight is chosen, and the weight on the edge is gradually changed over the ensuing 200 time steps in order to ensure smoothness. Similarly, for each of the five edges to be deleted, the weight gradually decays to zero over the ensuing 200 time steps. Thus, almost always, there are 55 edges in the graph and 10 edges have weights that are varying smoothly.
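A minimal sketch of this graph generator (our paraphrase of the procedure just described; the edge-selection details, the linear weight ramp, and the random seed are illustrative assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_edges, period = 50, 50, 200

def random_edges(k, exclude=frozenset()):
    """Pick k distinct off-diagonal pairs (i, j), i < j, not in `exclude`."""
    edges = set()
    while len(edges) < k:
        i, j = sorted(rng.choice(p, size=2, replace=False))
        if (i, j) not in exclude:
            edges.add((int(i), int(j)))
    return list(edges)

def apply_weight(Theta, edge, a):
    """Add a * (e_i - e_j)(e_i - e_j)^T to Theta: subtract a off-diagonal, add a on-diagonal."""
    i, j = edge
    Theta[i, j] -= a
    Theta[j, i] -= a
    Theta[i, i] += a
    Theta[j, j] += a

# initial graph: Theta = 0.25*I plus 50 random edges with weights in [0.1, 0.3]
Theta = 0.25 * np.eye(p)
weights = {e: rng.uniform(0.1, 0.3) for e in random_edges(n_edges)}
for e, a in weights.items():
    apply_weight(Theta, e, a)

# one period of evolution: 5 edges decay to zero, 5 new edges grow to target weights
dying = [list(weights)[k] for k in rng.choice(len(weights), size=5, replace=False)]
born = random_edges(5, exclude=frozenset(weights))
targets = {e: rng.uniform(0.1, 0.3) for e in born}

Sigmas = []
for step in range(period):
    frac = 1.0 / period                                  # linear change per step
    for e in dying:
        apply_weight(Theta, e, -weights[e] * frac)       # old weight decays smoothly to 0
    for e in born:
        apply_weight(Theta, e, targets[e] * frac)        # new weight grows smoothly to target
    Sigmas.append(np.linalg.inv(Theta))                  # Sigma(t), used to draw Z^t ~ N(0, Sigma(t))
```

Because each update adds (or removes) a positive semi-definite rank-one term, $\Theta$ remains positive definite throughout, as noted above.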
8.1 Regularization Paths

We increase the sample size from $n = 200$ to $400$, $600$, and $800$, and use a Gaussian kernel with bandwidth $h = 5.848/n^{1/3}$. We use the following metrics to evaluate model-consistency risk (3) and predictive risk (4) in Figure 1, as the $\ell_1$ regularization parameter $\lambda$ increases.

· Let $\widehat F_n$ denote the edges in the estimated $\widehat\Theta_n(t_0)$ and $F$ denote the edges in $\Theta(t_0)$. We define
$$\text{precision} = 1 - \frac{|\widehat F_n \setminus F|}{|\widehat F_n|} = \frac{|\widehat F_n \cap F|}{|\widehat F_n|}, \qquad \text{recall} = 1 - \frac{|F \setminus \widehat F_n|}{|F|} = \frac{|\widehat F_n \cap F|}{|F|}.$$
Figure 1 shows how they change with $\lambda$.

· Predictive risks in (4) are plotted for both the oracle estimator (6) and the empirical estimators (7) for each $n$. They are indexed by the $\ell_1$ norm of the various estimators vectorized; hence the $|\cdot|_1$ values for $\widehat\Theta_n(t_0)$ and for the oracle estimator at $t_0$ coincide along a vertical line. Note that the oracle's $|\cdot|_1$ is at most $|\Theta(t_0)|_1$ for all $\lambda \ge 0$; for every estimator (the oracle or the empirical one), $|\cdot|_1$ decreases as $\lambda$ increases, as shown in Figure 1 for $|\widehat\Theta_{200}(t_0)|_1$.

[Figure 1 here: panels "Original Graph", "Precision" (rate % vs. $\lambda$), "Recall" (rate % vs. $\lambda$), and "Risk" (predictive risk vs. $|\widetilde\Theta|_1$), with curves for $n = 200, 400, 600, 800$ and the oracle.]

Figure 1: Plots from top to bottom show that as the penalization parameter $\lambda$ increases, precision goes up, and then down as no edges are predicted in the end. Recall goes down as the estimated graphs are missing more and more edges. The oracle performs the best, given the same value of $|\cdot|_1$ for the estimators, for all $n$.

Figure 2 shows a subsequence of estimated graphs as $\lambda$ increases for sample size $n = 200$. The original graph at $t_0$ is shown in Figure 1.

[Figure 2 here: a $3\times 3$ array of graphs; rows $\lambda = 0.14, 0.2, 0.24$ (all $n = 200$), columns $G(p, \widehat F_n)$, $G(p, \widehat F_n\setminus F)$, $G(p, F\setminus\widehat F_n)$.]

Figure 2: $n = 200$ and $h = 1$, with $\lambda = 0.14, 0.2, 0.24$ indexing each row. The three columns show the sets of edges in $\widehat F_n$, the extra edges, and the missing edges with respect to the true graph $G(p, F)$. This array of plots shows that $\ell_1$ regularization is effective in selecting the subset of edges in the true model $\Theta(t_0)$, even when the samples before $t_0$ were from graphs that evolved over time.

[Figure 3 here: appearance times of the new edges added at $t = 0$ (top row) and removal times of the replaced edges (bottom row).]

Figure 3: There are 400 discrete steps in $[0,1]$ such that the edge set $F(t)$ remains unchanged before or after $t = 0.5$. This sequence of plots shows the times at which each of the new edges added at $t = 0$ appears in the estimated graph (top row), and the times at which each of the old edges being replaced is removed from the estimated graph (bottom row), where the weight decreases from a positive value in $[0.1, 0.3]$ to zero during the time interval $[0, 0.5]$. Solid and dashed lines denote new and old edges respectively.

8.2 Chasing the Changes

Finally, we show how quickly the smoothed estimator using GLASSO [FHT07] can include the edges that are being added at the beginning of the interval $[0,1]$, and get rid of the edges being replaced, whose weights start to decrease at $x = 0$ and become 0 at $x = 0.5$, in Figure 3.
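The precision and recall metrics of Section 8.1 reduce to simple set operations on edge sets; a brief sketch (ours) is:

```python
def edge_set(Theta, tol=1e-8):
    """Edges implied by a precision matrix: pairs (i, j), i < j, with |Theta_ij| > tol."""
    p = len(Theta)
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(Theta[i][j]) > tol}

def precision_recall(est_edges, true_edges):
    """precision = |F_hat & F| / |F_hat|, recall = |F_hat & F| / |F|, as in Section 8.1."""
    hit = len(est_edges & true_edges)
    precision = hit / len(est_edges) if est_edges else 1.0
    recall = hit / len(true_edges) if true_edges else 1.0
    return precision, recall
```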
9 Conclusions and Extensions

We have shown that if the covariance changes smoothly over time, then minimizing an $\ell_1$-penalized kernel risk function leads to good estimates of the covariance matrix. This, in turn, allows estimation of time varying graphical structure. The method is easy to apply and is feasible in high dimensions. We are currently addressing several extensions to this work. First, with stronger conditions we expect that we can establish sparsistency, that is, we recover the edges with probability approaching one. Second, we can relax the smoothness assumption using nonparametric changepoint methods [GH02] which allow for jumps. Third, we used a very simple time series model; extensions to more general time series models are certainly feasible.

References

[BGd08] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485–516, March 2008.

[BL08] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 2008. To appear.

[DP04] M. Drton and M. D. Perlman. Model selection for Gaussian concentration graphs. Biometrika, 91(3):591–602, 2004.

[FHT07] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.

[GH02] G. Grégoire and Z. Hamrouni. Change point estimation by local linear smoothing. Journal of Multivariate Analysis, 83:56–83, 2002.

[GR04] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10:971–988, 2004.

[LF07] C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrices estimation, 2007. arXiv:0711.3933v1.

[MB06] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

[RBLZ07] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Technical Report 467, Dept. of Statistics, Univ. of Michigan, 2007.

A Large Deviation Inequalities for Boxcar Kernel Function

In this section, we prove the following lemma, which implies the i.i.d. case as in the corollary.

Lemma 35 Using a boxcar kernel that weighs uniformly over $n$ samples $Z_k \sim N(0, \Sigma(k))$, $k = 1,\ldots,n$, that are independently but not identically distributed, we have, for $\epsilon$ small enough and some $c_2 > 0$,
$$P\big(|S_n(t,i,j) - \mathbb{E} S_n(t,i,j)| > \epsilon\big) \le \exp\big(-c_2 n\epsilon^2\big).$$

Corollary 36 For the i.i.d. case, for some $c_3 > 0$,
$$P\big(|S_n(i,j) - \mathbb{E} S_n(i,j)| > \epsilon\big) \le \exp\big(-c_3 n\epsilon^2\big).$$

Lemma 35 is implied by Lemma 37 for diagonal entries, and Lemma 38 for non-diagonal entries.

A.1 Inequalities for Squared Sums of Independent Normals with Changing Variances

Throughout this section, we use $\sigma_i^2$ as a shorthand for $\sigma_{ii}$ as before. Hence $\sigma_i^2(x_k) = \mathrm{Var}(Z_{k,i}) = \sigma_{ii}(x_k)$, for all $k = 1,\ldots,n$. Ignoring the bias term as in (29), we wish to show that each of the diagonal entries of $\widetilde\Sigma$ is close to $\sigma_i^2(x_0)$, $i = 1,\ldots,p$. For a boxcar kernel that weighs uniformly over the $n$ samples, we mean strictly $\ell_k(x_0) = 1/n$, for all $k = 1,\ldots,n$, and $h = 1$ in (28) in this context. We omit the mention of $i$ or $t$ in all symbols from here on. The following lemma might be of independent interest; hence we include it here. We omit the proof due to its similarity to that of Lemma 15.

Lemma 37 Let $z_1,\ldots,z_n$ represent a sequence of independent Gaussian random variables with $z_k \sim N(0, \sigma^2(x_k))$. Let $\bar\sigma^2 = \frac{1}{n}\sum_{k=1}^n \sigma^2(x_k)$. Using a boxcar kernel that weighs uniformly over the $n$ samples, for $\epsilon < c\bar\sigma^2$ for some $c \ge 2$, we have
$$P\Big(\Big|\frac{1}{n}\sum_{k=1}^n z_k^2 - \bar\sigma^2\Big| > \epsilon\Big) \le 2\exp\Big(-\frac{(3c-5)\,n\epsilon^2}{3c^2\cdot 2\sigma^4_{\max}}\Big),$$
where $\sigma^2_{\max} = \max_{k=1,\ldots,n}\{\sigma^2(x_k)\}$.
A.2 Inequalities for Independent Sums of Products of Correlated Normals

The proof of Lemma 38 follows that of Lemma 15.

Lemma 38 Let $\bar\sigma^2 = \frac{1}{n}\sum_{k=1}^n \frac{\sigma_i^2(x_k)\sigma_j^2(x_k) + \sigma_{ij}^2(x_k)}{2}$ and $c_4 = \frac{3}{20\bar\sigma^2}$. Using a boxcar kernel that weighs uniformly over the $n$ samples, for $\epsilon \le \bar\sigma^2/\max_k\big(\sigma_i(x_k)\sigma_j(x_k)\big)$,
$$P\big(|S_n(t,i,j) - \mathbb{E} S_n(t,i,j)| > \epsilon\big) \le \exp\big(-c_4 n\epsilon^2\big).$$

B Proofs for Large Deviation Inequalities

B.1 Proof of Claim 16

We show one inequality; the other one is bounded similarly. For each $k$, we compare the $k$th elements $\mu_{2,k}$, $\mu_{4,k}$ that appear in the sums for $\mu_2$ and $\mu_4$ respectively:
$$\frac{\mu_{4,k}}{\mu_{2,k}} = \frac{(a_k^4 + b_k^4)\,4t^2}{(a_k^2 + b_k^2)\,8t^4} = \Big(\frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_i(x_k)\sigma_j(x_k)\Big)^2\cdot\frac{(1+\rho_{ij}(x_k))^4 + (1-\rho_{ij}(x_k))^4}{4(1+\rho_{ij}^2(x_k))}$$
$$\le \max_k\Big(\frac{2}{h}K\Big(\frac{x_k - x_0}{h}\Big)\sigma_i(x_k)\sigma_j(x_k)\Big)^2\cdot\max_{0\le\rho\le 1}\frac{(1+\rho)^4 + (1-\rho)^4}{4(1+\rho^2)} = 2M^2.$$

B.2 Proof of Lemma 17

We first use the Taylor expansions to obtain
$$\ln(1 - a_k) = -a_k - \frac{a_k^2}{2} - \frac{a_k^3}{3} - \frac{a_k^4}{4} - \sum_{l\ge 5}\frac{a_k^l}{l}, \quad \text{for } a_k < 1/2,$$
and similarly $\ln(1 + b_k) = \sum_{l\ge 1}(-1)^{l-1}\frac{b_k^l}{l}$, where $\sum_{l\ge 4}(-1)^l\frac{b_k^l}{l} > 0$ and $\sum_{l\ge 5}(-1)^l\frac{b_k^l}{l} < 0$. Moreover,
$$\sum_{l\ge 5}\frac{a_k^l}{l} \le \frac{1}{5}\cdot\frac{a_k^5}{1 - a_k} \le \frac{2a_k^5}{5} \le \frac{a_k^4}{5}.$$
Hence for $b_k \le a_k \le \frac{1}{2}$, for all $k$,
$$\frac{1}{2}\sum_{k=1}^n \ln\frac{1}{(1-a_k)(1+b_k)} \le \sum_{k=1}^n\Big(\frac{a_k - b_k}{2} + \frac{a_k^2 + b_k^2}{4} + \frac{a_k^3 - b_k^3}{6} + \frac{9}{5}\cdot\frac{a_k^4 + b_k^4}{8}\Big) = nt\mu_1 + nt^2\mu_2 + nt^3\mu_3 + \frac{9}{5}nt^4\mu_4.$$

B.3 Proof of Claim 18

We replace the sum with a Riemann integral, and then use Taylor's formula to replace $\sigma_i(x_k)$, $\sigma_j(x_k)$, and $\sigma_{ij}(x_k)$:
$$\mu_2(i,j) = \frac{1}{n}\sum_{k=1}^n \frac{2}{h^2}K^2\Big(\frac{x_k - x_0}{h}\Big)\big(\sigma_i^2(x_k)\sigma_j^2(x_k) + \sigma_{ij}^2(x_k)\big) \approx \frac{2}{h^2}\int_0^{x_n} K^2\Big(\frac{u - x_0}{h}\Big)\big(\sigma_i^2(u)\sigma_j^2(u) + \sigma_{ij}^2(u)\big)\,du$$
$$= \frac{2}{h}\int_{-1/h}^{0} K^2(v)\big(\sigma_i^2(x_0 + hv)\sigma_j^2(x_0 + hv) + \sigma_{ij}^2(x_0 + hv)\big)\,dv$$
$$= \frac{2}{h}\int_{-1}^{0} K^2(v)\Big[\Big(\sigma_i^2(x_0) + hv(\sigma_i^2)'(x_0) + \frac{(\sigma_i^2)''(y_1)(hv)^2}{2}\Big)\Big(\sigma_j^2(x_0) + hv(\sigma_j^2)'(x_0) + \frac{(\sigma_j^2)''(y_2)(hv)^2}{2}\Big) + \Big(\sigma_{ij}(x_0) + hv\,\sigma'_{ij}(x_0) + \frac{\sigma''_{ij}(y_3)(hv)^2}{2}\Big)^2\Big]dv$$
$$= \frac{2}{h}\int_{-1}^{0} K^2(v)\big(1 + \rho_{ij}^2(x_0)\big)\sigma_i^2(x_0)\sigma_j^2(x_0)\,dv + C_2\int_{-1}^{0} vK^2(v)\,dv + O(h) \approx \frac{C_1\big(1 + \rho_{ij}^2(x_0)\big)\sigma_i^2(x_0)\sigma_j^2(x_0)}{h},$$
where $|y_1 - x_0|, |y_2 - x_0|, |y_3 - x_0| \le h|v|$ and $C_1, C_2$ are constants chosen so that all equalities hold. Since $\big(1 + \rho_{ij}^2(x_0)\big)\sigma_i^2(x_0)\sigma_j^2(x_0) = \sigma_i^2(x_0)\sigma_j^2(x_0) + \sigma_{ij}^2(x_0)$, the claim follows.