Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

Guangzhi Cao, Charles A. Bouman
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907
{gcao, bouman}@purdue.edu

Abstract

Covariance estimation for high dimensional vectors is a classically difficult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is positive definite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes.

1 Introduction

Many problems in statistical pattern recognition and analysis require the classification and analysis of high dimensional data vectors. However, covariance estimation for high dimensional vectors is a classically difficult problem because the number of coefficients in the covariance grows as the dimension squared [1, 2]. This problem, sometimes referred to as the curse of dimensionality [3], presents a classic dilemma in statistical pattern analysis and machine learning. In a typical application, one measures n versions of a p dimensional vector. If n < p, then the sample covariance matrix will be singular, with p - n eigenvalues equal to zero. Over the years, a variety of techniques have been proposed for computing a nonsingular estimate of the covariance; regularized and shrinkage covariance estimators [4, 5, 6, 7, 8, 9, 10] are examples of such techniques.

In this paper, we propose a new approach to covariance estimation, which is based on constrained maximum likelihood (ML) estimation of the covariance [15]. In particular, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT) [11]. The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations [12]. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The estimator obtained using this method is always positive definite and well-conditioned, even when the sample size is limited.

In order to validate our model, we perform experiments using a standard set of hyperspectral data [13], and we compare against both traditional shrinkage estimates and recently proposed graphical lasso estimates [14] for a variety of different classes and sample sizes. Our experiments show that, for this example, the SMT covariance estimate is consistently more accurate. The SMT method also has a number of other advantages. It seems to be particularly good at estimating small eigenvalues and their associated eigenvectors.
The cross-validation procedure used to estimate the SMT model order requires little additional computation, and the resulting eigen decomposition can be computed with very little computation (i.e., p^2 operations).

2 Covariance estimation for high dimensional vectors

In the general case, we observe a set of n vectors, y_1, y_2, ..., y_n, where each vector, y_i, is p dimensional. Without loss of generality, we assume y_i has zero mean. We can represent this data as the following p × n matrix

$Y = [y_1, y_2, \cdots, y_n]$ .  (1)

If the vectors y_i are identically distributed, then the sample covariance is given by

$S = \frac{1}{n} Y Y^t$ ,  (2)

and S is an unbiased estimate of the true covariance matrix¹,

$R = E[y_i y_i^t] = E[S]$ .  (3)

¹ If the sample mean is used as an estimate of the mean, then (n/(n-1))S is the unbiased estimate of the covariance.

While S is an unbiased estimate of R, it is also singular when n < p. This is a serious deficiency since, as the dimension p grows, the number of vectors needed to estimate R also grows. In practical applications, n may be much smaller than p, which means that most of the eigenvalues of R are erroneously estimated as zero.

A variety of methods have been proposed to regularize the estimate of R so that it is not singular. Shrinkage estimators are a widely used class of estimators which regularize the covariance matrix by shrinking it toward some target structure [4, 5, 6, 7, 8]. Shrinkage estimators generally have the form R̂ = αD + (1 - α)S, where D is some positive definite matrix. Some popular choices for D are the identity matrix (or its scaled version) [5, 8, 6] and the diagonal entries of S, i.e. diag(S) [5, 8]. In both cases, the shrinkage intensity α can be estimated using cross-validation or bootstrap methods. Recently, a number of methods have been proposed for regularizing the estimate by making either the covariance or its inverse sparse [9, 10, 14]. For example, the graphical lasso method enforces sparsity by imposing an L1 norm constraint on the inverse covariance [14]. Banding or thresholding can also be used to obtain a sparse estimate of the covariance [10].

2.1 Maximum likelihood covariance estimation

Our approach will be to compute a constrained maximum likelihood (ML) estimate of the covariance R, under the modeling assumption that the eigenvectors of R may be represented as a sparse matrix transform (SMT). So we first decompose R as

$R = E \Lambda E^t$ ,  (4)

where E is the orthonormal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. Then we will estimate the covariance by maximizing the likelihood of the data Y subject to the constraint that E is an SMT. By varying the order, K, of the SMT, we may then reduce or increase the regularizing constraint on the covariance.

If we assume that the columns of Y are independent and identically distributed Gaussian random vectors with mean zero and positive-definite covariance R, then the likelihood of Y given R is given by

$p_R(Y) = \frac{1}{(2\pi)^{np/2} |R|^{n/2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\{ Y^t R^{-1} Y \} \right\}$ .  (5)

The log-likelihood of Y is then given by [15]

$\log p_{(E,\Lambda)}(Y) = -\frac{n}{2} \mathrm{tr}\{ \mathrm{diag}(E^t S E) \Lambda^{-1} \} - \frac{n}{2} \log|\Lambda| - \frac{np}{2} \log(2\pi)$ ,  (6)

where R = E Λ E^t is specified by the orthonormal eigenvector matrix E and the diagonal eigenvalue matrix Λ. Jointly maximizing the likelihood with respect to E and Λ then results in the ML estimates of E and Λ given by [15]

$\hat{E} = \arg\min_{E \in \Omega} \left| \mathrm{diag}(E^t S E) \right|$  (7)

$\hat{\Lambda} = \mathrm{diag}(\hat{E}^t S \hat{E})$ ,  (8)

where Ω is the set of allowed orthonormal transforms.
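As a concrete illustration of (6)-(8), the following minimal Python sketch evaluates the log-likelihood (6) for a candidate orthonormal matrix E, with Λ set to its ML value (8). This is our own illustration rather than code from the paper; the names sample_covariance and log_likelihood are hypothetical, and Y is assumed to be a p × n array of zero-mean samples.

```python
import numpy as np

def sample_covariance(Y):
    """Sample covariance S = (1/n) Y Y^t of eq. (2); Y is p x n, zero mean."""
    return Y @ Y.T / Y.shape[1]

def log_likelihood(Y, E):
    """Gaussian log-likelihood (6) of Y under R = E Lambda E^t, with Lambda
    set to its ML value diag(E^t S E) from eq. (8)."""
    p, n = Y.shape
    S = sample_covariance(Y)
    lam = np.diag(E.T @ S @ E)  # eq. (8); assumed strictly positive here
    # With Lambda at its ML value, tr{diag(E^t S E) Lambda^{-1}} = p, so
    # eq. (6) reduces to -(n/2) * (p + sum(log lam) + p * log(2*pi)).
    return -0.5 * n * (p + np.sum(np.log(lam)) + p * np.log(2 * np.pi))
```

Substituting (8) into (6) shows that maximizing the likelihood over E is equivalent to minimizing |diag(E^t S E)|, which is exactly the criterion in (7).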
So we may compute the ML estimate by first solving the constrained optimization of (7), and then computing the eigenvalue estimates from (8).

2.2 ML estimation of eigenvectors using the SMT model

The ML estimate of E can be improved if the feasible set of eigenvector transforms, Ω, can be constrained to a subset of all possible orthonormal transforms. By constraining Ω, we effectively regularize the ML estimate by imposing a model. However, as with any model-based approach, the key is to select a feasible set, Ω, which is as small as possible while still accurately modeling the behavior of the data.

Our approach is to select Ω to be the set of all orthonormal transforms that can be represented as an SMT of order K [11]. More specifically, a matrix E is an SMT of order K if it can be written as a product of K sparse orthonormal matrices, so that

$E = \prod_{k=0}^{K-1} E_k = E_0 E_1 \cdots E_{K-1}$ ,  (9)

where every sparse matrix, E_k, is a Givens rotation operating on a pair of coordinate indices (i_k, j_k) [12]. Every Givens rotation E_k is an orthonormal rotation in the plane of the two coordinates, i_k and j_k, which has the form

$E_k = I + \Theta(i_k, j_k, \theta_k)$ ,  (10)

where $\Theta(i_k, j_k, \theta_k)$ is defined by

$[\Theta]_{ij} = \begin{cases} \cos(\theta_k) - 1 & \text{if } i = j = i_k \text{ or } i = j = j_k \\ \sin(\theta_k) & \text{if } i = i_k \text{ and } j = j_k \\ -\sin(\theta_k) & \text{if } i = j_k \text{ and } j = i_k \\ 0 & \text{otherwise} \end{cases}$ .  (11)

Figure 1 shows the flow diagram for the application of an SMT to a data vector y. Notice that each 2D rotation, E_k, plays a role analogous to a "butterfly" used in a traditional fast Fourier transform (FFT) [16]. However, unlike an FFT, the organization of the butterflies in an SMT is unstructured, and each butterfly can have an arbitrary rotation angle θ_k. This more general structure allows an SMT to implement a larger set of orthonormal transformations. In fact, the SMT can be used to represent any orthonormal wavelet transform, because orthonormal wavelets can be factorized into a product of Givens rotations and delays [17]. More generally, when $K = \binom{p}{2}$, the SMT can be used to exactly represent any p × p orthonormal transformation [15].

Figure 1: (a) 8-point FFT. (b) The SMT implementation of ỹ = E y. The SMT can be viewed as a generalization of the FFT and orthonormal wavelets.
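To make the structure of (9)-(11) concrete, here is a short sketch (our own; the helper names givens_rotation and apply_smt are hypothetical) that builds one butterfly as a dense matrix and applies an order-K SMT to a vector using only two multiplies per rotation.

```python
import numpy as np

def givens_rotation(p, i, j, theta):
    """Dense p x p Givens rotation E_k = I + Theta(i_k, j_k, theta_k), eqs. (10)-(11)."""
    E = np.eye(p)
    c, s = np.cos(theta), np.sin(theta)
    E[i, i] = c   # identity's 1 plus (cos(theta_k) - 1)
    E[j, j] = c
    E[i, j] = s   # sin(theta_k) at (i_k, j_k)
    E[j, i] = -s  # -sin(theta_k) at (j_k, i_k)
    return E

def apply_smt(y, rotations):
    """Compute E y for an order-K SMT E = E_0 E_1 ... E_{K-1}.

    rotations -- list of (i, j, theta) triples, one per butterfly.
    Each butterfly touches only two coordinates, so the cost is O(K)
    rather than the O(p^2) of a dense matrix-vector product.
    """
    y = np.asarray(y, dtype=float).copy()
    for i, j, theta in reversed(rotations):  # E_{K-1} acts on y first
        c, s = np.cos(theta), np.sin(theta)
        y[i], y[j] = c * y[i] + s * y[j], -s * y[i] + c * y[j]
    return y
```

The analysis direction E^t y, used later for the eigen transformation, applies the same butterflies in the opposite order with each angle negated.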
Using the SMT model constraint, the ML estimate of E is given by

$\hat{E} = \arg\min_{E = \prod_{k=0}^{K-1} E_k} \left| \mathrm{diag}(E^t S E) \right|$ .  (12)

Unfortunately, evaluating the constrained ML estimate of (12) requires the solution of an optimization problem with a nonconvex constraint, so evaluation of the globally optimal solution is difficult. Therefore, our approach will be to use greedy minimization to compute a locally optimal solution to (12). The greedy minimization approach works by selecting each new butterfly E_k to minimize the cost, while fixing the previous butterflies, E_l for l < k.

This greedy optimization algorithm can be implemented with the following simple recursive procedure. We start by setting S_0 = S to be the sample covariance, and initialize k = 0. Then we apply the following two steps for k = 0 to K - 1:

$E_k = \arg\min_{E_k} \left| \mathrm{diag}(E_k^t S_k E_k) \right|$  (13)

$S_{k+1} = E_k^t S_k E_k$ .  (14)

The resulting values of E_k are the butterflies of the SMT. The problem remains of how to compute the solution to (13). In fact, this can be done quite easily by first determining the two coordinates, i_k and j_k, that are most correlated,

$(i_k, j_k) = \arg\min_{(i,j)} \left( 1 - \frac{[S_k]_{ij}^2}{[S_k]_{ii} [S_k]_{jj}} \right)$ .  (15)

Note that (i_k, j_k) is the coordinate pair that reduces the cost in (13) the most among all possible pairs [15]. Once i_k and j_k are determined, we apply the Givens rotation E_k that minimizes the cost in (13), which is given by

$E_k = I + \Theta(i_k, j_k, \theta_k)$ ,  (16)

where

$\theta_k = \frac{1}{2} \mathrm{atan}\left( -2[S_k]_{i_k j_k},\ [S_k]_{i_k i_k} - [S_k]_{j_k j_k} \right)$ .  (17)

By iterating (13) and (14) K times, we obtain the constrained ML estimate of E given by

$\hat{E} = \prod_{k=0}^{K-1} E_k$ .  (18)

The model order, K, can be determined by a simple cross-validation procedure. After partitioning the data into three subsets, K is chosen to maximize the average likelihood of the left-out subsets given the covariance estimated from the other two subsets. Once K is determined, the proposed covariance estimator is re-computed using all the data and the estimated model order.

The SMT covariance estimator obtained in this way has some interesting properties. First, it is positive definite even for a limited sample size n < p. Also, it is permutation invariant; that is, the covariance estimator does not depend on the ordering of the data. Finally, the eigen transformation Ê^t y can be computed very efficiently by applying the K sparse rotations in sequence.

2.3 SMT Shrinkage Estimator

In some cases, the accuracy of the SMT estimator can be improved by shrinking it towards the sample covariance. Let R̂_SMT represent the SMT covariance estimator. Then the SMT shrinkage estimate (SMT-S) can be obtained as

$\hat{R}_{SMT\text{-}S} = \alpha \hat{R}_{SMT} + (1 - \alpha) S$ ,  (19)

where the parameter α can be efficiently computed using cross-validation [5].
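Putting (13)-(18) together, the following is a minimal sketch of the greedy estimation loop, reusing the hypothetical givens_rotation helper above. It is our own rendering of the procedure rather than the authors' released MATLAB code: it forms dense rotations for clarity instead of exploiting their sparsity, omits the cross-validation search for K, and reads the two-argument atan in (17) as the standard atan2 convention.

```python
import numpy as np

def smt_covariance(S, K):
    """Greedy SMT covariance estimate, eqs. (13)-(18).

    S -- p x p sample covariance; K -- SMT model order.
    Returns (E, lam): the eigenvector estimate E = E_0 E_1 ... E_{K-1}
    and the eigenvalue estimates lam = diag(E^t S E).
    """
    p = S.shape[0]
    Sk = S.copy()
    E = np.eye(p)
    for _ in range(K):
        # Eq. (15): the most correlated pair minimizes
        # 1 - [S_k]_{ij}^2 / ([S_k]_{ii} [S_k]_{jj}), i.e. maximizes the ratio.
        d = np.diag(Sk)
        ratio = Sk**2 / np.outer(d, d)
        np.fill_diagonal(ratio, -np.inf)        # exclude i == j
        i, j = np.unravel_index(np.argmax(ratio), ratio.shape)
        # Eq. (17): the angle that zeros [S_k]_{ij} after the rotation.
        theta = 0.5 * np.arctan2(-2.0 * Sk[i, j], Sk[i, i] - Sk[j, j])
        Ek = givens_rotation(p, i, j, theta)    # eq. (16)
        Sk = Ek.T @ Sk @ Ek                     # eq. (14)
        E = E @ Ek                              # accumulates eq. (18)
    return E, np.diag(Sk)
```

The covariance estimate is then R̂_SMT = E diag(lam) E^t, and the SMT-S estimate (19) shrinks it toward S with a cross-validated weight α.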
3 Experimental results

The effectiveness of the SMT covariance estimation procedure depends on how well the SMT model can capture the behavior of real data vectors. Therefore, in this section, we compare the effectiveness of the SMT covariance estimator to commonly used shrinkage and graphical lasso estimators. We make this comparison using hyperspectral remotely sensed data as our high dimensional data vectors. The hyperspectral data we use is available with the recently published book [13].

Figure 2(a) shows a simulated color IR view of an airborne hyperspectral data flightline over the Washington DC Mall. The sensor system measured the pixel response in 191 effective bands in the 0.4 to 2.4 µm region of the visible and infrared spectrum. The data set contains 1208 scan lines with 307 pixels in each scan line. The image was made using bands 60, 27, and 17 for the red, green, and blue colors, respectively. The data set also provides ground truth pixels for five classes designated as grass, water, roof, street, and tree. In Figure 2(a), the ground-truth pixels of the grass class are outlined with a white rectangle.

For each class, we computed the "true" covariance by using all the ground truth pixels to calculate the sample covariance. The covariance is computed by first subtracting the sample mean vector for each class, and then computing the sample covariance of the zero mean vectors. The numbers of pixels for the ground-truth classes of grass, water, roof, street, and tree are 1928, 1224, 3579, 416, and 388, respectively. In each case, the number of ground truth pixels is much larger than 191, so the true covariance matrices are nonsingular and accurately represent the covariance of the hyperspectral data for each class. Figure 2(b) shows the spectra of the grass pixels, and Fig. 2(c) shows multivariate Gaussian vectors that were generated using the measured sample covariance for the grass class.

Figure 2: (a) Simulated color IR view of airborne hyperspectral data over the Washington DC Mall [13]. (b) Ground-truth pixel spectra of the grass class outlined with the white rectangle in (a). (c) Synthesized data spectra using the Gaussian distribution.

3.1 Review of alternative estimators

A popular choice of the shrinkage target is the diagonal of S [5, 8]. In this case, the shrinkage estimator is given by

$\hat{R} = \alpha \, \mathrm{diag}(S) + (1 - \alpha) S$ .  (20)

We use an efficient algorithmic implementation of the leave-one-out likelihood (LOOL) cross-validation method to choose α, as suggested in [5].

An alternative estimator is the graphical lasso (glasso) recently proposed in [14], which is an L1-regularized maximum likelihood estimate, i.e.,

$\hat{R} = \arg\max_{R \in \Omega} \left\{ \log p(Y \,|\, R) - \rho \| R^{-1} \|_1 \right\}$ ,  (21)

where Ω denotes the set of p × p positive definite matrices and ρ is the regularization parameter. We used the R code for glasso that is publicly available online. We found cross-validation estimation of ρ to be difficult, so in each case we manually selected the value of ρ that minimizes the Kullback-Leibler distance to the known covariance.
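For comparison, here is a minimal sketch of the shrinkage estimator (20). Rather than the efficient leave-one-out likelihood algorithm of [5], it selects α by a simple grid search on a held-out split of the samples; the grid, the split, and the name shrinkage_estimate are our own simplifications.

```python
import numpy as np

def shrinkage_estimate(Y, alphas=np.linspace(0.05, 0.95, 19)):
    """Shrinkage estimator of eq. (20): R = alpha*diag(S) + (1-alpha)*S.

    alpha is chosen to maximize the Gaussian log-likelihood of a held-out
    half of the columns of Y (a crude stand-in for the LOOL method of [5]).
    """
    p, n = Y.shape
    Y_fit, Y_val = Y[:, :n // 2], Y[:, n // 2:]
    S_fit = Y_fit @ Y_fit.T / Y_fit.shape[1]
    best_ll, best_alpha = -np.inf, alphas[0]
    for a in alphas:
        R = a * np.diag(np.diag(S_fit)) + (1 - a) * S_fit
        _, logdet = np.linalg.slogdet(R)
        ll = -0.5 * (np.trace(Y_val.T @ np.linalg.solve(R, Y_val))
                     + Y_val.shape[1] * logdet)
        if ll > best_ll:
            best_ll, best_alpha = ll, a
    S = Y @ Y.T / n
    return best_alpha * np.diag(np.diag(S)) + (1 - best_alpha) * S
```

Any α in (0, 1] yields a positive definite estimate whenever the diagonal of S is strictly positive, which is what makes this regularization effective when n < p.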
3.2 Gaussian case

First, we compare how the different estimators perform when the data vectors are samples from an ideal multivariate Gaussian distribution. To do this, we first generated zero mean multivariate Gaussian vectors with the true covariance for each of the five classes. Next, we estimated the covariance using the four methods: the shrinkage estimator, glasso, SMT, and SMT shrinkage (SMT-S). In order to determine the effect of sample size, we performed each experiment for sample sizes of n = 80, 40, and 20. Every experiment was repeated 10 times.

In order to get an aggregate assessment of the effectiveness of SMT covariance estimation, we compared the estimated covariance for each method to the true covariance using the Kullback-Leibler (KL) distance [15]. The KL distance is a measure of the error between the estimated and true distributions. Figures 3(a), (b), and (c) show plots of the KL distances as a function of sample size for the four estimators. The error bars indicate the standard deviation of the KL distance due to random variation in the sample statistics. Notice that the SMT shrinkage (SMT-S) estimator is consistently the best of the four.

Figure 4(a) shows the estimated eigenvalues for the grass class with n = 80. Notice that the eigenvalues of the SMT and SMT-S estimators are much closer to the true values than those of the shrinkage or glasso methods. In particular, the SMT estimators generate good estimates of the small eigenvalues.

Table 1 compares the computational complexity and CPU time of the four estimators. The CPU time is measured for the Gaussian case of the grass class with n = 80. Notice that even with cross-validation, the SMT and SMT-S estimators are much faster than glasso. This is because the SMT transform is a sparse operator. In this case, the SMT uses an average of K = 495 rotations, which is equal to K/p = 495/191 = 2.59 rotations (or equivalently multiplies) per spectral sample.

3.3 Non-Gaussian case

In practice, the sample vectors may not be from an ideal multivariate Gaussian distribution. In order to see the effect of non-Gaussian statistics on the accuracy of the covariance estimate, we performed a set of experiments which used random samples from the ground truth pixels as input. Since these samples are from the actual measured data, their distribution is not precisely Gaussian. Using these samples, we computed the covariance estimates for the five classes using the four different methods with sample sizes of n = 80, 40, and 20.

Plots of the KL distances for the non-Gaussian case² are shown in Figures 3(d), (e), and (f), and Figure 4(b) shows the estimated eigenvalues for grass with n = 80. Note that the results are similar to those found for the ideal Gaussian case.

² In fact, these are the KL distances between the estimated covariance and the sample covariance computed from the full set of training data, under the assumption of a multivariate Gaussian distribution.

Figure 3: Kullback-Leibler distance from the true distribution versus sample size for various classes: (a) grass, (b) water, (c) street for the Gaussian case; (d) grass, (e) water, (f) street for the non-Gaussian case.

Figure 4: The distribution of estimated eigenvalues for the grass class with n = 80: (a) Gaussian case; (b) non-Gaussian case.

4 Conclusion

We have proposed a novel method for covariance estimation of high dimensional data. The new method is based on constrained maximum likelihood (ML) estimation in which the eigenvector transformation is constrained to be the composition of K Givens rotations. This model seems to capture the essential behavior of the data with a relatively small number of parameters. The constraint set is a K dimensional manifold in the space of orthonormal transforms, but since it is not a linear space, the resulting ML estimation optimization problem does not yield a closed form global optimum. However, we show that a recursive local optimization procedure is simple, intuitive, and yields good results.

We also demonstrate that the proposed SMT covariance estimation method substantially reduces the error in the covariance estimate as compared to current state-of-the-art estimates for a standard hyperspectral data set. The MATLAB code for SMT covariance estimation is available at: https://engineering.purdue.edu/bouman/publications/pub_smt.html.

Acknowledgments

This work was supported by the National Science Foundation under Contract CCR-0431024.
Estimator        Complexity   CPU time (seconds)
Shrinkage Est.   p^3          8.6 (with cross-validation)
glasso           Ip^3         422.6 (without cross-validation)
SMT              Kp           6.5 (with cross-validation)
SMT-S            Kp + p^3     7.2 (with cross-validation)

Table 1: Comparison of computational complexity and CPU time for the various covariance estimators. The complexity does not include the computation of the sample covariance, which has order np^2; this overhead is the same for each estimator. The CPU time is measured for the Gaussian case of the grass class with n = 80. I is the number of iterations used in glasso.

References

[1] C. Stein, B. Efron, and C. Morris, "Improving the usual estimator of a normal covariance matrix," Dept. of Statistics, Stanford University, Report 37, 1972.

[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Boston, MA: Academic Press, 1990.

[3] A. K. Jain, R. P. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000.

[4] J. H. Friedman, "Regularized discriminant analysis," Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.

[5] J. P. Hoffbeck and D. A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 7, pp. 763–767, 1996.

[6] M. J. Daniels and R. E. Kass, "Shrinkage estimators for covariance matrices," Biometrics, vol. 57, no. 4, pp. 1173–1184, 2001.

[7] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," J. Multivar. Anal., vol. 88, no. 2, pp. 365–411, 2004.

[8] J. Schafer and K. Strimmer, "A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics," Statistical Applications in Genetics and Molecular Biology, vol. 4, no. 1, 2005.

[9] R. Furrer and T. Bengtsson, "Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants," J. Multivar. Anal., vol. 98, no. 2, pp. 227–255, 2007.

[10] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Annals of Statistics, vol. 36, no. 1, pp. 199–227, 2008.

[11] G. Cao, C. A. Bouman, and K. J. Webb, "Non-iterative MAP reconstruction using sparse matrix representations," IEEE Trans. on Image Processing, submitted.

[12] W. Givens, "Computation of plane unitary rotations transforming a general matrix to triangular form," Journal of the Society for Industrial and Applied Mathematics, vol. 6, no. 1, pp. 26–50, March 1958.

[13] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. New York: Wiley-Interscience, 2005.

[14] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.

[15] G. Cao and C. A. Bouman, "Covariance estimation for high dimensional data vectors using the sparse matrix transform," Purdue University, Technical Report ECE 08-05, 2008.

[16] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, April 1965.

[17] A. Soman and P. Vaidyanathan, "Paraunitary filter banks and wavelet packets," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 4, pp. 397–400, Mar. 1992.