A Least Squares Formulation for Canonical Correlation Analysis

Liang Sun  sun.liang@asu.edu
Shuiwang Ji  shuiwang.ji@asu.edu
Jieping Ye  jieping.ye@asu.edu
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287 USA

Abstract

Canonical Correlation Analysis (CCA) is a well-known technique for finding the correlations between two sets of multi-dimensional variables. It projects both sets of variables into a lower-dimensional space in which they are maximally correlated. CCA is commonly applied for supervised dimensionality reduction, in which one of the multi-dimensional variables is derived from the class label. It has been shown that CCA can be formulated as a least squares problem in the binary-class case. However, their relationship in the more general setting remains unclear. In this paper, we show that, under a mild condition which tends to hold for high-dimensional data, CCA in multi-label classification can be formulated as a least squares problem. Based on this equivalence relationship, we propose several CCA extensions, including sparse CCA using 1-norm regularization. Experiments on multi-label data sets confirm the established equivalence relationship. Results also demonstrate the effectiveness of the proposed CCA extensions.

1. Introduction

Canonical Correlation Analysis (CCA) (Hotelling, 1936) is commonly used for finding the correlations between two sets of multi-dimensional variables. It makes use of two views of the same set of objects and projects them into a lower-dimensional space in which they are maximally correlated. CCA has been applied successfully in various applications (Hardoon et al., 2004; Vert & Kanehisa, 2003). One popular use of CCA is for supervised learning, in which one view is derived from the data and the other view is derived from the class labels. In this setting, the data can be projected into a lower-dimensional space directed by the label information. Such a formulation is particularly appealing in the context of dimensionality reduction for multi-label data (Yu et al., 2006).

Multivariate linear regression (MLR) that minimizes the sum-of-squares cost function is a well-studied technique for regression problems. It can also be applied for classification with an appropriate class indicator matrix (Bishop, 2006; Hastie et al., 2001). The solution to least squares problems can be obtained by solving a linear system of equations, and a number of algorithms, including the conjugate gradient algorithm, can be applied to solve it efficiently (Golub & Loan, 1996). Furthermore, the least squares formulation can be readily extended using the regularization technique. For example, 1-norm regularization can be incorporated into the least squares formulation to control model complexity and improve sparseness (Tibshirani, 1996). Sparseness often leads to easy interpretation and good generalization ability. It has been used successfully in PCA (d'Aspremont et al., 2004) and SVM (Zhu et al., 2003).

In contrast to least squares, CCA involves a generalized eigenvalue problem, which is computationally more expensive to solve. Furthermore, it is challenging to derive sparse CCA, as it involves a difficult sparse generalized eigenvalue problem. A convex relaxation of sparse CCA has been studied in (Sriperumbudur et al., 2007), where the exact sparse CCA formulation is relaxed in several steps. On the other hand, an interesting connection between least squares and CCA has been established in the literature. In particular, CCA has been shown to be equivalent to Fisher Linear Discriminant Analysis (LDA) for binary-class problems (Hastie et al., 1995). Meanwhile, it is well known that LDA is equivalent to least squares in this case (Bishop, 2006; Hastie et al., 2001). Thus CCA can be formulated as a least squares problem for binary-class problems. In practice, multi-class and multi-label problems are more prevalent. It is therefore tempting to investigate the relationship between least squares and CCA in this more general setting.

In this paper, we study the relationship between CCA and least squares for multi-label problems. We show that, under a mild condition which tends to hold for high-dimensional data, CCA can be formulated as a least squares problem by constructing a specific class indicator matrix. Based on this equivalence relationship, we propose several CCA extensions including sparse CCA using 1-norm regularization. Furthermore, the entire solution path for sparse CCA can be readily computed by the Least Angle Regression algorithm (LARS) (Efron et al., 2004). We evaluate the established theoretical results using a collection of multi-label data sets. Our experiments confirm the equivalence relationship between the two models under the given assumption. Results also show that, even when the assumption does not hold, they achieve very similar performance. Our experiments also demonstrate the effectiveness of the proposed CCA extensions.

Notations

The number of training samples, the data dimensionality, and the number of labels are denoted by $n$, $d$, and $k$, respectively. $x_i \in \mathbb{R}^d$ denotes the $i$th observation and $y_i \in \mathbb{R}^k$ encodes the corresponding label information. Let $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$ be the data matrix and $Y = [y_1, \cdots, y_n] \in \mathbb{R}^{k \times n}$ be the class label matrix. We assume that both $\{x_i\}_1^n$ and $\{y_i\}_1^n$ are centered, i.e., $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$.
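To make this notation concrete, the following minimal NumPy sketch (our illustration, not part of the original paper) builds a centered data matrix $X$ and label matrix $Y$ with the shapes assumed above; the sample sizes and the random multi-label assignment are purely hypothetical.

```python
import numpy as np

# Toy multi-label data: n samples, d features, k labels. The numbers below are
# illustrative; the paper only fixes the shapes and the centering assumption.
n, d, k = 100, 384, 10
rng = np.random.default_rng(0)

X_raw = rng.standard_normal((d, n))               # columns are observations x_i in R^d
Y_raw = (rng.random((k, n)) < 0.2).astype(float)  # columns encode the labels y_i

# Center both views so that sum_i x_i = 0 and sum_i y_i = 0.
X = X_raw - X_raw.mean(axis=1, keepdims=True)
Y = Y_raw - Y_raw.mean(axis=1, keepdims=True)
```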
2. Background and Related Work

In this section we give a brief overview of CCA and least squares, as well as several other related works.

2.1. Canonical Correlation Analysis

In CCA two different representations of the same set of objects are given, and a projection is computed for each representation such that they are maximally correlated in the dimensionality-reduced space. For $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$, CCA computes two projection vectors, $w_x \in \mathbb{R}^d$ and $w_y \in \mathbb{R}^k$, such that the following correlation coefficient

\[ \rho = \frac{w_x^T X Y^T w_y}{\sqrt{(w_x^T X X^T w_x)\,(w_y^T Y Y^T w_y)}} \tag{1} \]

is maximized. Since $\rho$ is invariant to the scaling of $w_x$ and $w_y$, CCA can be formulated equivalently as

\[ \max_{w_x, w_y} \; w_x^T X Y^T w_y \quad \text{subject to} \quad w_x^T X X^T w_x = 1, \;\; w_y^T Y Y^T w_y = 1. \tag{2} \]

We assume that $Y Y^T$ is nonsingular. It can be shown that $w_x$ can be obtained by solving the following optimization problem:

\[ \max_{w_x} \; w_x^T X Y^T (Y Y^T)^{-1} Y X^T w_x \quad \text{subject to} \quad w_x^T X X^T w_x = 1. \tag{3} \]

Both formulations in Eqs. (2) and (3) attempt to find the eigenvectors corresponding to the top eigenvalues of the following generalized eigenvalue problem:

\[ X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta \, X X^T w_x, \tag{4} \]

where $\eta$ is the eigenvalue corresponding to the eigenvector $w_x$. Multiple projection vectors under certain orthonormality constraints can be computed simultaneously by solving the following optimization problem (Hardoon et al., 2004):

\[ \max_{W} \; \mathrm{trace}\!\left( W^T X Y^T (Y Y^T)^{-1} Y X^T W \right) \quad \text{subject to} \quad W^T X X^T W = I, \tag{5} \]

where $W \in \mathbb{R}^{d \times \ell}$ is the projection matrix, $\ell$ is the number of projection vectors, and $I$ is the identity matrix. The solution to the optimization problem in Eq. (5) consists of the top $\ell$ eigenvectors of the generalized eigenvalue problem in Eq. (4).

In regularized CCA (rCCA), a regularization term $\lambda I$ with $\lambda > 0$ is added to $X X^T$ in Eq. (5) to prevent overfitting and avoid the singularity of $X X^T$ (Bach & Jordan, 2003). Specifically, rCCA solves the following generalized eigenvalue problem:

\[ X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta \, (X X^T + \lambda I) w_x. \tag{6} \]

2.2. Least Squares for Regression and Classification

In regression, we are given a training set $\{(x_i, t_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the observation and $t_i \in \mathbb{R}^k$ is the corresponding target. We assume that both the observations and the targets are centered, so the bias term can be ignored. In this case, the projection matrix $W$ can be computed by minimizing the following sum-of-squares cost function:

\[ \min_{W} \; \sum_{i=1}^{n} \| W^T x_i - t_i \|_2^2 = \| W^T X - T \|_F^2, \tag{7} \]

where $T = [t_1, \cdots, t_n]$. It is well known that the optimal projection matrix is given by

\[ W_{LS} = (X X^T)^{\dagger} X T^T, \tag{8} \]

where the pseudo-inverse is used in case $X X^T$ is singular. To improve the generalization ability of the model, a penalty term based on 2-norm or 1-norm regularization is commonly applied (Hastie et al., 2001).

Least squares is also commonly applied for classification. In the general multi-class case, we are given a data set consisting of $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \cdots, k\}$ denotes the class label of the $i$-th sample, and $k > 2$. To apply the least squares formulation to the multi-class case, the 1-of-$k$ binary coding scheme is usually employed to assign a vector-valued class code to each data point (Bishop, 2006). The solution to the least squares problem depends on the choice of the class indicator matrix. Several class indicator matrices have been proposed in the literature (Hastie et al., 2001).

2.3. Related Work

The inherent relationship between least squares and several other models has been established in the past. In particular, LDA for binary-class problems can be formulated as a least squares problem (Duda et al., 2000; Bishop, 2006). Moreover, this equivalence relationship can be extended to the multi-class case using a specific class indicator matrix (Ye, 2007). CCA has been shown to be equivalent to LDA for multi-class problems (Hastie et al., 1995). Thus, CCA is closely related to least squares in the multi-class case. We show in the next section that, under a mild condition, CCA can be formulated as a least squares problem for multi-label classification when one of the views used in CCA is derived from the labels.
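For reference, the following sketch (our illustration, not the authors' code) computes the CCA projection through the regularized generalized eigenvalue problem of Eqs. (4)-(6) and the least squares solution of Eq. (8); the function names and the default regularization value are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

def cca_projection(X, Y, n_components, reg=1e-6):
    """Top eigenvectors of X Y^T (Y Y^T)^{-1} Y X^T w = eta (X X^T + reg*I) w, cf. Eqs. (4)-(6)."""
    d = X.shape[0]
    M = X @ Y.T @ np.linalg.solve(Y @ Y.T, Y @ X.T)  # d x d left-hand-side matrix
    B = X @ X.T + reg * np.eye(d)                    # regularized right-hand side, Eq. (6)
    evals, evecs = eigh(M, B)                        # symmetric generalized eigensolver, ascending order
    return evecs[:, ::-1][:, :n_components]          # keep the top n_components eigenvectors

def least_squares_projection(X, T):
    """W_LS = (X X^T)^+ X T^T as in Eq. (8); T = [t_1, ..., t_n] is k x n."""
    return np.linalg.pinv(X @ X.T) @ X @ T.T
```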
3. Relationship between CCA and Least Squares

In this section we investigate the relationship between CCA and least squares in the multi-label case. We first define four matrices essential for our derivation:

\[ H = Y^T (Y Y^T)^{-\frac{1}{2}} \in \mathbb{R}^{n \times k}, \tag{9} \]
\[ C_{XX} = X X^T \in \mathbb{R}^{d \times d}, \tag{10} \]
\[ C_{HH} = X H H^T X^T \in \mathbb{R}^{d \times d}, \tag{11} \]
\[ C_{DD} = C_{XX} - C_{HH} \in \mathbb{R}^{d \times d}. \tag{12} \]

Note that we assume $n \geq k$ and $\mathrm{rank}(Y) = k$ for multi-label problems, so that $(Y Y^T)^{-\frac{1}{2}}$ is well-defined. It follows from the definitions above that the solution to CCA can be expressed as the eigenvectors corresponding to the top eigenvalues of the matrix $C_{XX}^{\dagger} C_{HH}$.

3.1. Basic Matrix Properties

In this subsection, we study the basic properties of the matrices involved in the following discussion. Following the definition of $H$ in Eq. (9), we have:

Lemma 1. Let $H$ be defined as in Eq. (9) and let $\{y_i\}_1^n$ be centered, i.e., $\sum_{i=1}^n y_i = 0$. Then: (1) $H$ has orthonormal columns, i.e., $H^T H = I_k$; (2) $H^T e = 0$.

Given $H \in \mathbb{R}^{n \times k}$ with orthonormal columns, there exists $D \in \mathbb{R}^{n \times (n-k)}$ such that $[H, D] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix (Golub & Loan, 1996), that is, $I_n = [H, D][H, D]^T = H H^T + D D^T$. It follows that

\[ C_{DD} = C_{XX} - C_{HH} = X D D^T X^T. \tag{13} \]

It can be verified from Eqs. (10), (11), and (13) that the matrices $C_{XX}$, $C_{HH}$, and $C_{DD}$ are all positive semidefinite.

Let the Singular Value Decomposition (SVD) of $X$ be

\[ X = U \Sigma V^T = [U_1, U_2] \, \mathrm{diag}(\Sigma_r, 0) \, [V_1, V_2]^T = U_1 \Sigma_r V_1^T, \tag{14} \]

where $r = \mathrm{rank}(X)$, $U$ and $V$ are orthogonal matrices, $\Sigma \in \mathbb{R}^{d \times n}$, $U_1 \in \mathbb{R}^{d \times r}$, $U_2 \in \mathbb{R}^{d \times (d-r)}$, $V_1 \in \mathbb{R}^{n \times r}$, $V_2 \in \mathbb{R}^{n \times (n-r)}$, and $\Sigma_r \in \mathbb{R}^{r \times r}$. Since $U_2$ lies in the null space of $X^T$, we have:

Lemma 2. Let $H$, $X$, $U_2$, and $D$ be defined as above. Then $H^T X^T U_2 = 0$ and $D^T X^T U_2 = 0$.

3.2. Computing CCA via Eigendecomposition

Recall that the solution to CCA consists of the top eigenvectors of the matrix $C_{XX}^{\dagger} C_{HH}$. We next show how to compute these eigenvectors. Define the matrix $A \in \mathbb{R}^{r \times k}$ by

\[ A = \Sigma_r^{-1} U_1^T X H. \tag{15} \]

Let the SVD of $A$ be $A = P \Sigma_A Q^T$, where $P \in \mathbb{R}^{r \times r}$ and $Q \in \mathbb{R}^{k \times k}$ are orthogonal, and $\Sigma_A \in \mathbb{R}^{r \times k}$ is diagonal. Then

\[ A A^T = P \Sigma_A \Sigma_A^T P^T. \tag{16} \]

The matrix $C_{XX}^{\dagger} C_{HH}$ can be diagonalized as follows:

\[
\begin{aligned}
C_{XX}^{\dagger} C_{HH} &= U_1 \Sigma_r^{-2} U_1^T X H H^T X^T
= U_1 \Sigma_r^{-1} A H^T X^T U U^T \\
&= U \begin{bmatrix} I_r \\ 0 \end{bmatrix} \Sigma_r^{-1} A \left[ H^T X^T U_1, \; H^T X^T U_2 \right] U^T
= U \begin{bmatrix} \Sigma_r^{-1} A A^T \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} U^T \\
&= U \begin{bmatrix} \Sigma_r^{-1} P & 0 \\ 0 & I \end{bmatrix}
      \begin{bmatrix} \Sigma_A \Sigma_A^T & 0 \\ 0 & 0 \end{bmatrix}
      \begin{bmatrix} P^T \Sigma_r & 0 \\ 0 & I \end{bmatrix} U^T,
\end{aligned}
\]

where the second equality follows since $U$ is orthogonal, the fourth equality follows since $H^T X^T U_2 = 0$ as shown in Lemma 2, and the last equality follows from Eq. (16). Thus the solution to CCA, which consists of the top $\ell$ eigenvectors of the matrix $C_{XX}^{\dagger} C_{HH}$, is given by

\[ W_{CCA} = U_1 \Sigma_r^{-1} P_{\ell}, \tag{17} \]

where $P_{\ell}$ contains the first $\ell$ columns of $P$.
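The SVD-based route of Eqs. (14)-(17) can be sketched as follows. This is an illustrative implementation under the stated assumptions that $X$ and $Y$ are centered and $\mathrm{rank}(Y) = k$; the function and variable names, as well as the rank tolerance, are our choices.

```python
import numpy as np

def cca_via_svd(X, Y, n_components):
    """Compute W_CCA = U1 Sigma_r^{-1} P_l following Eqs. (14)-(17).

    X: d x n centered data matrix; Y: k x n centered label matrix with rank(Y) = k.
    """
    # H = Y^T (Y Y^T)^{-1/2}, Eq. (9), via the symmetric inverse square root.
    eval_Y, evec_Y = np.linalg.eigh(Y @ Y.T)
    YYt_inv_sqrt = evec_Y @ np.diag(1.0 / np.sqrt(eval_Y)) @ evec_Y.T
    H = Y.T @ YYt_inv_sqrt                          # n x k, orthonormal columns (Lemma 1)

    # Thin SVD of X, Eq. (14): keep only the r nonzero singular values.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))
    U1, s_r = U[:, :r], s[:r]

    # A = Sigma_r^{-1} U1^T X H, Eq. (15), and its SVD A = P Sigma_A Q^T, Eq. (16).
    A = (U1.T @ X @ H) / s_r[:, None]
    P, sA, Qt = np.linalg.svd(A, full_matrices=False)

    # W_CCA = U1 Sigma_r^{-1} P_l, Eq. (17).
    return (U1 / s_r[None, :]) @ P[:, :n_components]
```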
3.3. Equivalence of CCA and Least Squares

Recall from Eq. (8) that for a given class indicator matrix $T$, the solution to the least squares problem is given by $(X X^T)^{\dagger} X T^T$. We define the class indicator matrix $\tilde{T}$ as follows:

\[ \tilde{T} = (Y Y^T)^{-\frac{1}{2}} Y = H^T. \tag{18} \]

In this case, the solution to the least squares problem is given by

\[ W_{LS} = (X X^T)^{\dagger} X H = U_1 \Sigma_r^{-2} U_1^T X H = U_1 \Sigma_r^{-1} A = U_1 \Sigma_r^{-1} P \Sigma_A Q^T. \tag{19} \]

It is clear from Eqs. (17) and (19) that the difference between CCA and least squares lies in $\Sigma_A$ and $Q^T$. We next show that all diagonal elements of $\Sigma_A$ are one under a mild condition, namely that $\mathrm{rank}(X) = n - 1$ and $\mathrm{rank}(Y) = k$. Note that the first condition is equivalent to requiring that the original data points be linearly independent before centering, which tends to hold for high-dimensional data. Before presenting the main result summarized in Theorem 1 below, we establish the following lemma:

Lemma 3. Assume that $\mathrm{rank}(C_{XX}) + s = \mathrm{rank}(C_{HH}) + \mathrm{rank}(C_{DD})$ for some non-negative integer $s$. Then for the matrix

\[ \hat{\Sigma}_A = \Sigma_A \Sigma_A^T = \mathrm{diag}(a_1, a_2, \cdots, a_r) \in \mathbb{R}^{r \times r}, \]

we have

\[ 1 = a_1 = \cdots = a_{f-s} > a_{f-s+1} \geq \cdots \geq a_f > a_{f+1} = \cdots = a_r = 0, \]

where $f = \mathrm{rank}(\Sigma_A)$.

Proof. Define the matrix $J \in \mathbb{R}^{d \times d}$ as follows:

\[ J = U \begin{bmatrix} \Sigma_r^{-1} P & 0 \\ 0 & I_{d-r} \end{bmatrix}. \tag{20} \]

It follows from the definitions of $C_{XX}$, $C_{HH}$, and $C_{DD}$ in Eqs. (10)-(12) that

\[
\begin{aligned}
J^T C_{XX} J &= \mathrm{diag}(I_r, 0), \\
J^T C_{HH} J &= \mathrm{diag}(\Sigma_A \Sigma_A^T, 0) = \mathrm{diag}(a_1, \cdots, a_r, 0, \cdots, 0), \\
J^T C_{DD} J &= J^T C_{XX} J - J^T C_{HH} J = \mathrm{diag}(b_1, \cdots, b_r, 0, \cdots, 0),
\end{aligned} \tag{21}
\]

where $b_i = 1 - a_i$, for $i = 1, \cdots, r$. Since $J$ is nonsingular, we have $\mathrm{rank}(C_{XX}) = \mathrm{rank}(J^T C_{XX} J) = r$. It follows from our assumption that

\[ \mathrm{rank}(J^T C_{HH} J) + \mathrm{rank}(J^T C_{DD} J) = r + s. \tag{22} \]

Since both $J^T C_{HH} J$ and $J^T C_{DD} J$ are diagonal, there are a total of $r + s$ nonzero elements in $J^T C_{HH} J$ and $J^T C_{DD} J$. Note that $f = \mathrm{rank}(\Sigma_A) = \mathrm{rank}(\hat{\Sigma}_A)$, thus $a_1 \geq \cdots \geq a_f > 0 = a_{f+1} = \cdots = a_r$. It follows from Eq. (21) that

\[ a_i + b_i = 1, \text{ for } 1 \leq i \leq r, \quad \text{and} \quad b_r \geq \cdots \geq b_1 \geq 0. \tag{23} \]

This implies that at least one of $a_i$ and $b_i$ is positive for $1 \leq i \leq r$. To satisfy the rank equality in Eq. (22), we therefore must have

\[
\begin{aligned}
1 &= a_1 = a_2 = \cdots = a_{f-s} > a_{f-s+1} \geq \cdots \geq a_f > a_{f+1} = \cdots = a_r = 0, \\
0 &= b_1 = b_2 = \cdots = b_{f-s} < b_{f-s+1} \leq \cdots \leq b_f < b_{f+1} = \cdots = b_r = 1.
\end{aligned}
\]

This completes the proof of the lemma.

Theorem 1. Assume that $\mathrm{rank}(X) = n - 1$ and $\mathrm{rank}(Y) = k$ for multi-label problems. Then we have

\[ \mathrm{rank}(C_{XX}) = n - 1, \tag{24} \]
\[ \mathrm{rank}(C_{HH}) = k, \tag{25} \]
\[ \mathrm{rank}(C_{DD}) = n - k - 1. \tag{26} \]

Thus $s = 0$, where $s$ is defined in Lemma 3, and

\[ 1 = a_1 = \cdots = a_k > a_{k+1} = \cdots = a_r = 0. \]

This implies that all diagonal elements of $\Sigma_A$ are ones.

Proof. Denote $e^T = [1, 1, \cdots, 1] \in \mathbb{R}^{1 \times n}$, $H = [h_1, \cdots, h_k]$, and $D = [h_{k+1}, \cdots, h_n]$. Note that $X$ is column centered, i.e., $\sum_{i=1}^n x_i = 0$. It follows from Lemma 1 that $H^T e = 0$, that is,

\[ h_i^T e = 0, \quad \text{for } 1 \leq i \leq k. \tag{27} \]

Since $[H, D]$ is an orthogonal matrix, $\{h_1, \cdots, h_n\}$ form a basis for $\mathbb{R}^n$. Thus we can represent $e \in \mathbb{R}^n$ as

\[ e = \sum_{i=1}^{n} \alpha_i h_i, \quad \text{where } \alpha_i \in \mathbb{R}. \tag{28} \]

It follows from the orthogonality of $[H, D]$ and Eq. (27) that $e$ can be expressed as $e = \sum_{i=k+1}^{n} \alpha_i h_i$, and

\[ 0 = X e = X \left( \sum_{i=k+1}^{n} \alpha_i h_i \right) = \sum_{i=k+1}^{n} \alpha_i (X h_i). \tag{29} \]

Since not all $\alpha_i$'s are zero, the $n - k$ columns of $X D$ are linearly dependent, thus $\mathrm{rank}(X D) \leq n - k - 1$. According to the property of matrix rank, we have

\[ \mathrm{rank}(X D) \geq \mathrm{rank}(X) + \mathrm{rank}(D) - n = (n - 1) + (n - k) - n = n - k - 1. \tag{30} \]

Thus, $\mathrm{rank}(X D) = n - k - 1$ holds. For the matrix $X H$, we have

\[ \mathrm{rank}(X) = \mathrm{rank}(X [H, D]) \leq \mathrm{rank}(X H) + \mathrm{rank}(X D) \;\Rightarrow\; n - 1 \leq \mathrm{rank}(X H) + n - k - 1 \;\Rightarrow\; \mathrm{rank}(X H) \geq k. \]

On the other hand, since $X H \in \mathbb{R}^{d \times k}$, $\mathrm{rank}(X H) \leq k$. Thus we have $\mathrm{rank}(X H) = k$, and

\[ \mathrm{rank}(C_{XX}) = \mathrm{rank}(X) = n - 1, \quad \mathrm{rank}(C_{HH}) = \mathrm{rank}(X H) = k, \quad \mathrm{rank}(C_{DD}) = \mathrm{rank}(X D) = n - k - 1. \]

It follows that $s = 0$. On the other hand,

\[ f = \mathrm{rank}(\Sigma_A) = \mathrm{rank}(\Sigma_r^{-1} U_1^T X H) = \mathrm{rank}(X H) = k. \]

Hence $1 = a_1 = a_2 = \cdots = a_k > 0 = a_{k+1} = \cdots = a_r$, and all diagonal elements of $\Sigma_A$ are ones. This completes the proof of the theorem.

Since $\mathrm{rank}(\Sigma_A) = k$, $C_{XX}^{\dagger} C_{HH}$ contains $k$ nonzero eigenvalues. If we choose $\ell = k$, then

\[ W_{CCA} = U_1 \Sigma_r^{-1} P_k. \tag{31} \]

The only difference between $W_{LS}$ and $W_{CCA}$ lies in the orthogonal matrix $Q^T$ in $W_{LS}$. In practice, we can use either $W_{CCA}$ or $W_{LS}$ to project the original data onto a lower-dimensional space before classification. For any classifier based on Euclidean distance, the orthogonal transformation $Q^T$ will not affect the classification performance, since the Euclidean distance is invariant under orthogonal transformations. Some well-known algorithms with this property include the K-Nearest-Neighbor (KNN) algorithm (Duda et al., 2000) based on the Euclidean distance and the linear Support Vector Machine (SVM) (Schölkopf & Smola, 2002). In the following, the least squares CCA formulation is called "LS-CCA".
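The equivalence can be checked numerically. The sketch below is illustrative only: it reuses the `cca_via_svd` routine sketched in Section 3.2, generates random data for which the condition of Theorem 1 is expected to hold (d/n > 1), builds the indicator matrix of Eq. (18), and verifies that $W_{LS}$ and $W_{CCA}$ differ only by an orthogonal transformation.

```python
import numpy as np

# Empirical check of Theorem 1 (illustrative; dimensions chosen so that d/n > 1,
# which makes rank(X) = n - 1 hold after centering for generic data).
rng = np.random.default_rng(1)
n, d, k = 50, 200, 6
X = rng.standard_normal((d, n)); X -= X.mean(axis=1, keepdims=True)
Y = (rng.random((k, n)) < 0.3).astype(float); Y -= Y.mean(axis=1, keepdims=True)

# Class indicator matrix T_tilde = (Y Y^T)^{-1/2} Y = H^T, Eq. (18).
eval_Y, evec_Y = np.linalg.eigh(Y @ Y.T)
T_tilde = evec_Y @ np.diag(1.0 / np.sqrt(eval_Y)) @ evec_Y.T @ Y      # k x n

# Least squares solution, Eq. (19): W_LS = (X X^T)^+ X H.
W_ls = np.linalg.pinv(X @ X.T) @ X @ T_tilde.T

# CCA solution, Eq. (17), reusing the cca_via_svd sketch from Section 3.2.
W_cca = cca_via_svd(X, Y, n_components=k)

# W_LS = W_CCA Q^T for an orthogonal Q, so the induced projections coincide.
print(np.linalg.matrix_rank(X) == n - 1)                    # condition of Theorem 1
print(np.allclose(W_ls @ W_ls.T, W_cca @ W_cca.T, atol=1e-8))
```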
4. Extensions of CCA

Based on the equivalence relationship established in the last section, the classical CCA formulation can be extended using the regularization technique. Regularization is commonly used to control the complexity of the model and improve the generalization performance. Linear regression using 2-norm regularization, called ridge regression (Hoerl & Kennard, 1970), minimizes a penalized sum-of-squares cost function. By using the class indicator matrix $\tilde{T}$ in Eq. (18), we obtain the 2-norm regularized least squares CCA formulation (called "LS-CCA2") by minimizing the following objective function:

\[ L_2(W, \lambda) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \left( x_i^T w_j - \tilde{T}_{ij} \right)^2 + \lambda \| w_j \|_2^2 \right), \]

where $W = [w_1, \cdots, w_k]$, $\tilde{T}_{ij}$ denotes the entry of $\tilde{T}$ associated with the $i$th sample and the $j$th label, and $\lambda > 0$ is the regularization parameter.

In mathematical programming, it is known that sparseness can often be achieved by penalizing the $L_1$-norm of the variables (Donoho, 2006; Tibshirani, 1996). This penalty has been introduced into the least squares formulation, and the resulting model is called the lasso (Tibshirani, 1996). Based on the established equivalence relationship between CCA and least squares, we derive the 1-norm regularized least squares CCA formulation (called "LS-CCA1") by minimizing the following objective function:

\[ L_1(W, \gamma) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \left( x_i^T w_j - \tilde{T}_{ij} \right)^2 + \gamma \| w_j \|_1 \right). \]

The optimal $w_j$, for $1 \leq j \leq k$, is given by

\[ w_j^* = \arg\min_{w_j} \left( \sum_{i=1}^{n} \left( x_i^T w_j - \tilde{T}_{ij} \right)^2 + \gamma \| w_j \|_1 \right), \tag{32} \]

which can be reformulated as

\[ w_j^* = \arg\min_{\| w_j \|_1 \leq \varepsilon} \; \sum_{i=1}^{n} \left( x_i^T w_j - \tilde{T}_{ij} \right)^2, \tag{33} \]

for some tuning parameter $\varepsilon > 0$ (Tibshirani, 1996). Furthermore, the solution can be readily computed by the Least Angle Regression algorithm (LARS) (Efron et al., 2004). One key feature of LARS is that it computes the entire solution path for all values of $\varepsilon$ with essentially the same computational cost as fitting the model for a single value of $\varepsilon$.

If the value of $\varepsilon$ is large enough, the constraint in Eq. (33) is not effective, resulting in an unconstrained optimization problem. We can thus consider $\varepsilon$ from a finite range $[0, \hat{\varepsilon}]$ for some $\hat{\varepsilon} > 0$. Define $\tau = \varepsilon / \hat{\varepsilon}$, so that $\varepsilon = \tau \hat{\varepsilon}$ with $0 \leq \tau \leq 1$. The estimation of $\varepsilon$ is then equivalent to the estimation of $\tau$. Cross-validation is commonly used to estimate the optimal $\tau$ from a large candidate set $S = \{\tau_1, \tau_2, \cdots, \tau_{|S|}\}$, where $|S|$ denotes the size of $S$. If the value of $\tau$ is sufficiently small, many of the coefficients in $W$ become exactly zero, which leads to a sparse CCA model. We thus call $\tau$ the "sparseness coefficient".
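A minimal sketch of LS-CCA2 and LS-CCA1 built on scikit-learn's Ridge and Lasso solvers is given below. The helper name is ours, and note that scikit-learn scales its penalty differently from the objective functions above, so the parameter `lam` is only loosely comparable to $\lambda$ and $\gamma$.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

def ls_cca_regularized(X, T_tilde, lam=1e-2, sparse=False):
    """LS-CCA2 (ridge) or LS-CCA1 (lasso) fit to the indicator matrix of Eq. (18).

    X: d x n centered data, T_tilde: k x n indicator matrix; returns W of shape d x k.
    sklearn's Lasso divides its penalty by the number of samples, so `lam` is not
    on exactly the same scale as lambda/gamma in the objectives above.
    """
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000) if sparse \
        else Ridge(alpha=lam, fit_intercept=False)
    model.fit(X.T, T_tilde.T)       # rows are samples; target columns are labels
    return model.coef_.T            # coef_ is k x d; transpose to d x k
```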
5. Experiments

We use a collection of multi-label data sets to experimentally verify the equivalence relationship established in this paper. We also evaluate the performance of the various CCA extensions.

5.1. Experimental Setup

We use two types of data in the experiments. The gene expression pattern image data* describe the gene expression patterns of Drosophila during development (Tomancak et al., 2002). Each image is annotated with a variable number of textual terms (labels) from a controlled vocabulary. We apply Gabor filters to extract a 384-dimensional feature vector from each image. We use five data sets with different numbers of terms (class labels). We also use the scene data set (Boutell et al., 2004), which contains 2407 samples of dimension 294 with 6 labels. In all experiments, ten random splittings of the data into training and test sets are generated, and the averaged performance is reported.

Five methods are compared: CCA and its regularized version rCCA in Eq. (6), as well as LS-CCA and its regularized versions LS-CCA2 and LS-CCA1. These CCA methods are used to project the data into a lower-dimensional space, in which a linear SVM is applied for classification for each label. The Receiver Operating Characteristic (ROC) score is computed for each label, and the averaged performance over all labels is reported.

* All images were extracted from the FlyExpress database at http://www.flyexpress.net.
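The evaluation protocol just described can be sketched as follows. This is a hypothetical helper around scikit-learn: the paper does not specify the SVM parameters, so defaults are assumed, and each label is assumed to occur in both the training and the test set.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def mean_roc_score(W, X_train, Y_train, X_test, Y_test):
    """Project with W, train one linear SVM per label, and average per-label ROC scores.

    X_* are d x n_* centered data matrices; Y_* are k x n_* binary (0/1) label matrices.
    Each label must appear in both the training and the test split.
    """
    Z_train, Z_test = (W.T @ X_train).T, (W.T @ X_test).T   # samples in the reduced space
    scores = []
    for j in range(Y_train.shape[0]):
        clf = LinearSVC().fit(Z_train, Y_train[j])
        scores.append(roc_auc_score(Y_test[j], clf.decision_function(Z_test)))
    return float(np.mean(scores))
```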
5.2. Gene Expression Pattern Image Data

In this experiment we first evaluate the equivalence relationship between CCA and least squares. For all cases, we set the data dimensionality $d$ larger than the sample size $n$, i.e., $d/n > 1$, and the condition in Theorem 1 holds in all cases. We observe that for all splittings of all five data sets, $\mathrm{rank}(C_{XX})$ equals $\mathrm{rank}(C_{HH}) + \mathrm{rank}(C_{DD})$, and the ratio of the maximal to the minimal diagonal element of $\Sigma_A$ is 1, which implies that all diagonal elements of $\Sigma_A$ are the same, i.e., ones. Our experimental evidence is consistent with the theoretical results presented in Section 3.3.

5.2.1. Performance Comparison

In Table 1, we report the mean ROC scores over all terms and all splittings for each data set. The main observations include: (1) CCA and LS-CCA achieve the same performance for all data sets, which is consistent with our theoretical results; (2) the regularized CCA extensions, including rCCA, LS-CCA2, and LS-CCA1, perform much better than their counterparts CCA and LS-CCA without regularization; and (3) LS-CCA2 is comparable to rCCA on all data sets, while LS-CCA1 achieves the best performance in all cases. These results further justify the use of the proposed least squares CCA formulations for multi-label classification.

Table 1. Comparison of different CCA methods in terms of mean ROC scores. $n_{tot}$ denotes the total number of images in the data set, and $k$ denotes the number of terms (labels). Ten different splittings of the data into training (of size $n$) and test (of size $n_{tot} - n$) sets are applied for each data set. For the regularized algorithms, the value of the parameter is chosen via cross-validation. The proposed sparse CCA model (LS-CCA1) performs the best for this data.

n_tot  k   CCA    LS-CCA  rCCA   LS-CCA2  LS-CCA1
863    10  0.542  0.542   0.617  0.619    0.722
1041   15  0.534  0.534   0.602  0.603    0.707
1138   20  0.538  0.538   0.609  0.610    0.714
1222   25  0.540  0.540   0.603  0.605    0.704
1349   30  0.548  0.548   0.606  0.608    0.709

5.2.2. Sensitivity Study

In this experiment, we investigate the performance of LS-CCA and its variants in comparison with CCA when the condition in Theorem 1 does not hold, which is the case in many applications. Specifically, we use a gene data set with the dimensionality fixed at $d = 384$, while the size of the training set varies from 100 to 900 with a step size of about 100. The performance of the different algorithms as the training size increases is presented in Figure 1 (left graph). We observe that, in general, the performance of all algorithms increases with the training size. When $n$ is small, the condition in Theorem 1 holds; thus CCA and LS-CCA are equivalent, and they achieve the same performance. When $n$ further increases, CCA and LS-CCA achieve different ROC scores, although the difference is always very small in our experiments. As in the previous experiment, we observe from the figure that the regularized methods perform much better than CCA and LS-CCA, and LS-CCA2 is comparable to rCCA. The sparse formulation LS-CCA1 performs the best for this data set.

[Figure 1. Comparison of all algorithms on the gene data set (left) and the scene data set (right) in terms of mean ROC scores; both panels plot the mean ROC score (y-axis) against the training size (x-axis).]

5.3. Scene Data Set

We conduct a similar set of experiments on the scene data. As for the gene data sets, the equivalence relationship holds when the condition in Theorem 1 holds. For the performance comparison and sensitivity study, we generate a sequence of training sets with the size $n$ ranging from 100 to 900 with a step size of around 100. The results are summarized in Figure 1 (right graph). Similar to the gene data set, CCA and LS-CCA achieve the same performance when $n$ is small, and they differ slightly when $n$ is large. We can also observe from the figure that the regularized algorithms, including rCCA, LS-CCA2, and LS-CCA1, perform much better than CCA and LS-CCA without regularization, and LS-CCA2 performs slightly better than the others on this data set.

5.4. The Entire CCA Solution Path

In this experiment, we investigate the sparse CCA model, i.e., LS-CCA1, using the scene data set. Recall that the sparseness of the weight vectors $w_i$ depends on the sparseness coefficient $\tau$ between 0 and 1. Figure 2 shows the entire collection of solution paths for a subset of the coefficients from the first weight vector $w_1$. The x-axis denotes the sparseness coefficient $\tau$, and the y-axis denotes the value of the coefficients. The vertical lines denote (a subset of) the turning points of the path, as the solution path for each coefficient is piecewise linear (Efron et al., 2004). We can observe from Figure 2 that when $\tau = 1$, most of the coefficients are non-zero, i.e., the model is dense. As the value of the sparseness coefficient decreases (from the right to the left side along the x-axis), more and more coefficients become exactly zero. All coefficients become zero when $\tau = 0$.

[Figure 2. The entire collection of solution paths for a subset of the coefficients from the first weight vector $w_1$ on the scene data set. The x-axis denotes the sparseness coefficient $\tau$, and the y-axis denotes the value of the coefficients.]
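The solution path computation can be sketched as follows. This is illustrative only: scikit-learn's `lars_path` is used in place of the authors' implementation, and rescaling the 1-norm of the coefficients along the path to $[0, 1]$ is one way (an assumption on our part) to obtain a quantity comparable to the sparseness coefficient $\tau$.

```python
import numpy as np
from sklearn.linear_model import lars_path

def solution_path(X, t_j):
    """Entire lasso solution path for one column w_j of W, cf. Eq. (33) and LARS.

    X: d x n centered data; t_j: length-n target column of the indicator matrix.
    Returns an approximate sparseness coefficient in [0, 1] and the coefficient paths.
    """
    _, _, coefs = lars_path(X.T, t_j, method="lasso")   # coefs has shape (d, n_steps)
    l1_norms = np.abs(coefs).sum(axis=0)
    tau = l1_norms / l1_norms[-1] if l1_norms[-1] > 0 else l1_norms
    return tau, coefs
```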
6. Conclusion and Future Work

In this paper we show that CCA for multi-label classification can be formulated as a least squares problem under a mild condition, which tends to hold for high-dimensional data. Based on the equivalence relationship established in this paper, we propose several CCA extensions including sparse CCA. We have conducted experiments on a collection of multi-label data sets to validate the proposed equivalence relationship. Our experimental results show that the performance of the proposed least squares formulation and CCA is very close even when the condition does not hold. Results also demonstrate the effectiveness of the proposed CCA extensions.

The proposed least squares formulation facilitates the incorporation of unlabeled data into the CCA framework through the graph Laplacian, which captures the local geometry of the data (Belkin et al., 2006). We plan to examine the effectiveness of this semi-supervised CCA model for learning from both labeled and unlabeled data. The proposed sparse CCA performs well for the gene data set; we plan to analyze the biological relevance of the features extracted via the sparse CCA model.

Acknowledgments

This research is sponsored in part by funds from the Arizona State University and the National Science Foundation under Grant No. IIS-0612069.

References

Bach, F. R., & Jordan, M. I. (2003). Kernel independent component analysis. J. Mach. Learn. Res., 3, 1-48.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7, 2399-2434.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37, 1757-1771.

d'Aspremont, A., Ghaoui, L., Jordan, M., & Lanckriet, G. (2004). A direct formulation for sparse PCA using semidefinite programming. NIPS (pp. 41-48).

Donoho, D. (2006). For most large underdetermined systems of linear equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59, 907-934.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: John Wiley and Sons, Inc.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Golub, G. H., & Loan, C. F. V. (1996). Matrix computations. Baltimore, MD: Johns Hopkins Press.

Hardoon, D. R., Szedmak, S. R., & Shawe-Taylor, J. R. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Comput., 16, 2639-2664.

Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23, 73-102.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.

Hotelling, H. (1936). Relations between two sets of variables. Biometrika, 28, 312-377.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Sriperumbudur, B. K., Torres, D. A., & Lanckriet, G. R. G. (2007). Sparse eigen methods by D.C. programming. ICML (pp. 831-838).

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58, 267-288.

Tomancak, P., et al. (2002). Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3.

Vert, J.-P., & Kanehisa, M. (2003). Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA. NIPS (pp. 1425-1432).

Ye, J. (2007). Least squares linear discriminant analysis. ICML (pp. 1087-1094).

Yu, S., Yu, K., Tresp, V., & Kriegel, H.-P. (2006). Multi-output regularized feature projection. IEEE Trans. Knowl. Data Eng., 18, 1600-1613.

Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm support vector machines. NIPS (pp. 49-56).