A DC Programming Approach for Sparse Eigenvalue Problem

Mamadou Thiao  mamadou.thiao@insa-rouen.fr
Laboratory of Mathematics LMI, INSA de Rouen, 76801 Saint-Etienne-du-Rouvray Cedex, FRANCE

Tao Pham Dinh  pham@insa-rouen.fr
Laboratory of Mathematics LMI, INSA de Rouen, 76801 Saint-Etienne-du-Rouvray Cedex, FRANCE

Hoai An Le Thi  lethi@univ-metz.fr
Laboratory of Theoretical and Applied Computer Science, UFR MIM, Metz University, 57045 Metz, FRANCE

Abstract

We investigate the sparse eigenvalue problem, which arises in various fields such as machine learning and statistics. Unlike standard approaches relying on approximation of the $\ell_0$-norm, we work with an equivalent reformulation of the problem at hand as a DC program. Our starting point is the eigenvalue problem to which a sparsity constraint is added. The resulting problem is first formulated as a mixed integer program, and exact penalty techniques are used to transform it equivalently into a DC program, which is then solved by a customized DCA. Computational results for sparse principal component analysis are reported; they show the usefulness of our approach, which compares favourably with related standard methods based on approximation of the $\ell_0$-norm.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

The eigenvalue problem is a classical problem with many applications in science and engineering. One of its uses is principal component analysis (PCA) in statistics, a powerful and popular tool for factor analysis and modeling of data.

The main aim of PCA is to extract principal components corresponding to directions of maximal variance in the data, each principal component being a linear combination of the input variables. In practice, each principal component given by PCA generally involves all the input variables (i.e. all the coefficients of the linear combination are typically non-zero), which raises problems of interpretation and human understanding in many cases. To bypass this problem, several techniques, called sparse PCA techniques, have been developed in the literature to obtain principal components with a small number of nonzero coefficients that still explain most of the variance present in the data.

The first work we are aware of concerning sparse PCA is that of Cadima and Jolliffe (1995), who proposed simple axis rotation and component thresholding for subset selection. Later, Jolliffe et al. (2003) proposed SCoTLASS, which enforces a sparsity constraint on the principal directions by bounding their $\ell_1$-norm, leading to a nonconvex procedure. Recent years have witnessed a flurry of research on algorithms and theory for sparse PCA: Zou et al. (2006) proposed SPCA, an $\ell_1$-penalized regression algorithm for PCA based on least squares regression; convex relaxations leading to semidefinite programs (such as DSPCA) were proposed by d'Aspremont et al. (2007; 2008); Moghaddam et al. (2005) proposed GSPCA, a combinatorial optimization algorithm based on bidirectional greedy search; and Sriperumbudur et al. (2007) proposed DC-PCA, a DC programming algorithm (DCA) obtained by penalizing the $\ell_0$-approximation term proposed by Weston et al. (2003).
In this paper, we propose a new solution method for the sparse eigenvalue problem using DC programming, which does not approximate the $\ell_0$-norm as is usually done but works with it directly. Our approach sheds new light on the use of the $\ell_0$-norm and is particularly interesting for related problems in which approximating the $\ell_0$-norm is not satisfactory. We first formulate the problem as a mixed integer program and, using an appropriate penalty function, show that it can be reformulated as a DC program (minimization of a DC function over a closed convex set) by exact penalty techniques in DC programming. The resulting DC program is handled by DCA, which consists of solving a sequence of quadratically constrained linear programs (QCLP), each with complexity $O(n^2)$.

DC programming and DCA were first introduced, in their preliminary form, by Pham Dinh Tao in 1985 and have been extensively developed since 1994 by Le Thi Hoai An and Pham Dinh Tao. They are now classic and increasingly popular (see e.g. Pham Dinh & Le Thi, 1997; 1998; Le Thi et al., 1999; Le Thi & Pham Dinh, 2001; 2005, and http://lita.sciences.univ-metz.fr/lethi/). DCA has been applied successfully to many large-scale (smooth and nonsmooth) nonconvex programs in various domains of applied sciences, for which it has proved to be very robust and efficient (see e.g. Le Thi & Pham Dinh, 2003; Weber et al., 2006; Pham Dinh et al., 2008).

For the sparse eigenvalue problem, we specialize a suitable DCA that takes the specific structure of the problem into account. Computational experiments demonstrate that DCA finds the correct solution on the artificial data proposed in (Zou et al., 2006). On the well-known pit props benchmark (Jeffers, 1967), widely used to assess sparse PCA methods because of its lack of sparseness and the resulting difficulty of interpretation, DCA performs better in terms of sparsity, explained variance and practical use than SPCA, DSPCA and DC-PCA. Finally, DCA gives better results than SPCA and DC-PCA in a high-dimensional setting, the well-known colon tumor data (Alon et al., 1999).

The paper is organized as follows. In the next two sections we present the variational formulations of the eigenvalue and sparse eigenvalue problems, respectively, and state different reformulations in order to obtain equivalent DC programs for the original problems. We outline DC programming and DCA in Section 4, while Section 5 is devoted to the customized DCA for solving the related DC program. Some extensions are reported in Section 6. Computational experiments on the datasets mentioned above are reported in the last section, where we compare the performance of DCA with related standard methods based on approximation of the $\ell_0$-norm.

Notation. In this paper, $e = (1,\ldots,1)^T \in \mathbb{R}^n$, and $e^1,\ldots,e^n$ are the standard basis vectors of $\mathbb{R}^n$. $I$ denotes the $n \times n$ identity matrix, $S^n = \{X \in M_n(\mathbb{R}) : X = X^T\}$ and $S^n_+ = \{X \in S^n : X \text{ positive semidefinite}\}$. $\lambda_{\max}(A)$ denotes the maximal eigenvalue of $A$, and $\|x\|_0 := \mathrm{card}\{i \in \{1,\ldots,n\} : x_i \neq 0\}$.

2. Eigenvalue Problem

The variational formulation of the eigenvalue problem is
$$\max \{ x^T A x : x^T x = 1 \}, \qquad (1)$$
where $x \in \mathbb{R}^n$ and $A = (A_{ij})_{i,j=1,\ldots,n}$ is an $n \times n$ symmetric real matrix. Problem (1) is nonconvex; nevertheless, efficient methods exist that find a global solution in polynomial time.
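For concreteness, problem (1) can be solved numerically with a dense eigendecomposition or with power iteration. The following NumPy sketch is only illustrative (the function names are ours, not the paper's) and computes the leading eigenpair both ways.

import numpy as np

def leading_eigpair(A):
    # Global solution of (1): the eigenvector of A with largest eigenvalue.
    vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    return vals[-1], vecs[:, -1]

def power_iteration(A, iters=1000, tol=1e-10):
    # Classical power method; converges to the leading eigenvector
    # (assuming A is PSD with a simple leading eigenvalue, as in Section 3).
    x = np.random.default_rng(0).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            break
        x = y
    return x @ A @ x, x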
In the PCA setting, the goal is to extract the $r$ leading eigenvectors of the sample covariance matrix $A_0$ (symmetric positive semidefinite), since these eigenvectors are the loadings of the first $r$ principal components. Usually the Hotelling deflation method (White, 1958; Saad, 1998) is used to extract these eigenvectors sequentially. At the $t$-th iteration we extract the leading eigenvector of $A_{t-1}$,
$$x_t \in \arg\max \{ x^T A_{t-1} x : x^T x = 1 \}, \qquad A_t = A_{t-1} - (x_t^T A_{t-1} x_t)\, x_t x_t^T,$$
and the $(t+1)$-st leading eigenvector of $A_0$ is the leading eigenvector of $A_t$.

3. Sparse Eigenvalue Problem

The variational formulation of the sparse eigenvalue problem is
$$\max \{ x^T A x : x^T x = 1,\ \|x\|_0 \le k \}, \qquad (2)$$
where $k \in \mathbb{N}$, $1 \le k < n$. It is nonconvex, discontinuous, combinatorial and NP-hard. Because of the difficulty of handling the cardinality constraint directly, relaxed problems are usually considered. In SCoTLASS (Jolliffe et al., 2003) the $\ell_1$-norm is used as an approximation of the $\ell_0$-norm. Convex relaxations based on semidefinite programming have been proposed in the literature, such as the DSPCA of d'Aspremont et al. (2007). Other methods absorb the cardinality constraint into the objective function, such as the SPCA of Zou et al. (2006) and the DC-PCA of Sriperumbudur et al. (2007). All these methods solve approximate problems that are not equivalent to the original problem. In this work we propose an exact reformulation of (2) as the minimization of a quadratic objective function over a closed convex set.

Without loss of generality we assume from now on that $A \in S^n_+$ and $A \neq 0$; indeed, if $A \notin S^n_+$, we choose $\mu > 0$ such that $\mu I + A \in S^n_+$ and consider the problem $\max\{x^T(\mu I + A)x : x^T x = 1,\ \|x\|_0 \le k\}$, which is equivalent to (2).

By replacing the quadratic equality constraint $x^T x = 1$ in (2) with the inequality constraint $x^T x \le 1$, we obtain the problem
$$\max \{ x^T A x : x^T x \le 1,\ \|x\|_0 \le k \}, \qquad (3)$$
which is equivalent to (2) by the following proposition.

Proposition 1  (2) and (3) are equivalent.

Proof  It suffices to show that $\bar{x}^T \bar{x} = 1$ for any solution $\bar{x}$ of (3). Assume by contradiction that $\bar{x}^T \bar{x} < 1$. As $A \in S^n_+$ and $A \neq 0$, and since every basis vector $e^i$ is feasible for (3), we have $\bar{x}^T A \bar{x} \ge \max_i A_{ii} > 0$; thus $\bar{x} \neq 0$ and $\tilde{x} := \bar{x}/\|\bar{x}\|$ is a feasible point of (3). Then $\tilde{x}^T A \tilde{x} = \bar{x}^T A \bar{x} / (\bar{x}^T \bar{x}) > \bar{x}^T A \bar{x}$, in contradiction with the fact that $\bar{x}$ is a solution of (3).

A mixed-integer formulation of (3) is given by
$$\max_{(x,u)} \; x^T A x \quad \text{s.t.} \quad x^T x \le 1,\ |x_j| \le u_j,\ j = 1,\ldots,n,\ e^T u \le k,\ u \in \{0,1\}^n. \qquad (4)$$

With the aim of writing (4) in a continuous formulation, we define
$$q(u) := \sum_{j=1}^n u_j (1 - u_j) = e^T u - u^T u. \qquad (5)$$

Property 1  $q$ is a finite concave function, nonnegative on $[0,1]^n$, and $u \in \{0,1\}^n \iff (q(u) \le 0,\ u \in [0,1]^n)$.
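For small $n$, the combinatorial problem (2) (equivalently the mixed-integer formulation (4)) can be solved exactly by enumerating all supports of size $k$ and taking the leading eigenvector of the corresponding principal submatrix. The following NumPy sketch is only an illustrative reference implementation useful for sanity checks; the function name is ours, not the paper's.

import numpy as np
from itertools import combinations

def sparse_eig_bruteforce(A, k):
    # Exact solution of (2) for small n: for a fixed support S with |S| = k,
    # the best unit vector supported on S is the leading eigenvector of A[S, S].
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in combinations(range(n), k):
        S = list(S)
        vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[S] = vecs[:, -1]
    return best_val, best_x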
3.0.1. Formulation 1

Let $t > 0$ and consider the penalty program
$$\min_{(x,u)} \; f_t(x,u) := -x^T A x + t\, q(u) \quad \text{s.t.} \quad x^T x \le 1,\ |x_j| \le u_j,\ j = 1,\ldots,n,\ e^T u \le k,\ u \in [0,1]^n. \qquad (6)$$
The following proposition shows the equivalence between (3) and (6).

Proposition 2  Let $c := \max_{j=1,\ldots,n} \big( 2 \sum_{k=1, k\neq j}^n |A_{kj}| + A_{jj} \big)$ and $t > t_0(A) := 2 \max\{c,\ \lambda_{\max}(A) - \max_j A_{jj}\}$. Then (3) and (6) are equivalent in the following sense:
· if $\bar{x}$ is a solution of (3), then there exists $\bar{u}$ such that $(\bar{x},\bar{u})$ is a solution of (6);
· if $(\bar{x},\bar{u})$ is a solution of (6), then $\bar{x}$ is a solution of (3).

Proof  It suffices to show that $q(\bar{u}) = 0$ for any solution $(\bar{x},\bar{u})$ of (6). Assume by contradiction that $q(\bar{u}) > 0$. We point out how to construct a feasible point $(x,u)$ of (6) such that $f_t(\bar{x},\bar{u}) > f_t(x,u)$, in contradiction with the fact that $(\bar{x},\bar{u})$ is a solution of (6). Put $J := \{j \in \{1,\ldots,n\} : 0 < \bar{u}_j < 1 - \bar{u}_j\}$ and $I := \{j \in \{1,\ldots,n\} : 0 < 1 - \bar{u}_j \le \bar{u}_j\}$, and consider the following cases.
Case $J \neq \emptyset$. Choose $j_0 \in J$ and put $x_j := \bar{x}_j$, $u_j := \bar{u}_j$ for $j \neq j_0$, and $x_{j_0} = u_{j_0} := 0$.
Case $J = \emptyset$ and $I \neq \emptyset$. If $e^T \bar{u} < k$, then choose $i_0 \in I$ and $\epsilon > 0$ such that $e^T \bar{u} + \epsilon \le k$ and $\bar{u}_{i_0} + \epsilon \le 1$, and put $x := \bar{x}$, $u_i := \bar{u}_i$ for $i \neq i_0$ and $u_{i_0} := \bar{u}_{i_0} + \epsilon$; else choose $i$ such that $A_{ii} = \max_j A_{jj}$ and put $(x,u) := (e^i, e^i)$.

From Formulation 1 we derive a second formulation by replacing the constraint $e^T u \le k$ with $e^T u = k$ and by removing $e^T u$ from the objective function (once $e^T u = k$ is imposed, the term $t\,e^T u$ in $t\,q(u)$ is constant and can be dropped).

3.0.2. Formulation 2

Consider the problem
$$\min_{(x,u)} \; f_{t1}(x,u) := -x^T A x - t\, u^T u \quad \text{s.t.} \quad x^T x \le 1,\ |x_j| \le u_j,\ j = 1,\ldots,n,\ e^T u = k,\ u \in [0,1]^n, \qquad (7)$$
with $t > t_0(A)$.

Proposition 3  (3) and (7) are equivalent in the same sense as in Proposition 2.

Proof  It suffices to show that (6) has a solution $(\bar{x},\bar{u})$ that satisfies $e^T \bar{u} = k$. Let $(\tilde{x},\tilde{u})$ be a solution of (6); thus $e^T \tilde{u} \le k$ and $\tilde{u} \in \{0,1\}^n$. To construct such a solution $(\bar{x},\bar{u})$, we set $\bar{x} := \tilde{x}$, $\bar{u}_j := \tilde{u}_j$ for all $j$ such that $\tilde{u}_j = 1$, and complete the remaining components of $\bar{u}$ by 0 or 1 so as to obtain $e^T \bar{u} = k$.

Even though the constraints are convex, the objective functions of (6) and (7) are nonconvex (concave). The particular structure of these objective functions suggests the use of DC programming, one of the optimization tools suited to this type of problem, which we introduce in the next section.
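The penalty threshold in Proposition 2 is straightforward to compute. The sketch below (our own helper names, shown only to make the formulas concrete) evaluates $t_0(A)$ and the penalized objectives $f_t$ and $f_{t1}$.

import numpy as np

def t0(A):
    # t0(A) = 2 * max{ c, lambda_max(A) - max_j A_jj } with
    # c = max_j ( 2 * sum_{k != j} |A_kj| + A_jj ), as in Proposition 2.
    diag = np.diag(A)
    c = np.max(2.0 * (np.sum(np.abs(A), axis=0) - np.abs(diag)) + diag)
    lam_max = np.linalg.eigvalsh(A)[-1]
    return 2.0 * max(c, lam_max - diag.max())

def f_t(A, t, x, u):
    # Objective of Formulation 1 (6): -x'Ax + t*q(u), with q(u) = e'u - u'u.
    return -x @ A @ x + t * (u.sum() - u @ u)

def f_t1(A, t, x, u):
    # Objective of Formulation 2 (7): -x'Ax - t*u'u.
    return -x @ A @ x - t * (u @ u)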
4. DC Programming and DCA

Let $\Gamma_0(\mathbb{R}^n)$ denote the convex cone of all lower semicontinuous proper convex functions on $\mathbb{R}^n$. The vector space of DC functions, $DC(\mathbb{R}^n) = \Gamma_0(\mathbb{R}^n) - \Gamma_0(\mathbb{R}^n)$, is large enough to contain most real-life objective functions and is closed under the operations usually considered in optimization. Consider the standard DC program
$$(P_{dc}) \qquad \alpha = \inf \{ f(x) := g(x) - h(x) : x \in \mathbb{R}^n \},$$
where $g, h \in \Gamma_0(\mathbb{R}^n)$. A DC program $(P_{dc})$ is called a polyhedral DC program when either $g$ or $h$ is a polyhedral convex function (i.e. the pointwise supremum of a finite collection of affine functions). Note that a polyhedral convex function is almost always differentiable, that is, differentiable everywhere except on a set of measure zero.

Let $C$ be a nonempty closed convex set. Then the problem
$$\inf \{ F(x) := g(x) - h(x) : x \in C \} \qquad (8)$$
can be transformed into an unconstrained DC program by using the indicator function $\chi_C$ of $C$ ($\chi_C(x) = 0$ if $x \in C$, $+\infty$ otherwise), i.e.
$$\inf \{ f(x) := \varphi(x) - h(x) : x \in \mathbb{R}^n \}, \qquad (9)$$
where $\varphi := g + \chi_C$ is in $\Gamma_0(\mathbb{R}^n)$. Let
$$g^*(y) := \sup \{ y^T x - g(x) : x \in \mathbb{R}^n \} \qquad (10)$$
be the conjugate function of $g$. The dual problem of $(P_{dc})$ is defined by
$$(D_{dc}) \qquad \beta = \inf \{ h^*(y) - g^*(y) : y \in \mathbb{R}^n \}. \qquad (11)$$
Under the natural convention in DC programming that $+\infty - (+\infty) = +\infty$, and using the fact that every function $h \in \Gamma_0(\mathbb{R}^n)$ is characterized as a pointwise supremum of affine functions, more precisely
$$h(x) := \sup \{ y^T x - h^*(y) : y \in \mathbb{R}^n \}, \qquad (12)$$
it can be proved that $\alpha = \beta$. There is a perfect symmetry between primal and dual DC programs: the dual of $(D_{dc})$ is $(P_{dc})$.

Recall that, for $\theta \in \Gamma_0(\mathbb{R}^n)$ and $x^0 \in \mathrm{dom}\,\theta := \{x \in \mathbb{R}^n : \theta(x) < +\infty\}$, the subdifferential of $\theta$ at $x^0$, denoted $\partial\theta(x^0)$, is defined as
$$\partial\theta(x^0) := \{ y \in \mathbb{R}^n : \theta(x) \ge \theta(x^0) + (x - x^0)^T y,\ \forall x \in \mathbb{R}^n \}, \qquad (13)$$
which is a closed convex subset of $\mathbb{R}^n$. It generalizes the derivative in the sense that $\theta$ is differentiable at $x^0$ if and only if $\partial\theta(x^0)$ reduces to a singleton, which is exactly $\{\nabla\theta(x^0)\}$.

The necessary local optimality condition for the primal DC program $(P_{dc})$ is
$$\partial h(x^*) \subset \partial g(x^*). \qquad (14)$$
Condition (14) is also sufficient for many important classes of DC programs, for example for polyhedral DC programs, or when the function $f$ is locally convex at $x^*$ (Pham Dinh & Le Thi, 1997; 1998; Le Thi et al., 1999; Le Thi & Pham Dinh, 2001; 2005). A point $x^*$ that verifies the generalized Kuhn-Tucker condition
$$\partial h(x^*) \cap \partial g(x^*) \neq \emptyset \qquad (15)$$
is called a critical point of $g - h$. It follows that if $h$ is polyhedral convex, then a critical point of $g - h$ is almost always a local solution to $(P_{dc})$.

The transportation of global solutions between $(P_{dc})$ and $(D_{dc})$ is expressed by:

Property 2
$$\bigcup \{ \partial h(x^*) : x^* \in \mathcal{P} \} \subset \mathcal{D}, \qquad \bigcup \{ \partial g^*(y^*) : y^* \in \mathcal{D} \} \subset \mathcal{P}, \qquad (16)$$
where $\mathcal{P}$ and $\mathcal{D}$ denote the solution sets of $(P_{dc})$ and $(D_{dc})$, respectively. The first inclusion becomes an equality if $h$ is subdifferentiable on $\mathcal{P} \cap \mathrm{dom}\,g$, and the second becomes an equality if $g^*$ is subdifferentiable on $\mathcal{D} \cap \mathrm{dom}\,h^*$. Under certain technical conditions, this property also holds for the local solutions of $(P_{dc})$ and $(D_{dc})$; see (Pham Dinh & Le Thi, 1997; 1998; Le Thi et al., 1999; Le Thi & Pham Dinh, 2001; 2005) for more information.

Property 3  Let $x^*$ be a local solution to $(P_{dc})$ and let $y^* \in \partial h(x^*)$. If $g^*$ is differentiable at $y^*$, then $y^*$ is a local solution to $(D_{dc})$. Similarly, let $y^*$ be a local solution to $(D_{dc})$ and let $x^* \in \partial g^*(y^*)$. If $h$ is differentiable at $x^*$, then $x^*$ is a local solution to $(P_{dc})$.

Based on local optimality conditions and duality in DC programming, the DC Algorithm (DCA) consists in constructing two sequences $x^l$ and $y^l$ of trial solutions of the primal and dual programs, respectively, such that the sequences $g(x^l) - h(x^l)$ and $h^*(y^l) - g^*(y^l)$ are decreasing, and $x^l$ (resp. $y^l$) converges to a primal feasible solution $\tilde{x}$ (resp. a dual feasible solution $\tilde{y}$) satisfying the local optimality condition and
$$\tilde{x} \in \partial g^*(\tilde{y}), \qquad \tilde{y} \in \partial h(\tilde{x}). \qquad (17)$$
The DCA then yields the following simple scheme:
$$y^l \in \partial h(x^l); \qquad x^{l+1} \in \partial g^*(y^l). \qquad (18)$$
In other words, the two sequences $x^l$ and $y^l$ are determined so that $x^{l+1}$ and $y^{l+1}$ are solutions of the convex primal program $(P_l)$ and dual program $(D_{l+1})$, respectively, defined as
$$(P_l) \qquad \min \{ g(x) - h(x^l) - (x - x^l)^T y^l : x \in \mathbb{R}^n \}, \qquad (19)$$
$$(D_{l+1}) \qquad \min \{ h^*(y) - g^*(y^l) - (y - y^l)^T x^{l+1} : y \in \mathbb{R}^n \}. \qquad (20)$$

At each iteration, the DCA performs a double linearization using the subgradients of $h$ and $g^*$. In fact, at each iteration one replaces in the primal DC program $(P_{dc})$ the second component $h$ by its affine minorization $h_l(x) := h(x^l) + (x - x^l)^T y^l$ to construct the convex program $(P_l)$, whose solution set is nothing but $\partial g^*(y^l)$. Likewise, the second DC component $g^*$ of the dual DC program $(D_{dc})$ is replaced by its affine minorization $g^*_l(y) := g^*(y^l) + (y - y^l)^T x^{l+1}$ to obtain the convex program $(D_{l+1})$, whose solution set is $\partial h(x^{l+1})$. Thus the DCA works with the convex DC components $g$ and $h$, not with the DC function $f$ itself. Moreover, a DC function $f$ has infinitely many DC decompositions, which have crucial impacts on the performance of the DCA in terms of speed of convergence, robustness, efficiency, and globality of computed solutions. Convergence properties of the DCA and its theoretical basis are described in (Pham Dinh & Le Thi, 1997; 1998; Le Thi et al., 1999; Le Thi & Pham Dinh, 2001; 2005).
However, it is worthwhile to summarize the following properties for the sake of completeness:

· DCA is a descent method (without line search): the sequences $(g(x^l) - h(x^l))$ and $(h^*(y^l) - g^*(y^l))$ are decreasing, with
$$g(x^{l+1}) - h(x^{l+1}) \le h^*(y^l) - g^*(y^l) \le g(x^l) - h(x^l). \qquad (21)$$

· If $g(x^{l+1}) - h(x^{l+1}) = g(x^l) - h(x^l)$, then $x^l$ is a critical point of $g - h$ and $y^l$ is a critical point of $h^* - g^*$. In this case, DCA terminates at the $l$-th iteration.

· If the optimal value $\alpha$ of problem $(P_{dc})$ is finite and the infinite sequences $x^l$ and $y^l$ are bounded, then every limit point $\tilde{x}$ (resp. $\tilde{y}$) of the sequence $x^l$ (resp. $y^l$) is a critical point of $g - h$ (resp. $h^* - g^*$).

· DCA has linear convergence for general DC programs. In particular, for polyhedral DC programs the sequences $x^l$ and $y^l$ contain finitely many elements and the algorithm converges to a solution in a finite number of iterations.

We shall apply these DCA features to solve (7).

5. Sparse Eigenvalue by DC Programming and DCA

We now consider a new approach based on DC programming and DCA to solve (3). DCA requires a reformulation of the problem so that the objective function is represented as the difference of two convex functions; the original problem then becomes a DC program in which a DC function is minimized over a convex set. In this section, we introduce a DC reformulation and then present the corresponding DCA.

According to the previous section, a DC formulation of (7) is given by
$$\min \{ G(x,u) - H(x,u) : (x,u) \in \mathbb{R}^n \times \mathbb{R}^n \}, \qquad (22)$$
where $G(x,u) := \chi_C(x,u)$, $H(x,u) := x^T A x + t\, u^T u$ and $C$ is the feasible set of (7). Performing DCA for problem (22) then amounts to computing the two sequences $(x^l, u^l)$ and $(X^l, U^l)$ defined by
$$(X^l, U^l) \in \partial H(x^l, u^l), \qquad (x^{l+1}, u^{l+1}) \in \partial G^*(X^l, U^l). \qquad (23)$$
In other words, we have to compute the subdifferentials $\partial H$ and $\partial G^*$. We have
$$(X^l, U^l) \in \partial H(x^l, u^l) \iff X^l = 2 A x^l,\ U^l = 2 t u^l, \qquad (24)$$
and $\partial G^*(X^l, U^l)$ is the solution set of the convex program
$$\min \{ -(X^l)^T x - (U^l)^T u : (x,u) \in C \}, \qquad (25)$$
a solution of which can be computed with polynomial complexity $O(n^2)$ by using the KKT conditions (Thiao et al., 2009).

Algorithm 1 summarizes the DCA applied to (7), and the following proposition shows that for $k = n$ our DCA reduces to the power iteration algorithm.

Algorithm 1  DCA for the sparse eigenvalue problem
Input: $A \in S^n_+$, $1 \le k < n$, $t > 0$, $(x^0, u^0) \in \mathbb{R}^n \times \mathbb{R}^n_+$ and the tolerance $\epsilon$
Output: $(x, u)$
Initialize $l := 0$
repeat
  $X^l := 2 A x^l$, $U^l := 2 t u^l$
  Compute $(x^{l+1}, u^{l+1})$, a solution of (25)
  $l := l + 1$
until $|f_{t1}(x^{l+1}, u^{l+1}) - f_{t1}(x^l, u^l)| \le \epsilon$
$x := x^l$ and $u := u^l$.

Proposition 4  Let $k = n$. Then our DCA reduces to the power method for eigenvalue computation.

Proof  As $e^T u^l = k = n$ and $u^l \in [0,1]^n$, we have $u^l = e$ for all $l \ge 1$; thus (25) reduces to $\min \{ -(A x^l)^T x : x^T x \le 1 \}$, and by applying the KKT conditions to this problem we obtain $x^{l+1} = A x^l / \|A x^l\|_2$.
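As a concrete illustration of Algorithm 1, the sketch below runs the DCA iteration (24)-(25), but solves the convex subproblem (25) with a generic convex solver (cvxpy) instead of the customized $O(n^2)$ routine of Thiao et al. (2009). It is a minimal sketch under that substitution, with our own function name, not the authors' implementation; the parameter $t$ should exceed $t_0(A)$ from Proposition 2 (e.g. computed with the earlier sketch).

import numpy as np
import cvxpy as cp

def dca_sparse_eig(A, k, t, max_iter=200, eps=1e-6, seed=0):
    # Minimal DCA sketch for Formulation 2 (7).
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    u = np.full(n, k / n)                     # feasible start: e'u = k, u in [0,1]^n
    f_prev = np.inf
    for _ in range(max_iter):
        X, U = 2 * A @ x, 2 * t * u           # (24): subgradient of H
        xv, uv = cp.Variable(n), cp.Variable(n)
        cons = [cp.sum_squares(xv) <= 1, cp.abs(xv) <= uv,
                cp.sum(uv) == k, uv >= 0, uv <= 1]
        # Maximizing X'x + U'u over C is the subproblem (25).
        cp.Problem(cp.Maximize(X @ xv + U @ uv), cons).solve()
        x, u = xv.value, uv.value
        f_val = -x @ A @ x - t * (u @ u)      # f_t1 of (7)
        if abs(f_prev - f_val) <= eps:
            break
        f_prev = f_val
    return x, u

For $k = n$ this iteration collapses to power iteration, in line with Proposition 4.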
6. Extensions

In this section we present exact penalty techniques for some other sparse eigenvalue formulations. First, consider the formulation proposed in (El Ghaoui, 2006), in which the $\ell_0$-norm term is absorbed into the objective function:
$$\max \{ x^T A x - \rho \|x\|_0 : x^T x = 1 \}, \qquad (26)$$
where $\rho > 0$. By a reasoning similar to that of Section 3, we show that (26) is equivalent to
$$\min_{(x,u)} \; -x^T A x + \rho\, e^T u + t\, q(u) \quad \text{s.t.} \quad x^T x \le 1,\ |x_j| \le u_j,\ j = 1,\ldots,n,\ u \in [0,1]^n, \qquad (27)$$
in the same sense as in Proposition 2, for all
$$t > 2 \max \Big\{ \rho,\ \max_{j=1,\ldots,n} \Big( 2 \sum_{k=1, k\neq j}^n |A_{kj}| + A_{jj} \Big) \Big\}.$$

Second, even though we used the quadratic concave penalty function $q(u) = \sum_{j=1}^n u_j(1-u_j)$ in the reformulations (6) and (27), another, polyhedral, concave penalty function $p(u) := \sum_{j=1}^n \min(u_j, 1-u_j)$ could be used to obtain similar reformulations.

7. Experiments and Results

In this section, we illustrate the effectiveness of the proposed method in the context of sparse principal component analysis. We present experiments on artificial data and on different real-life datasets.

In the sparse PCA setting, the sparse eigenvectors of the covariance matrix $A$ are usually obtained by applying the algorithm to a sequence of deflated matrices, with the same (or different) $k$ depending on the sparsity requirement. Here we use an orthogonalized projection deflation technique (see (Mackey, 2008) for more details on deflation techniques for sparse PCA):
$$A_0 = A, \qquad v_i = (I - V_{i-1} V_{i-1}^T)\, x_i \,/\, \|(I - V_{i-1} V_{i-1}^T)\, x_i\|, \qquad A_i = (I - v_i v_i^T)\, A_{i-1}\, (I - v_i v_i^T),$$
where $v_1 = x_1$ and $v_1, \ldots, v_{i-1}$ form the columns of $V_{i-1}$. Since $v_1, \ldots, v_{i-1}$ form an orthonormal basis for the space spanned by $x_1, \ldots, x_{i-1}$ (here $(x_1,u_1), \ldots, (x_{i-1},u_{i-1})$ are the outputs of Algorithm 1 applied to $A_0, \ldots, A_{i-2}$), $P_{i-1} = V_{i-1} V_{i-1}^T$ is the corresponding orthogonal projection. The cumulative variance is then calculated as $\sum_i v_i^T A_i v_i$.
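A sketch of this deflation loop could look as follows; it reuses a DCA routine such as the dca_sparse_eig sketch above (so the same caveats apply), and the names are ours.

import numpy as np

def sparse_pcs(A, ks, t, dca):
    # Orthogonalized projection deflation: one sparse loading per k in ks.
    # dca(A_i, k, t) is expected to return (x, u), e.g. dca_sparse_eig above.
    n = A.shape[0]
    A_i, V, variances = A.copy(), [], []
    for k in ks:
        x, _ = dca(A_i, k, t)
        if V:
            M = np.array(V)                   # rows are v_1, ..., v_{i-1}
            v = (np.eye(n) - M.T @ M) @ x     # project onto span(V)^perp
        else:
            v = x
        v /= np.linalg.norm(v)
        variances.append(v @ A_i @ v)         # contribution to cumulative variance
        P = np.eye(n) - np.outer(v, v)
        A_i = P @ A_i @ P                     # projection deflation
        V.append(v)
    return np.array(V), float(np.sum(variances))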
7.1. Artificial Data

We consider the simulation example proposed in (Zou et al., 2006). In this example, three hidden factors are created:
$$V_1 \sim N(0, 290), \quad V_2 \sim N(0, 300), \quad V_3 = -0.3\, V_1 + 0.925\, V_2 + \epsilon, \quad \epsilon \sim N(0,1),$$
with $V_1$, $V_2$ and $\epsilon$ independent. Afterwards, 10 observed variables are generated as follows:
$$X_i = V_j + \epsilon_i^j, \qquad \epsilon_i^j \sim N(0,1),$$
with $j = 1$ for $i = 1,\ldots,4$, $j = 2$ for $i = 5,\ldots,8$, and $j = 3$ for $i = 9, 10$, the $\epsilon_i^j$ being independent for $j = 1,2,3$, $i = 1,\ldots,10$. The variances of the three underlying factors are 290, 300 and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4 and 2. Therefore $V_2$ and $V_1$ are almost equally important, and both are much more important than $V_3$. The first two principal components together explain more than 99% of the total variance. These facts suggest that we only need to consider two derived variables with sparse representations. Ideally, the first derived variable should recover the factor $V_2$ using only $(X_5, X_6, X_7, X_8)$, and the second derived variable should recover the factor $V_1$ using only $(X_1, X_2, X_3, X_4)$. Exactly as DSPCA, SPCA and SCoTLASS (see (d'Aspremont et al., 2007)), and taking $k = 4$ for each of the first two principal components, our DCA algorithm finds the correct sparse loadings of the first two principal components; the results are summarized in Table 1.

Table 1. Loadings and explained variance for the first two principal components of the artificial example for DCA.

          X1   X2   X3   X4   X5   X6   X7   X8   X9   X10   Explained variance
DCA PC1    0    0    0    0   .5   .5   .5   .5    0    0    40.9%
DCA PC2   .5   .5   .5   .5    0    0    0    0    0    0    39.5%

7.2. Pit Props Data

The pit props dataset (Jeffers, 1967) has become a standard benchmark example to test sparse PCA algorithms. The first six principal components (PCs) capture 87% of the total variance, so the methods above compare their explanatory power using six sparse principal components. Since it was shown in (Sriperumbudur et al., 2007) that DC-PCA gives a better result than DSPCA and SPCA, with 13 non-zero loadings and 77.1% of the total variance for the first six sparse principal components, we use it for our comparison. Table 2 shows the first 3 PCs for SPCA, DSPCA and DC-PCA, and the first 6 PCs for our DCA algorithm. With the same cardinality pattern $(k_1, k_2, k_3, k_4, k_5, k_6) := (6, 2, 2, 1, 1, 1)$, i.e. 13 non-zero loadings, our DCA captures almost the same variance (77.05%). We observe that all the principal components $C_1, \ldots, C_6$ generated by DCA satisfy the orthonormality property ($C_i^T C_i = 1$ and $C_i^T C_j = 0$, $i \neq j$), which is not the case for DC-PCA. Another advantage is that in DCA the sparsity requirement is stated explicitly, whereas in DC-PCA it is difficult to set the regularization parameter so as to attain a given sparsity. Table 3 shows the variation of the explained variance with $k$ for the first principal component.

Table 2. Pit Props: loadings of the first three principal components for SPCA, DSPCA and DC-PCA, and of the six PCs for our DCA algorithm. SPCA, DSPCA and DC-PCA loadings are taken from (Zou et al., 2006), (d'Aspremont et al., 2007) and (Sriperumbudur et al., 2007), respectively.

             x1     x2     x3    x4    x5    x6     x7     x8     x9     x10    x11   x12   x13
SPCA    C1  -.477  -.476    0     0    .177    0   -.250  -.344  -.416  -.400    0     0     0
SPCA    C2    0      0    .785  .620    0      0     0    -.021    0      0      0    .013   0
SPCA    C3    0      0      0     0    .640  .589   .492    0      0      0      0     0   -.015
DSPCA   C1  -.560  -.583    0     0     0      0   -.263  -.099  -.371  -.362    0     0     0
DSPCA   C2    0      0    .707  .707    0      0     0      0      0      0      0     0     0
DSPCA   C3    0      0      0     0     0    -.793 -.610    0      0      0      0     0    .012
DC-PCA  C1   .449   .459    0     0     0      0    .374   .332   .403   .419    0     0     0
DC-PCA  C2    0      0    .707  .707    0      0     0      0      0      0      0     0     0
DC-PCA  C3    0      0      0     0     0     .816  .578    0      0      0      0     0     0
DCA     C1  -.444  -.453    0     0     0      0   -.379  -.341  -.403  -.419    0     0     0
DCA     C2    0      0    .707  .707    0      0     0      0      0      0      0     0     0
DCA     C3    0      0      0     0     0     .694  .721    0      0      0      0     0     0
DCA     C4    0      0      0     0     0      0     0      0      0      0      0     0     1
DCA     C5    0      0      0     0     0      0     0      0      0      0      1     0     0
DCA     C6    0      0      0     0     0      0     0      0      0      0      0    -1     0

Table 3. Pit Props: variation of the explained variance with k for the first principal component.

k            1     2      3      4      5      6      7      8      9      10     11     12     13
Variance %   7.69  15.03  19.04  22.56  26.20  29.00  30.74  31.30  31.83  32.10  32.37  32.44  32.45

7.3. Colon Cancer Data

The colon cancer data (Alon et al., 1999) consist of 62 tissue samples (22 normal and 40 cancerous) with the gene expression profiles of $n = 2000$ genes extracted from DNA micro-array data. We consider its first 5 principal components, which explain 70% of the total variance. Figure 1 shows that DCA gives much better results than SPCA and DC-PCA: DC-PCA explains only 62% of the cumulative variance with a cumulative cardinality of more than 6000, whereas our DCA algorithm explains 65.94% of the cumulative variance with a cumulative cardinality of 5000, using the cardinality pattern $(k_1, k_2, k_3, k_4, k_5) := (1800, 800, 800, 800, 800)$. Table 4 reports the variation of the explained variance with $k$ for the first principal component.

Figure 1. Colon Cancer: (a) cumulative variance and (b) cumulative cardinality for the first 5 sparse principal components (PCA, DCA, DC-PCA, SPCA).

Table 4. Colon Cancer: variation of the explained variance with k for the first principal component.

k            200   400    600    800    1000   1200   1400   1600   1800   2000
Variance %   7.70  14.35  20.22  25.51  30.25  34.41  38.11  41.30  43.76  44.96
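Before summarizing, the experimental protocol of Section 7.1 can be made concrete with the following generation sketch of the artificial data of Zou et al. (2006); the sample size and names are our own choices, not the paper's.

import numpy as np

def zou_artificial_data(m=1000, seed=0):
    # Three hidden factors and 10 observed variables, as in Section 7.1.
    rng = np.random.default_rng(seed)
    V1 = rng.normal(0.0, np.sqrt(290.0), m)
    V2 = rng.normal(0.0, np.sqrt(300.0), m)
    V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0.0, 1.0, m)
    X = np.empty((m, 10))
    for i in range(10):
        V = V1 if i < 4 else (V2 if i < 8 else V3)
        X[:, i] = V + rng.normal(0.0, 1.0, m)     # X_i = V_j + independent noise
    return X

# Covariance matrix to feed the sparse PCA sketches above (e.g. with k = 4):
# A = np.cov(zou_artificial_data().T)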
8. Summary and Future Work

We have proposed a sparse eigenvalue algorithm based on DC programming and DCA. Our method differs substantially from previous works and approaches in sparse PCA, starting with the exact reformulation of the problem at hand as a DC program (minimization of a DC function over a closed convex set); this is original given that, because of the discontinuity of the $\ell_0$-norm, standard methods must resort to approximating it, without equivalence between the problem and its approximation. The simplicity of the reformulations and the good results provided by DCA on several well-known benchmark datasets for sparse PCA show that DCA is efficient and promising for sparse PCA. We have also proposed further reformulations which could allow various approaches. We are extending this technique to other problems involving the $\ell_0$-norm and investigating global optimization techniques.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues. Cell Biology, 96:6745-6750, 1999.

Cadima, J. and Jolliffe, I. T. Loadings and correlations in the interpretation of principal components. Applied Statistics, pp. 203-214, 1995.

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

d'Aspremont, A., Bach, F., and El Ghaoui, L. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269-1294, 2008.

El Ghaoui, L. On the quality of a semidefinite programming bound for sparse principal component analysis. arXiv.org, 2006.

Jeffers, J. Two case studies in the application of principal components. Applied Statistics, 16:225-236, 1967.

Jolliffe, I. T., Trendafilov, N. V., and Uddin, M. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12:531-547, 2003.

Le Thi, H. A. and Pham Dinh, T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133:23-46, 2005.

Le Thi, H. A. and Pham Dinh, T. A continuous approach for globally solving linearly constrained quadratic zero-one programming problems. Optimization, 50(1-2):93-120, 2001.

Le Thi, H. A. and Pham Dinh, T. Large scale molecular optimization from distance matrices by a DC optimization approach. SIAM Journal on Optimization, 4(1):77-116, 2003.

Le Thi, H. A., Pham Dinh, T., and Le Dung, M. Exact penalty in DC programming. Vietnam Journal of Mathematics, 27(2):169-179, 1999.
Mackey, L. Deflation methods for sparse PCA. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2008), 2008.

Moghaddam, B., Weiss, Y., and Avidan, S. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2005), 2005.

Pham Dinh, T. and Le Thi, H. A. Convex analysis approaches to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):287-367, 1997.

Pham Dinh, T. and Le Thi, H. A. D.C. optimization algorithms for solving the trust region subproblem. SIAM J. Optim., pp. 476-505, 1998.

Pham Dinh, T., Le Thi, H. A., and Akoa, F. Combining DCA and interior point techniques for large-scale nonconvex quadratic programming. Optimization Methods and Software, 23(4):609-629, 2008.

Saad, Y. Projection and deflation methods for partial pole assignment in linear state feedback. IEEE Trans. Automat. Contr., 33:290-297, 1998.

Sriperumbudur, B. K., Torres, D. A., and Lanckriet, G. R. G. Sparse eigen methods by DC programming. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 831-838, 2007.

Thiao, M., Pham Dinh, T., and Le Thi, H. A. Solutions of a linear program with an additional Euclidean unit ball constraint by a customized polynomial algorithm. Technical report, Laboratory of Mathematics, LMI, INSA de Rouen, Saint-Etienne-du-Rouvray Cedex, France, 2009.

Weber, S., Nagy, A., Schüle, T., Schnörr, C., and Kuba, A. A benchmark evaluation of large-scale optimization approaches to binary tomography. In Proceedings of the Conference on Discrete Geometry for Computer Imagery (DGCI 2006), volume 4245, 2006.

Weston, J., Elisseeff, A., Schölkopf, B., and Tipping, M. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439-1461, 2003.

White, P. The computation of eigenvalues and eigenvectors of a matrix. Journal of the Society for Industrial and Applied Mathematics, 6(4):393-437, 1958.

Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:265-286, 2006.