Expectation-Maximization for Sparse and Non-Negative PCA

Christian D. Sigg (chrsigg@inf.ethz.ch)
Joachim M. Buhmann (jbuhmann@inf.ethz.ch)
Institute of Computational Science, ETH Zurich, 8092 Zurich, Switzerland

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We study the problem of finding the dominant eigenvector of the sample covariance matrix, under additional constraints on the vector: a cardinality constraint limits the number of non-zero elements, and non-negativity forces the elements to have equal sign. This problem is known as sparse and non-negative principal component analysis (PCA), and has many applications including dimensionality reduction and feature selection. Based on expectation-maximization for probabilistic PCA, we present an algorithm for any combination of these constraints. Its complexity is at most quadratic in the number of dimensions of the data. We demonstrate significant improvements in performance and computational efficiency compared to other constrained PCA algorithms, on large data sets from biology and computer vision. Finally, we show the usefulness of non-negative sparse PCA for unsupervised feature selection in a gene clustering task.

1. Introduction

Principal component analysis (PCA) provides a lower dimensional approximation of high dimensional data, where the reconstruction error (measured by Euclidean distance) is minimal. The first principal component (PC) is the solution to

    w* = arg max_w  w^T C w,   subject to ||w||_2 = 1,    (1)

where C ∈ R^{D×D} is the positive semi-definite covariance matrix of the data. It is straightforward to show that the first PC is the dominant eigenvector of C, i.e. the eigenvector corresponding to the largest eigenvalue. The first PC maximizes the variance of the projected data, while the second PC again maximizes the variance, under the constraint that it is orthogonal to the first, and so on.
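As a concrete illustration of eq. (1) (this sketch is ours and not part of the original paper; the data and variable names are made up), the first PC of a centered data matrix can be read off an eigendecomposition of the sample covariance matrix:

```python
# Our illustration (not from the paper): the first PC of eq. (1) is the
# dominant eigenvector of the sample covariance matrix C.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))          # toy data: N = 500 samples, D = 20
X -= X.mean(axis=0)                         # center the data
C = X.T @ X / X.shape[0]                    # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
w = eigvecs[:, -1]                          # dominant eigenvector = first PC
assert np.isclose(w @ C @ w, eigvals[-1])   # w attains the maximal variance
```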
Constrained PCA and its Applications. We consider problem (1) under two additional constraints on w: sparsity ||w||_0 ≤ K and non-negativity w ≥ 0.¹ Constraining PCA permits a trade-off between maximizing statistical fidelity on the one hand, and facilitating interpretability and applicability on the other (d'Aspremont et al., 2007). Although it is often the case that PCA provides a good approximation with few PCs, each component is usually a linear combination of all original features. Enforcing sparsity facilitates the identification of the relevant influence factors and is therefore an unsupervised feature selection method. In applications where a fixed penalty is associated with each included dimension (e.g. transaction costs in finance), a small loss in variance for a large reduction in cardinality can lead to an overall better solution. Enforcing non-negativity renders PCA applicable to domains where only a positive influence of features is deemed appropriate (e.g. due to the underlying physical process). Moreover, the total variance is explained additively by each component, instead of by the mixed sign structure of unconstrained PCA. Often non-negative solutions already show some degree of sparsity, but a combination of both constraints enables precise control of the cardinality. Sparse PCA has been successfully applied to gene ranking (d'Aspremont et al., 2007), and non-negative sparse PCA has been compared favorably to non-negative matrix factorization for image parts extraction (Zass & Shashua, 2006).

¹ See the final paragraph of this section for a definition of our notation.

Related Work. Problem (1) is a concave programming problem, and is NP-hard if either sparsity or non-negativity is enforced (Horst et al., 2000). Although an efficient global optimizer is therefore unlikely, local optimizers often find good or even optimal solutions in practice, and global optimality can be tested in O(D^3) (d'Aspremont et al., 2007), where D is the dimensionality of the data. As is evident from writing the objective function of (1) as

    w^T C w = sum_{i=1}^{D} sum_{j=1}^{D} C_{ij} w_i w_j,    (2)

setting w_k to zero excludes the k-th column and row of C from the summation. For a given sparsity pattern S = {i | w_i ≠ 0}, the optimal solution is the dominant eigenvector of the corresponding submatrix of C. For sparse PCA, the computationally hard part is therefore to identify the optimal sparsity pattern, and any solution can potentially be improved by keeping only S and recomputing the weights, a process called variational renormalization by Moghaddam et al. (2006); a small sketch of this renormalization step is given after the following list. Sparse PCA methods can be characterized by the following two paradigms:

1. Relaxation of the hard cardinality constraint ||w||_0 ≤ K into a convex constraint ||w||_1 ≤ B, thus approximating the combinatorial problem by continuous optimization of (1) on a convex feasible region.

2. Direct combinatorial optimization of S. Due to the potentially exponential runtime of exact methods, heuristics such as greedy search have to be employed for large values of D.
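The following sketch (our illustration, not from the paper) shows variational renormalization under the assumption that some sparse PCA method has already produced a support set S: the sparsity pattern is kept, and the weights are recomputed as the dominant eigenvector of the corresponding submatrix of C.

```python
# Our illustration of variational renormalization: keep only the sparsity
# pattern S of a sparse solution and recompute the weights as the dominant
# eigenvector of the corresponding submatrix of C.
import numpy as np

def variational_renormalization(C, S):
    """C: (D, D) covariance matrix; S: indices of the non-zero elements."""
    sub = C[np.ix_(S, S)]                   # submatrix restricted to S
    _, eigvecs = np.linalg.eigh(sub)
    w = np.zeros(C.shape[0])
    w[S] = eigvecs[:, -1]                   # dominant eigenvector of the submatrix
    return w                                # unit norm, cardinality |S|
```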
Cadima and Jolliffe (1995) proposed thresholding the (D − K) smallest elements of the dominant eigenvector to zero, which has complexity O(D^2). Better results have been achieved by the SPCA algorithm of Zou et al. (2004), which is based on iterative elastic net regression. Combinatorial optimization was introduced by Moghaddam et al. (2006), who derived an exact branch-and-bound method and a greedy algorithm that computes the full sparsity path 1 ≤ K ≤ D in O(D^4). Based on a semi-definite relaxation of the sparse PCA problem, d'Aspremont et al. (2007) proposed PathSPCA, which reduces the complexity of each greedy step to O(D^2), and renders computation of the full regularization path possible in O(D^3). Finally, Sriperumbudur et al. (2007) formulate sparse PCA as a d.c. program (Horst et al., 2000) and provide an iterative algorithm called DC-PCA, where each iteration consists of solving a quadratically constrained QP with complexity O(D^3).

Non-negative (sparse) PCA was proposed by Zass and Shashua (2006). In contrast to the methods discussed so far, their algorithm (called NSPCA) optimizes the cumulative variance of L components jointly, versus a sequential approach that computes one component after another. Orthonormality of the components is enforced by a penalty in the objective function (see section 4 for a discussion of orthogonality for non-negative components), and the desired sparsity is again expressed in terms of the whole set of L components.

Our Contribution. To our knowledge, there is no algorithm either for sparse or non-negative sparse PCA that achieves competitive results in less than O(D^3). In this paper, we propose an O(D^2) algorithm that enforces sparsity, non-negativity or both constraints simultaneously in the same framework, which is rooted in expectation-maximization for a probabilistic generative model of PCA (see next section). As for the combinatorial algorithms, the desired cardinality can be expressed directly as K = |S|, instead of a bound B on the l1 norm of w (which requires searching for the appropriate value). Although computing the full regularization path is also of order O(D^3), our method directly computes a solution for any K in O(D^2), in contrast to forward greedy search, which needs to build up a solution incrementally. As is the case with SPCA, our method works on the data matrix X ∈ R^{N×D} (N is the number of samples), instead of the covariance matrix C. To summarize, the low complexity combined with an efficient treatment of the D ≫ N case enables an application of our method to large data sets of high dimensionality.

Notation. Vectors are indexed as w^(t), and elements of vectors as w_i. ||w||_1 = sum_i |w_i| and ||w||_0 = |S|, where S = {i | w_i ≠ 0}. ||w||_0 is also called the cardinality of w. I is the identity matrix, 0 is a vector of zero elements, and w ≥ 0 means w_i ≥ 0 for all i. x ⊙ y denotes element-wise multiplication of x and y, and tr(X) = sum_i X_ii is the trace of matrix X. E[·] is the expectation operator, and N denotes a Gaussian distribution.

2. EM for Probabilistic PCA

Tipping and Bishop (1999) and independently Roweis (1998) proposed a generative model for PCA, where the full covariance matrix Σ ∈ R^{D×D} of the Gaussian distribution is approximated by its first L eigenvectors (in terms of magnitude of the respective eigenvalues). The latent variable y ∈ R^L (in the principal component subspace) is distributed according to a zero mean, unit covariance Gaussian

    p(y) = N(0, I).    (3)

The observation x ∈ R^D, conditioned on the value of the latent variable y, is linear-Gaussian distributed according to

    p(x | y) = N(W y + µ, σ^2 I),    (4)

where the matrix W ∈ R^{D×L} spans the principal subspace, and µ ∈ R^D is the mean of the data. To simplify the presentation, we will assume centered data from now on.

The EM equations for probabilistic PCA have the following form. The E-step keeps track of

    E[y_n] = M_(t)^{-1} (W^(t))^T x_n    (5)
    E[y_n y_n^T] = σ^2_(t) M_(t)^{-1} + E[y_n] E[y_n]^T,    (6)

where M ∈ R^{L×L} is defined as

    M = W^T W + σ^2 I.    (7)

The M-step equations are

    W^(t+1) = [ sum_{n=1}^{N} x_n E[y_n]^T ] [ sum_{n=1}^{N} E[y_n y_n^T] ]^{-1}    (8)
    σ^2_(t+1) = (1 / (N D)) sum_{n=1}^{N} ( ||x_n||^2 − 2 E[y_n]^T (W^(t+1))^T x_n + tr( E[y_n y_n^T] (W^(t+1))^T W^(t+1) ) ).    (9)

In order to efficiently incorporate constraints into the EM algorithm (see next section), we make three simplifications: take the limit σ^2 → 0, consider a one-dimensional subspace, and normalize ||w^(t)||_2 to unity. The first simplification reduces probabilistic PCA to standard PCA. Computing several components will be treated in section 4, and the unity constraint on ||w^(t)||_2 is easily restored after each EM iteration. The E-step now amounts to

    E[y_n] = (w^(t))^T x_n,    (10)

and the M-step is

    w^(t+1) = sum_{n=1}^{N} x_n E[y_n] / sum_{n=1}^{N} E[y_n]^2.    (11)

These two equations have the following interpretation (Roweis, 1998): the E-step orthogonally projects the data onto the current estimate of the subspace, while the M-step re-estimates the projection to minimize the squared reconstruction error for fixed subspace coordinates. We summarize this result in algorithm 1, which iteratively computes the solution to eq. (1). Because so far only ||w||_2 = 1 is enforced, convergence to the global optimum does not depend on the initial estimate w^(1). This will no longer be the case for additional constraints.

Algorithm 1: Iterative Computation of the First PC
Input: data X ∈ R^{N×D}, initial estimate w^(1), ε
  t ← 1
  repeat
    y ← X w^(t)
    w^(t+1) ← arg min_w sum_{n=1}^{N} ||x_n − y_n w||_2^2
    w^(t+1) ← w^(t+1) / ||w^(t+1)||_2
    t ← t + 1
  until |(w^(t+1))^T w^(t)| > 1 − ε
Output: w*
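A minimal NumPy rendering of Algorithm 1 might look as follows (our sketch, not the authors' reference implementation; it assumes centered data and directly mirrors eqs. (10) and (11)):

```python
# Our NumPy sketch of Algorithm 1: alternate the E-step of eq. (10) and the
# M-step of eq. (11), restoring the unit norm after every iteration.
import numpy as np

def em_first_pc(X, w_init, eps=1e-9, max_iter=1000):
    """X: (N, D) centered data; w_init: (D,) initial estimate."""
    w = w_init / np.linalg.norm(w_init)
    for _ in range(max_iter):
        y = X @ w                           # E-step: y_n = w^T x_n
        w_new = X.T @ y / (y @ y)           # M-step: eq. (11), i.e. f / h
        w_new /= np.linalg.norm(w_new)      # restore ||w||_2 = 1
        if abs(w_new @ w) > 1 - eps:        # convergence test of Algorithm 1
            return w_new
        w = w_new
    return w
```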
3. Constrained PCA

Consider the minimization step in algorithm 1, which can be written as

    w° = arg min_w J(w) := h w^T w − 2 f^T w,    (12)

with h = sum_{n=1}^{N} y_n^2 and f = sum_{n=1}^{N} y_n x_n. Eq. (12) is a quadratic program (QP), and is convex due to the non-negativity of h. Furthermore, because the Hessian is a scaled identity matrix, the problem is also isotropic. The unique global optimum is found by analytical differentiation of the objective function: setting ∂J/∂w = 0 yields

    w° = f / h,    (13)

which of course is identical to eq. (11).

3.1. Sparsity

It is well known (Tibshirani, 1996) that solving a QP under an additional constraint on ||w||_1 favors a sparse solution. This constraint corresponds to restricting the feasible region to an l1 diamond:

    w* = arg min_w  h w^T w − 2 f^T w   s.t. ||w||_1 ≤ B,    (14)

where the upper bound B is chosen such that w* has the desired cardinality. The l1 constrained QP is again convex, and because the objective function is isotropic, w* is the feasible point minimizing the l2 distance to the unconstrained optimum w°.

We derive an efficient and optimal algorithm for eq. (14), where the desired cardinality can be specified directly by the number K of non-zero dimensions. Observe that w* must have the same sign structure as f; therefore we can transform the problem such that both w* and w° come to lie in the non-negative orthant. The algorithm (illustrated in figure 1) approaches w° with axis-aligned steps in the direction of the largest element of the negative gradient

    −∂J(w)/∂w ∝ w° − w,    (15)

until the boundary of the feasible region is hit or the gradient vanishes. Because the elements of w become positive one after another, and their magnitude increases monotonically, B is set implicitly by terminating the gradient descent once the cardinality of the solution vector is K. Finally, the solution is transformed back into the original orthant of w°.

Figure 1. Starting at the origin, w° is approached by axis-aligned steps in the direction of the largest element of the negative gradient. As dimensions enter the solution vector one after another, and the corresponding weights w_i increase monotonically, the bound B is set implicitly by terminating once ||w||_0 = K.

Proposition 3.1. Axis-aligned gradient descent with infinitesimal stepsize terminates at the optimal feasible point w*.

Proof. Optimality is trivial if w° lies within the feasible region, so we consider the case where the l1 constraint is active. The objective function in eq. (14) is equivalent to

    ||w − w°||_2^2 = sum_{d=1}^{D} (w_d − w°_d)^2.    (16)

The gradient descent procedure invests all available coefficient weight B into decreasing the largest term(s) of this sum, which follows from eq. (15). We show the equivalence of w* to the gradient descent solution v by contradiction. Suppose the computation of w* follows a different strategy, so at least one summation term (w_l − w°_l)^2 is larger than max_d (w°_d − v_d)^2. However, subtracting a small amount from w_s (s ≠ l) and adding it to w_l does not change ||w||_1 but decreases the objective, which is a contradiction.

Implementation of axis-aligned gradient descent amounts to sorting the elements of −∂J/∂w in descending order (an O(D log D) operation), and iterating over its first K elements. At each iteration k ∈ {1, ..., K}, the first k elements of w are manipulated, resulting in a complexity of O(K^2) for the whole loop. Algorithm 2 provides a full specification of the method.

Algorithm 2: EM for Sparse PCA
Input: X ∈ R^{N×D}, K ∈ {1, ..., D}, ε
  t ← 1
  w^(t) ← first principal component of X
  repeat
    y ← X w^(t)
    w° ← sum_{n=1}^{N} y_n x_n / sum_{n=1}^{N} y_n^2
    s ← elements |w°_i| sorted in descending order
    π ← indices of the sorting order
    w^(t+1) ← 0
    for k = 1 to K do
      add (s_k − s_{k+1}) to elements 1, ..., k of w^(t+1)
    end for
    permute the elements of w^(t+1) according to π^{-1}
    w^(t+1) ← w^(t+1) ⊙ sign(w°) / ||w^(t+1)||_2
    t ← t + 1
  until |(w^(t+1))^T w^(t)| > 1 − ε
Output: w*

Because EM is a local optimizer, the initial direction w^(1) must be chosen carefully to achieve good results. For sparse PCA, initialization with the unconstrained first principal component gave the best results (see section 5). Initialization is therefore the most expensive operation of the algorithm, with its O(D^2) complexity. For the D ≫ N case, it can be reduced to O(N^2) by working with X X^T instead of X^T X. As initialization is independent of K, w^(1) can be cached and re-used when varying the sparsity parameter. The number of EM iterations t until convergence also depends on D and K, but our experiments (see section 5) suggest that the dependence is weak and sub-linear. On average, t < 10 iterations were sufficient to achieve convergence.
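The inner loop of Algorithm 2 admits a compact closed form: the telescoping additions leave the i-th largest magnitude at s_i − s_{K+1}, i.e. the K largest magnitudes of w° shrunk by the (K+1)-largest one. The following sketch (ours, under that reading of the pseudocode) implements this sparse projection step:

```python
# Our reading of Algorithm 2's inner loop as a closed form: keep the K largest
# magnitudes of w°, shrink them by the (K+1)-largest magnitude, restore the
# signs and renormalize. This sets the bound B of eq. (14) implicitly so that
# the result has cardinality at most K.
import numpy as np

def sparse_m_step(w_bar, K):
    """w_bar: (D,) unconstrained optimum w° = f / h; K: desired cardinality."""
    mag = np.abs(w_bar)
    order = np.argsort(mag)[::-1]               # O(D log D) descending sort
    threshold = mag[order[K]] if K < mag.size else 0.0
    keep = order[:K]
    w = np.zeros_like(w_bar)
    w[keep] = (mag[keep] - threshold) * np.sign(w_bar[keep])
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w          # restore ||w||_2 = 1
```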
3.2. Non-Negativity

Enforcing non-negativity is achieved in the same way as sparsity. Here, the feasible region is constrained to the non-negative orthant, which is again a convex domain:

    w* = arg min_w  h w^T w − 2 f^T w   s.t. w ≥ 0.    (17)

Eq. (17) implies that choosing w_i = 0 for f_i < 0 is optimal. The non-negativity constraint can then be dropped, and optimization for the other elements of w proceeds as before.

The first PC is invariant to a change of sign. However, this symmetry is broken if the non-negativity constraint is enforced. As an extreme example, non-negative EM fails if the initial projection w^(1) is a dominant eigenvector that consists only of non-positive elements: the minimum of eq. (17) is then the zero vector. But changing the sign of w^(1) implies that the non-negativity constraint becomes inactive, and the algorithm terminates immediately with the optimal solution. We choose to initialize EM for non-negative PCA with a random unit vector in the non-negative orthant, which exploits the benefit of random restarts.

For non-negative sparse PCA, the feasible region is defined as the intersection of the non-negative orthant and the l1 diamond. As the intersection of two convex sets is again convex, the combined constraints can be treated in the same framework. We establish convergence of our method in the following proposition:

Proposition 3.2. EM for sparse and non-negative PCA converges to a local minimum of the l2 reconstruction error.

Proof. Given a feasible w^(t) (either by proper initialization or after one EM iteration), both the E-step and the M-step never increase the l2 reconstruction error. The orthogonal projection y = X w in the E-step is the l2 optimal choice of subspace coordinates for a given w. Error minimization w.r.t. w in the M-step either recovers w^(t), as it is feasible, or provides an improved w^(t+1).

4. Several Components

A full eigendecomposition of the covariance matrix C provides all r PCs, where r is the rank of C. Sorted in descending order of eigenvalue magnitude, each eigenvector maximizes the variance of the projected data, under the constraint that it is orthogonal to all other components considered so far. For sparse PCA, we compute more than one component by means of iterative deflation: having identified the first component w^(1), project the data onto its orthogonal subspace using

    P = I − w^(1) (w^(1))^T,    (18)

re-run EM to identify w^(2), and so on. Although deflation suffers from numerical errors that accumulate over each iteration, this inaccuracy is not a serious problem as long as the desired number of components L is small compared to r (which is true in many applications of PCA).

Desiring non-negativity and orthogonality implies that each feature can be part of at most one component:

    w_i^(l) > 0  ⟹  w_i^(m) = 0    (19)

for m ≠ l, i.e. the sparsity patterns have to be disjoint: S_l ∩ S_m = ∅ for l ≠ m, where S_l = {i | w_i^(l) > 0}. This constraint might be too strong for some applications, where it can be relaxed to require a minimum angle between components. This quasi-orthogonality is enforced by adding a quadratic penalty term

    c w^T V V^T w    (20)

to eq. (17), where V = [w^(1) w^(2) ... w^(l−1)] contains the previously identified components as columns, and c is a tuning parameter. Because V V^T is also positive semi-definite, the QP remains convex, but the Hessian is no longer isotropic. We have used the standard Matlab QP solver, but special algorithms for this case exist in the literature (Sha et al., 2007).
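The following sketch (ours; `sparse_m_step` refers to the sketch after section 3.1, and the function names are illustrative, not the authors' API) shows how the constraints can be combined in the M-step, and how the deflation of eq. (18) is applied to the data:

```python
# Our sketch of the constrained M-step and of the deflation step of eq. (18).
import numpy as np

def constrained_m_step(f, h, K=None, nonneg=False):
    """f, h as in eq. (12); K: optional cardinality; nonneg: enforce w >= 0."""
    w_bar = f / h                               # unconstrained optimum, eq. (13)
    if nonneg:
        w_bar = np.maximum(w_bar, 0.0)          # w_i = 0 wherever f_i < 0 is optimal
    if K is not None:
        return sparse_m_step(w_bar, K)          # l1 projection of section 3.1
    norm = np.linalg.norm(w_bar)
    return w_bar / norm if norm > 0 else w_bar

def deflate(X, w):
    """Project the data onto the orthogonal complement of component w."""
    return X - np.outer(X @ w, w)               # X (I - w w^T), cf. eq. (18)
```

For several components one would alternate EM rounds and `deflate(X, w)` calls, collecting one component per round; the quasi-orthogonality penalty of eq. (20) instead requires a general QP solver, since the Hessian is no longer isotropic.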
5. Experimental Results

We report the performance and efficiency of our method in comparison to three algorithms: SPCA² and PathSPCA³ for cardinality constrained PCA, and NSPCA⁴ for non-negative sparse PCA. SPCA was chosen because it has conceptual similarities to our algorithm: both are iterative methods that solve an l1 constrained convex program, and both use the data matrix instead of the covariance matrix. PathSPCA was chosen because it is (to our knowledge) the most efficient combinatorial algorithm. We are not aware of any other non-negative PCA algorithm besides NSPCA.

² We use the Matlab implementation of SPCA by Karl Sjöstrand, available at http://www2.imm.dtu.dk/~kas/software/spca/index.html.
³ Available from the authors at http://www.princeton.edu/~aspremon/PathSPCA.htm.
⁴ Available from the authors at http://www.cs.huji.ac.il/~zass/.

The data sets considered in the evaluation are the following:

1. CBCL face images (Sung, 1996): 2429 gray scale images of size 19 × 19 pixels, which have been used in the evaluation of (Zass & Shashua, 2006).

2. Leukemia data (Armstrong et al., 2002): expression profiles of 12582 genes from 72 patients. Similar data sets have been used in the evaluation of (Zou et al., 2004) and (d'Aspremont et al., 2007).

The two data sets cover the N > D and the D ≫ N case and are large enough such that differences in computational complexity can be established with confidence. Both were standardized such that each dimension has zero mean and unit variance.

5.1. Sparse PCA

Figure 2 (left) plots the explained variance versus cardinality for SPCA, PathSPCA and our algorithm (called emPCA) on the face image data set. Variational renormalization is necessary for SPCA and emPCA to close the performance gap to PathSPCA, which computes optimal weights for a specific sparsity pattern by construction. Figure 2 (middle) shows analogous results for the gene expression data. As a reference, we have also plotted results for simple thresholding (after renormalization).

Figure 2. Left: Variance versus cardinality trade-off curves for the face image data. "opt" subscripts denote the variance after recomputing optimal weights for a given sparsity pattern (which is not necessary for PathSPCA). Middle: Variance versus cardinality trade-off curves for the gene expression data. The performance of simple thresholding was included for reference. Right: Running times of Matlab implementations on the gene expression data, which include renormalization for SPCA and emPCA.

To complement the theoretical analysis of computational complexity, we have also measured running times of reference Matlab implementations, provided by their respective authors (SPCA is a re-implementation of the author's R code in Matlab). CPU time was measured using Matlab's tic and toc timer constructs, running on an Intel Core 2 Duo processor at 2.2 GHz with 3 GB of RAM. Our focus is not to report absolute numbers, but rather to demonstrate the dependency on the choice of K. Figure 2 (right) plots the running times versus cardinality on the gene expression data. The PathSPCA curve is well explained by the incremental forward greedy search. SPCA is harder to analyze, due to its active set optimization scheme: at each iteration of the algorithm, active features are re-examined and possibly excluded, but might be added again later on. emPCA is only marginally affected by the choice of K, but shows an increased number of EM iterations for 10 ≤ K ≤ 25, which was observed on other data sets as well.

5.2. Non-Negative PCA

The impact of the non-negativity constraint on the explained variance depends on the sign structure of w*. Because the first principal component of the face image data happens to lie in the non-negative orthant, we projected the data onto its orthogonal subspace such that the constraint becomes active. Figure 3 (left) shows the variance versus cardinality trade-off curves for non-negative sparse PCA. For NSPCA, the sparsity penalty was determined for each K using bisection search, which was aborted when the relative length of the parameter search interval was below a threshold of 10^-5. Both the variance achieved and the number of cardinalities for which a solution was found strongly depend on the value of α, which corresponds to a unit norm penalty (for the case of a single component). For smaller values of α the performance of NSPCA is comparable to emPCA, but only solutions close to the full cardinality are found. Increasing the magnitude of α makes it possible to sweep the whole cardinality path, but the performance degrades. Because both algorithms are initialized randomly, we chose the best result after ten restarts. Running times for both methods showed no strong dependency on K. Average times for K ∈ {1, ..., 100} were 0.4s for emPCA (0.15s standard deviation) and 24s for NSPCA (14.7s standard deviation).

We already motivated in section 4 that requiring orthogonality between several non-negative components can be restrictive. If the first PC happens to lie in the non-negative orthant, the constraints have to be modified such that more than one component can satisfy them. We have explored the following two strategies:

1. Enforcing orthogonality, but constraining the cardinality of each component.

2. Relaxing the orthogonality constraint, by enforcing a minimum angle between components instead.
There is a methodological difficulty in comparing the performance of NSPCA and emPCA. The former maximizes the cumulative variance of all components jointly, while our algorithm computes them sequentially, maximizing the variance under the constraint that subsequent components are orthogonal to previous ones (see section 4). We therefore expect emPCA to capture more variance in the first components, while NSPCA is expected to capture a larger cumulative variance.

Figure 3 (middle) shows the results of applying the first strategy to the face image data. The NSPCA sparsity penalty was tuned to achieve a joint cardinality of 200 for all components. For emPCA we distributed the active features evenly among the components by setting K = 20 for all of them. As in figure 3 (left), emPCA captures significantly more of the variance, suggesting that the way NSPCA incorporates sparsity seriously degrades performance. This observation was confirmed for various values of K and L. Finally, figure 3 (right) reports results for the second strategy, where a minimum angle of 85 degrees was enforced between components. Here, the complementary objectives of NSPCA and emPCA match our prior expectations. Again, various values for L and the minimum angle lead to essentially the same behavior.

Figure 3. Left: Variance versus cardinality trade-off curves for non-negative sparse PCA methods on the face image data. For NSPCA, the sparsity penalty was determined using bisection search (see text). Values indicate the better result after ten random restarts. Middle: Cumulative variance versus number of orthogonal components. For NSPCA, α was tuned to achieve a joint cardinality of 200 for all components. For emPCA, we set K = 20 for every component. emPCA (without non-negativity constraints) is plotted for reference. Right: Cumulative variance versus number of quasi-orthogonal components. A minimum angle of 85 degrees was enforced between components.

5.3. Unsupervised Gene Selection

We apply emPCA to select a subset of genes of the leukemia data, and measure the subset relevance by following the evaluation methodology of Varshavsky et al. (2006). For each gene subset, we cluster the data using k-means (k = 3), and compare the cluster assignments to the true labeling of the data, which differentiates between three types of leukemia (ALL, AML and MLL). Agreement is measured using Jaccard scores (Varshavsky et al., 2006), where a value of one signifies perfect correspondence between cluster assignment and label. We compare emPCA to simple ranking of the CE criterion as proposed by the authors, which has shown competitive performance compared to other popular gene selection methods. Figure 4 shows that selecting 70 genes according to the first non-negative sparse PC results in a significantly better Jaccard score than a clustering of the full data set.

Figure 4. Mean and standard deviation of the Jaccard scores after subset selection and k-means clustering (k = 3), averaged over 100 random initializations of the centroids (see text). A small amount of jitter has been added to better distinguish the error bars.
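A possible reconstruction of this evaluation protocol is sketched below (ours, not the authors' code; it uses scikit-learn's KMeans, assumes the gene indices have already been selected from the first non-negative sparse PC, and uses the standard pairwise Jaccard definition, which may differ in detail from the score used by Varshavsky et al. (2006)):

```python
# Our reconstruction of the evaluation of section 5.3: cluster the samples on
# a selected gene subset with k-means and compare to the true labels with a
# pairwise Jaccard score.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def pairwise_jaccard(labels_true, labels_pred):
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        n11 += same_true and same_pred          # pair together in both partitions
        n10 += same_true and not same_pred
        n01 += same_pred and not same_true
    return n11 / (n11 + n10 + n01)

def evaluate_gene_subset(X, labels_true, gene_idx, k=3, seed=0):
    """Cluster the selected genes with k-means and score against the labels."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    pred = km.fit_predict(X[:, gene_idx])
    return pairwise_jaccard(labels_true, pred)
```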
6. Conclusions

We have presented a novel algorithm for constrained principal component analysis, based on expectation-maximization for probabilistic PCA. Our method is applicable to a broad range of problems: it includes sparsity, non-negativity or both kinds of constraints, it has an efficient formulation for both the N > D and the D ≫ N type of data, and it enforces either strict or quasi-orthogonality between successive components. The desired sparsity is specified directly as the number of non-zero elements, instead of a bound on the l1 norm of the vector. We have demonstrated on popular data sets from biology and computer vision that our method achieves competitive results for sparse problems, and that it shows significant improvements for non-negative sparse problems. Its unmatched computational efficiency enables a constrained principal component analysis of substantially larger data sets, with lower requirements on available computation time.

Although our algorithm is rooted in expectation-maximization for a generative model of PCA, the constraints are added at the optimization stage. In the future, we will study how to include them in the model itself, which would enable a Bayesian analysis and a data-driven determination of the proper sparsity and number of components. Secondly, we intend to examine whether our algorithm can be extended to the related problem of constrained linear discriminant analysis. A Matlab implementation of emPCA is available at http://www.inf.ethz.ch/personal/chrsigg/icml2008.

Acknowledgements

We thank Wolfgang Einhäuser-Treyer, Peter Orbanz and the anonymous reviewers for their valuable comments on the manuscript. This work was in part funded by CTI grant 8539.2;2 ESPP-ES.

References

Armstrong, S., Staunton, J., Silverman, L., Pieters, R., den Boer, M., Minden, M., Sallan, S., Lander, E., Golub, T., & Korsmeyer, S. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30, 41-47.

Cadima, J., & Jolliffe, I. (1995). Loadings and correlations in the interpretation of principal components. Applied Statistics, 203-214.

d'Aspremont, A., Bach, F., & El Ghaoui, L. (2007). Full regularization path for sparse principal component analysis. Proceedings of the International Conference on Machine Learning.

Horst, R., Pardalos, P., & Thoai, N. (2000). Introduction to Global Optimization. Kluwer Academic Publishers.

Moghaddam, B., Weiss, Y., & Avidan, S. (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems.

Roweis, S. (1998). EM algorithms for PCA and sensible PCA. Advances in Neural Information Processing Systems.

Sha, F., Lin, Y., Saul, L., & Lee, D. (2007). Multiplicative updates for nonnegative quadratic programming. Neural Computation, 19, 2004-2031.

Sriperumbudur, B., Torres, D., & Lanckriet, G. (2007). Sparse eigen methods by d.c. programming. Proceedings of the International Conference on Machine Learning.

Sung, K.-K. (1996). Learning and example selection for object and pattern recognition. Doctoral dissertation, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tipping, M., & Bishop, C. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61, 611-622.

Varshavsky, R., Gottlieb, A., Linial, M., & Horn, D. (2006). Novel unsupervised feature filtering of biological data. Bioinformatics, 22.

Zass, R., & Shashua, A. (2006). Nonnegative sparse PCA. Advances in Neural Information Processing Systems.

Zou, H., Hastie, T., & Tibshirani, R. (2004). Sparse principal component analysis. Journal of Computational and Graphical Statistics.