Supervised Exponential Family Principal Component Analysis via Convex Optimization Yuhong Guo yuhongguo.cs@gmail.com Abstract Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensi onality reduction. A training algorithm is then devised based on a subgradient bu ndle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empiri cal results over both synthetic and real data. 1 Introduction Principal component analysis (PCA) has been extensively used for data analysis and processing. It provides a closed-form solution for linear unsupervised dimensionality reduction through singular value decomposition (SVD) on the data matrix [8]. Probab ilistic interpretations of PCA have also been provided in [9, 16], which formulate PCA using a latent variable model with Gaussian distributions. To generalize PCA to better suit non-Gaussian data, many extensions to PCA have been proposed that relax the assumption of a Gaussian data distribution. Exponential family PCA is the most prominent example, where the underlying dimensi onality reduction principle of PCA is extended to the general exponential family [4, 7, 13]. Previous work has shown that improved quality of dimensionality reduction can be obtained by using exponential family models appropriate for the data at hand [4, 13]. Given data from a non-Gaussian distribution these techniques are better able than PCA to capture the intrinsic low dimensiona l structure. However, most existing non-Gaussian dimensionality reduction methods rely on iterative local optimization procedures and thus suffer from local optima, with the sole exception of [7] which shows a general convex form can be obtained for dimensionality reduction with exponential family models. Recently, supervised dimensionality reduction has begun to receive increased attention. As the goal of dimensionality reduction is to identify the intrinsic structure of a data set in a low dimensional space, there are many reasons why supervised dimensionalit y reduction is a meaningful topic to study. First, data labels are almost always assigned based on some important intrinsic property of the data. Such information should be helpful to suppress noise and capture the most useful aspects of a compact representation of the data. Moreover, there are many high dimensional data sets with label information available, e.g., face and digit images, and it is unwise to ignore them. A few supervised dimensionality reduction methods based on exponential family models have been proposed in the literature. For example, a supervised probabilistic PCA (SPPCA) model was proposed in [19]. SPPCA extends probabilistic PCA by assuming that both features and labels have Gaussian distributions and are generated independently from the lat ent low dimensional space through linear 1 transformations. The model is learned by maximizing the marginal likelihood of the observed data using an alternating EM procedure. A more general supervise d dimensionality reduction approach with generalized linear models (SDR GLM) was proposed in [12]. SDR GLM views both features and labels as exponential family random variables and optimizes a weighted linear combination of their conditional likelihood given latent low dimensional variables using an alternating EM-style procedure with closed-form update rules. SDR GLM is able to deal with different data types by using different exponential family models. Similar to SDR GLM, the linear supervised dimensionality reduction method proposed in [14] also takes advantage of exponential family models to deal with different data types. However, it optimizes the conditional likelihood of labels given observed features within a mixture model framework using an EM-style optimization procedure. Beyond the PCA framework, many other supervised dimensionality reduc tion methods have been proposed in the literature. Linear (fisher) discriminant analysis (LDA ) is a popular alternative [5], which maximizes between-class variance and minimizes within-class variance. Moreover, a kernelized fisher discriminant analysis (KDA) has been studied in [10]. Another notable nonlinear supervised dimensionality reduction approach is the colored maximum varian ce unfolding (MVU) approach proposed in [15], which maximizes the variance aligning with the side information (e.g., label information), while preserving the local distance structures from the dat a. However, colored MVU has only been evaluated on training data. In this paper, we propose a novel supervised exponential fam ily PCA model (SEPCA). In the SEPCA model, observed data x and its label y are assumed to be generated from the latent variables z via conditional exponential family models; dimensionality reduction is conducted by optimizing the conditional likelihood of the observations (x, y ). By exploiting convex duality of the sub-problems and eigenvector properties, a solvable convex formulation of the problem can be derived that preserves solution equivalence to the original. This convex formulation allows efficient global optimization algorithms to be devised. Moreover, by introduc ing a sample-based approximation to exponential family models, SEPCA does not suffer from the limitations of implicit Gaussian assumptions and is able to be conveniently kernelized to achieve nonlinearity. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained through a coordinate descent procedure. Finally, we present a simple formulation to project new testing data into the embedded space. This projection can be used for other supervised dimensionality reduction approach as well. Our experimental results over both synthe tic and real data suggest that a more global, principled probabilistic approach, SEPCA, is better able to capture subtle structure in the data, particularly when good label information is present. The remainder of this paper is organized as follows. First, i n Section 2 we present the proposed supervised exponential family PCA model and formulate a convex nondifferentiable optimization problem. Then, an efficient global optimization algorithm is presented in Section 3. In Section 4, we present a simple projection method for new testing points. We then present the experimental results in Section 5. Finally, in Section 6 we conclude the paper. 2 Supervised Exponential Family PCA We assume we are given a t × n data matrix, X , consisting of t observations of n-dimensional feature vectors, Xi: , and a t × k indicator matrix, Y , with each row to indicate the class label for each k observation Xi: ; thus j =1 Yij = 1. For simplicity, we assume features in X are centered; that is, their empirical means are zeros. We aim to recover a d-dimensional re-representation, a t × d matrix Z , of the data (d < n). This is typically viewed as discovering a latent low dimensional manifold in the high dimensional feature space. Since the label information Y is exploited in the discovery process, this is called supervised dimensionality reducti on. For recovering Z , a key restriction that one would like to enforce is that the features used for coding , Z:j , should be linearly independent; that is, one would like to enforce the constraint Z Z = I , which ensures that the codes are expressed by orthogonal features in the low dimensional representati on. Given the above setup, in this paper, we are attempting to address the problem of supervised dimensionality reduction using a probabilistic latent variable model. Our intuition is that the important intrinsic structure (underlying feature representation) of the data should be able to accurately generate/predict the original data features and labels. 2 In this section, we formulate the low-dimensional principal component discovering problem as a conditional likelihood maximization problem based on exponential family model representations, which can be reformulated into an equivalent nondifferenti able convex optimization problem. We then exploit a sample-based approximation to unify exponential family models for different data types. 2.1 Convex Formulation of Supervised Exponential Family PCA As with the generalized exponential family PCA [4], we attempt to find low-dimensional representation by maximizing the conditional likelihood of the observation matrix X and Y given the latent matrix Z , log P (X, Y |Z ) = log P (X |Z ) + log P (Y |Z ). Using the general exponential family representation, a regularized version of this maximizatio n problem can be formulated as = Z :Z Z =I max W,,b max log P (X |Z, W ) - Z W X Y Z :Z Z =I max W,,b max tr +tr w Z W + t + bb r log P (Y |Z, , b) - tr W 2 2 -i ( W (A(Zi: , W ) - log P0 (Xi: )) - tr 1) W 2 i + t + bb r A(Zi: , , b) - 1 Yb- 2 here W is a d × n parameter matrix for conditional model P (X |Z ); is a d × k parameter matrix for conditional model P (Y |Z ) and b is a k × 1 bias vector; 1 denotes the vector of all 1s; A(Zi: , W ) and A(Zi: , , b) are the log normalization functions to ensure valid probabi lity distributions: e A(Zi: , W ) = log xp (Zi: W x) P0 (x) dx . (2 ) A(Zi: , , b) = log k ex p Z i: 1 + 1 b =1 ( 3) where 1 denotes a zero vector with a single 1 in the th entry. Note that the class variable y is discrete, thus maximizing log P (Y |Z, , b) is a discriminative classification training. In fact, the second part of the objective function in (1) is simply a multi-class logistic regression. That is why we have incorporated an add itional bias term b into the model. Theorem 1 The optimization problem (1) is equivalent to +i ( 1 (A (Uix ) + log P0 (Xi: )) + tr X - U x )(X - U x ) M miny max : x ,U U 2 M :I M 0, tr(M )=d i ( ( 1 A (Uiy ) + tr Y - U y )(Y - U y ) (M + E ) 4) : 2 where E is a t × t matrix with all 1s; U x is a t × n matrix; U y is a t × k matrix; A (Uix ) and : A (Uiy ) are the Fenchel conjugates of A(Zi: , W ) and A(Zi: , , b) respectively; M = Z Z and Z : can be recovered by taking the top d eigenvectors of M ; and the model parameters W, , b can be recovered by W= 1 Z (X - U x ), = 1 Z (Y - U y ), b= 1 (Y - U y ) 1 Proof: The proof is simple and based on standard results. Due to spac e limitation, we only provide a summarization of the key steps here. There are three steps. The first step is to derive the Fenchel conjugate dual for each log partition function, A(Z, .), following [18, Section 3.3.3]; which can be used to yield Z :Z Z =I U x ,U y max min g (Z, U x , U y ) 3 (5 ) that is equivalent to the original problem (1), where +i ( 1 tr X - U x )(X - U x ) Z Z (A (Uix ) + log P0 (Xi: )) + g (Z, U x , U y ) = : 2 i T ( 1 A (Uiy ) + tr Y - U y )(Y - U y ) (Z Z + E ) : 2 he second step is based on exploiting the strong min-max property [2], which allows one to further establish the equivalence of (5) to U x ,U y Z :Z Z =I min max g (Z, U x , U y ) A final substitution of Z Z with matrix M is based on the result of [11]. Note that (4) is a min-max optimization problem. Moreover, f or each fixed M , the outer minimization problem is obviously convex, since the Fenchel conjugates, A (Uix ) and A (Uiy ), are convex : : functions of U x and U y respectively [2]; that is, the objective function for the outer minimization is a pointwise supremum over an infinite set of convex functions. Thus the overall min-max optimization is convex [3], but apparently not necessarily differentiab le. We will address the nondifferentiable training issue in Section 3. 2.2 Sample-based Approximation In the previous section, we have formulated our supervised exponential family PCA as a convex optimization problem (4). However, before attempting to devise a training algorithm to solve it, we have to provide some concrete forms for the Fenchel conjugat e functions A (Uix ) and A (Uiy ). For : : different exponential family models, the Fenchel conjugate functions A are different; see [18, Table 2]. For example, since the y variable in our model is a discrete class variable, it takes a multinomial distribution. The Fenchel conjugate of (3) has been derived in [6]: , y y where yj = P (Yij = 1|Zi: , , b) (6) A (Uiy ) = A (y: ) = tr i : i i: log i: The specific exponential family model is determined by the data type and distribution. PCA and SPPCA use Gaussian models, thus their performances might be degraded when the data distribution is non-Gaussian. However, it is tedious and sometimes hard to choose the most appropriate exponential family model to use for each specific application problem. Moreover, the log normalization function A and its Fenchel conjugate A might not be easily computable. For these reasons, we propose to use a sample-based approximation to the integral (2) and achieve an empirical approximation to the true underlying exponential family model as jfollowZ If one replaces the integral definition (2) s. / exp i: W Xj : t, then the conjugate function with an empirical definition, A(Zi: , W ) = log can be given by , (7 ) A (Uix ) = A (x: ) = tr x: log x: where xj = P (Xj : |Zi: , W ) i : i i i With this sample-based approximation, problem (4) can be expressed as ( 1 min max tr (x log x ) + tr I - x )K (I - x ) M x ,y M :I M 0, tr(M )=d 2 s ( 1 + tr (y log y ) + tr Y - y )(Y - y ) (M + E ) 2 ubject to x 0 , x 1 = 1; y 0 , y 1 = 1 ( 8) (9 ) One benefit of working with this sample-based approximation is that it is automatically kernelized, K = X X , to enable non-linearity to be conveniently introduced. 3 Efficient Global Optimization The optimization (8) we derived in the previous section is a convex-concave min-max optimization problem. The inner maximization of (8) is a well known proble m with a closed-form solution [11]: 4 ( , M = Z Z and Z = Qd ax I - x )K (I - x ) + (Y - y )(Y - y ) where Qd ax (D) m m denotes the matrix formed by the top d eigenvectors of D. However, the overall outer minimization problem is nondifferentiable with respect to x and y . Thus the standard first-order or secondorder optimization techniques that rely on the standard gra dients can not be applied here. In this section, we deploy a bundle method to solve this nondifferentiable min-max optimization. 3.1 Bundle Method for Min-Max Optimization The bundle method is an efficient subgradient method for nondifferentiable convex optimization; it relies on the computation of subgradient terms of the object ive function. A vector g is a subgradient of function f at point x, if f (y) f (x) + g (y - x), y. To adapt standard bundle methods to our specific min-max problem, we need to first address the critica l issue of subgradient computation. Proposition 1 Consider a joint function h(x, y) defined over x X and y Y , satisfying: (1) h(·, y) is convex for all y Y ; (2) h(x, ·) is concave for all x X . Let f (x) = maxy h(x, y), and q (x0 ) = arg maxy h(x0 , y). Assume that g is a gradient of h(·, q (x0 )) at x = x0 , then g is a subgradient of f (x) at x = x0 . Proof: f (x) = max h(x, y) h(x, q (x0 )) y h(x0 , q (x0 )) + g (x - x0 ) (since h(·, y) is convex for all y Y ) = f (x0 ) + g (x - x0 ) (by the definitions of f (x) and q (x0 )) Thus g is a subgradient of f (x) at x = x0 according to the definition of subgradient. According to Proposition 1, the subgradients of our outer mi nimization objective function f in (8) over x and y can be given by l l , ( 1 1 x f og x + 1 - M (I - x )K y f og y + 1 - M (Y - y ) 10) where M is the optimal inner maximization solution at the current po int [x , y ]. Algorithm 1 illustrates the bundle method we developed to solve the infinite min-max optimization (8), where the linear constraints (9) over x and y can be conveniently incorporated into the quadratic bound optimization. One important issue in this algorithm is how to manage the size of the linear lower bound constraints formed from the active set B (defined in Algorithm 1), as it incrementally increases with new points being explored. To solve this problem, we noticed the Lagrangian dual parameters for the lower bound constraints obtained by the quadratic op timization in step 1 is a sparse vector, indicating that many lower bound constra ints can be turned off. Moreover, any constraint that is turned off will mostly stay off in the later steps. Therefore, for the bundle method we developed, whenever the size of B is larger than a given constant b, we will keep the active points of B that correspond to the first b largest values, and drop the remaining ones. 3.2 Coordinate Descent Procedure An important factor affecting the running efficiency is the size of the problem. The convex optimization (8) works in the dual parameter space, where the siz e of the parameters = {x , y }, t × (t + k ), depends only on the number of training samples, t, not on the feature size, n. For high dimensional small data sets (n t), our dual optimization is certainly a good option. However, with the increase of t, our problem size will increase in an order of O(t2 ). It might soon become too large to handle for the quadratic optimization step of the bu ndle method. On the other hand, the optimization problem (8) possesses a nice semi-decomposable structure: one equality constraint in (9) involves only one row of the ; that is, the can be separated into rows without affecting the equality constraints. Based on this observation, we develop a coordinate descent procedure to obtain scalability of the bundle method over large data sets. Specifically, we put an outer loop above the bundle method. Within each of this outer loop iteration, we randomly separate the parameters into m groups, with each group containing a subset rows of ; and we then use bundle method to sequentially optimize each subproblem defined on one group of parameters while keeping the remaining rows of fixed. Although coordinate descent with a nondifferentiable convex objective is not guaranteed to converge to a minimum in general [17], we have found that this procedure performs quite well in practi ce, as shown in the experimental results. 5 Algorithm 1 Bundle Method for Min-Max Optimization in (8) ¯ Input: > 0, m (0, 1), b I , µ I N R Initial: Find an initial point satisfying the linear constraints in (9); compute f ( ). Let = 1, = , compute g f by (10); e = f ( ) - f ( ) - g ( - ). ^ Let B = {(e , g )}, = Inf, g = 0; = + 1. ^ repeat ^ 1. Solve quadratic minimization for solution , and Lagrangian dual parameters w.r.t. the lower bound linear constraints in B [1]: µ ^ = arg min () + - 2, subject to the linear constraints in (9) 2 2 - ^ where () = f ( ) + max + g ( - ), imax {-ei + gi ( - )} ^^ i (e , g ) B ^ ^ . Define = f ( ) - [ () + µ - 2 0. If < , return. 2^ ^ 3. Conduct line search to minimize f ( ) with = + (1 - ), for 0 < < 1. 4. Compute g f by (10); e = f ( ) - f ( ) - g ( - ); update B = B {(e , g )}. 5. If f ( ) - f ( ) m , then take a serious step: (1) update: ei = ei + f ( ) - f (i) + gi ( - i ); ^ i ei ; i gi , = ^ (2) update the aggregation: g = (3) update the stored solution: = , f ( ) = f ( ). 6. If |B| > b, reduce B set according to . 7. = + 1. until maximum iteration number is reached 4 Projection for Testing Data One important issue for supervised dimensionality reducti on is to map new testing data into the dimensionality-reduced principal dimensions. We deploy a simple procedure for this purpose. After training, we obtain a low-dimensional representation Z for X , where Z can be viewed as a linear projection of X in some transformed space (X ) through a parameter matrix U; such that Z = (X )U = (X ) (X ) K + (X )U , where K + denotes the pseudo inverse of K = (X ) (X ) . Then a new testing sample x can be projected by z = (x ) (X ) K + (X )U = k (x , X )K + Z (1 1 ) 5 Experimental Results In order to evaluate the performance of the proposed supervi sed exponential family PCA (SEPCA) approach, we conducted experiments over both synthetic and real data, and compared to supervised dimensionality reduction with generalized linear models ( SDR GLM), supervised probabilistic PCA (SPPCA), linear discriminant analysis (LDA), and colored maximum variance unfolding (MVU). The projection procedure (11) is used for colored MVU as well . In all the experiments, we used µ = 1 for Algorithm 1, and used = 0.0001 for SDR GLM as suggested in [12]. 5.1 Experiments on Synthetic Data Two synthetic experiments were conducted to compare the five approaches under controlled conditions. The first synthetic data set is formed by first genera ting four Gaussian clusters in a twodimensional space, with each corresponding to one class, and then adding the third dimension to each point by uniformly sampling from a fixed interval. This experiment attempts to compare the performance of the five approaches in the situation where the data distribution does not satisfy the Gaussian assumption. Figure 1 shows the projection results for each approach in a two dimensional space for 120 testing points after being trained on a set with 80 points. In this case, SEPCA and LDA outperform all the other three approaches. The second synthetic experiment is designed to test the capability of performing nonlinear dimensionality reduction. The synthetic data is formed by first ge nerating two circles in a two dimensional space (one circle is located inside the other one), with each circle corresponding to one class, and 6 SEPCA 0.25 60 0.2 50 SDR-GLM 1 SPPCA 2 1.5 LDA 4 Colored-MVU 2 0.15 40 0.5 1 0.1 30 0.05 20 0 10 -0.05 0 -0.1 -10 -0.5 0 0.5 0 0 -2 -0.5 -1 -4 -1.5 -6 -0.15 -1 -2 -0.2 -20 -2.5 -8 -7 -6 -5 -4 -3 -2 -1 0 1 x 10 2 -7 -0.25 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 -30 -30 -20 -10 0 10 20 30 -1.5 -1 -0.5 0 0.5 1 1.5 -3 -4 -3 -2 -1 0 1 2 Figure 1: Projection results on test data for synthetic expe riment 1. Each color indicates one class. SEPCA 0.15 0.01 SPPCA 0.02 KDA 2.5 2 Colored-MVU 0.1 0.005 0.01 1.5 0.05 0 0 1 0.5 0 -0.005 -0.01 0 -0.02 -0.5 -0.01 -0.05 -0.1 -0.015 -0.03 -1 -1.5 -0.04 -2 -0.15 -0.2 -0.25 -0.02 -20 -15 -10 -5 0 x 10 5 -3 -0.2 -0.15 -0.1 -0.05 0 0.05 0.1 0.15 -0.05 -0.35 -0.3 -0.25 -0.2 -0.15 -0.1 -0.05 0 -2.5 5.85 5.9 5.95 6 6.05 6.1 6.15 6.2 6.25 6.3 6.35 Figure 2: Projection results on test data for synthetic expe riment 2. Each color indicates one class. then the third dimension sampled uniformly from a fixed inter val. As SDR GLM does not provide a nonlinear form, we conducted the experiment with only the r emaining four approaches. For LDA, we used its kernel variant, KDA. A Gaussian kernel with = 1 was used for SEPCA, SPPCA and KDA. Figure 2 shows the projection results for each approach in a two dimensional space for 120 testing points after being trained on a set with 95 points. Again, SEPCA and KDA achieve good class separations and outperform the other two approaches. 5.2 Experiments on Real Data To better characterize the performance of dimensionality r eduction in a supervised manner, we conducted some experiments on a few high dimensional multi-class real world data sets. The left side of Table 1 provides the information about these data sets. Our experiments were conducted in the following way. We randomly selected 35 examples from each class to form the training set and used the remaining examples as the test set. For each approac h, we first learned the dimensionality reduction model on the training set. Moreover, we also train ed a logistic regression classifier using the projected training set in the reduced low dimensiona l space. (Note, for SEPCA, a classifier was trained simultaneously during the process of dimension ality reduction optimization.) Then the test data were projected into the low dimensional space according to each dimensionality reduction model. Finally, the projected test set for each approach wer e classified using each corresponding logistic regression classifier. The right side of Table 1 shows the classification accuracies on the test set for each approach. To better understand the quality of th e classification using projected data, we also included the standard classification results, indicated as 'FULL', using the original high dimensional data. (Note, we are not able to obtain any result for SD R GLM on the newsgroup data as it is inefficient for very high dimensional data.) The results rep orted here are averages over 20 repeated runs, and the projection dimension d = 10. Still the proposed SEPCA presents the best performance among the compared approaches. But different from the synth etic experiments, LDA does not work well on these real data sets. The results on both synthetic and real data show that SEPCA outperforms the other four approaches. This might be attributed to its adaptive exponential family model approximation and its global optimization, while SDR GLM and SPPCA apparently suffer from local optima. 6 Conclusions In this paper, we propose a supervised exponential family PCA (SEPCA) approach, which can be solved efficiently to find global solutions. Moreover, SEPCA overcomes the limitation of the Gaussian assumption of PCA and SPPCA by using a data adaptive approximation for exponential 7 Table 1: Data set statistics and test accuracy results (%) SDR GL M 5 8 .8 1 9 .0 6 3 .5 7 7 .9 ­ colored M VU 2 1 .1 2 .8 4 0 .2 7 5 .8 1 0 .4 Dataset Yale YaleB 11 Tumor Usps3456 Newsgroup #Data 165 2414 174 120 19928 #Dim 4096 1024 12533 256 25284 #Class 15 38 11 4 20 FULL 6 5 .3 4 7 .0 7 7 .6 8 2 .1 3 2 .1 SEPCA 6 4 .4 2 0 .5 8 8 .9 7 9 .7 1 6 .9 SPPCA 5 1 .6 9 .8 6 3 .0 7 8 .5 6 .9 LDA 3 1 .0 6 .2 2 3 .7 7 4 .3 1 0 .0 family models. A simple, straightforward projection metho d for new testing data has also been constructed. Empirical study suggests that this SEPCA outperforms other supervised dimensionality reduction approaches, such as SDR GLM, SPPCA, LDA and colored MVU. References [1 ] [2 ] [3 ] [4 ] [5 ] [6 ] [7 ] [8 ] [9 ] [1 0 ] [1 1 ] [1 2 ] A. Belloni. Introduction to bundle methods. Technical report, MIT, 2005. J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004. M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Advances in Neural Information Processing Systems (NIPS), 2001. R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179­188, 1936. Y. Guo and D. Schuurmans. Convex relaxations of latent variable training. In Advances in Neural Information Processing Systems (NIPS), 2007. Y. Guo and D. Schuurmans. Efficient global optimization for exponential family PCA and low-rank matrix factorization. In Allerton Conf. on Commun., Control, and Computing, 2008. I. Jolliffe. Principal Component Analysis. Springer Verlag, 2002. N. Lawrence. Probabilistic non-linear principle compo nent analysis with gaussian process latent variable models. Journal of Machine Learning Research, 6:1783­1816, 2005. S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller. Fisher discriminant analysis with kernels. In IEEE Neural Networks for Signal Processing Workshop, 1999. M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Prog., 62:321­357, 1993. I. Rish, G. Grabarnilk, G. Cecchi, F. Pereira, and G. Gor don. Closed-form supervised dimensionality reduction with generalized linear models. In Proceedings of International Conference on Machine Learning (ICML), 2008. Sajama and A. Orlitsky. Semi-parametric exponential family PCA. In Advances in Neural Information Processing Systems (NIPS), 2004. Sajama and A. Orlitsky. Supervised dimensionality red uction using mixture models. In Proceedings of the International Conference on Machine Learning (ICML), 2005. L. Song, A. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems (NIPS), 2007. M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B, 6(3):611­622, 1999. P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:457­494, 2001. M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical Report TR-649, UC Berkeley, Dept. Statistics, 2003. S. Yu, K. Yu, V. Tresp, H. Kriegel, and M. Wu. Supervised probabilistic principal component analysis. In Proceedings of 12th ACM SIGKDD International Conf. on KDD, 2006. [1 3 ] [1 4 ] [1 5 ] [1 6 ] [1 7 ] [1 8 ] [1 9 ] 8