Closed-Form Supervised Dimensionality Reduction with Generalized Linear Models

Irina Rish, Genady Grabarnik, Guillermo Cecchi  {rish,genady,gcecchi}@us.ibm.com
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA

Francisco Pereira  fpereira@princeton.edu
CSBMB, Green Hall, Princeton University, Princeton, NJ 08540 USA

Geoffrey J. Gordon  ggordon@cs.cmu.edu
Carnegie Mellon University, School of Computer Science, Machine Learning Dept., Pittsburgh, PA 15213 USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We propose a family of supervised dimensionality reduction (SDR) algorithms that combine feature extraction (dimensionality reduction) with learning a predictive model in a unified optimization framework, using data- and class-appropriate generalized linear models (GLMs), and handling both classification and regression problems. Our approach uses simple closed-form update rules and is provably convergent. Promising empirical results are demonstrated on a variety of high-dimensional datasets.

1. Introduction

Dimensionality reduction (DR) is a popular data-processing technique that serves two main purposes: it helps to provide a meaningful interpretation and visualization of the data, and it helps to prevent overfitting when the number of dimensions greatly exceeds the number of samples, thus acting as a form of regularization.

When the goal is prediction rather than (unsupervised) exploratory data analysis, supervised dimensionality reduction (SDR), which combines DR with simultaneously learning a predictor, can significantly outperform the simple combination of unsupervised DR with subsequent learning of a predictor on the resulting low-dimensional representation (Pereira & Gordon, 2006; Sajama & Orlitsky, 2005). The problem of supervised dimensionality reduction can be viewed as finding a predictive structure, such as a low-dimensional representation, which captures the information about the class label contained in the high-dimensional feature vector while ignoring the "noise".

However, existing SDR approaches are often restricted to specific settings, and can be viewed as jointly learning a particular mapping (most commonly, a linear one) from the feature space to a low-dimensional hidden-variable space, together with a particular predictor that maps the hidden variables to the class label. For example, SVDM (Pereira & Gordon, 2006) learns a linear mapping from observed to hidden variables, effectively assuming Gaussian features when minimizing the sum-squared reconstruction loss; on the prediction side, it focuses on SVM-like binary classification using the hinge loss. The SDR-MM method of (Sajama & Orlitsky, 2005) treats various types of features (e.g., binary and real-valued) but is limited to discrete classification problems, i.e., it is not suitable for regression. Recent work on distance metric learning (Weinberger et al., 2005; Weinberger & Tesauro, 2007) treats both classification and regression settings, but is limited, like SVDM, to Gaussian features and linear mappings when learning Mahalanobis distances.

This paper approaches SDR in a more general framework that views both features and class labels as exponential-family random variables, and allows us to mix and match data- and label-appropriate generalized linear models, thus handling both classification and regression, with both discrete and real-valued data. (In other words, the proposed framework provides nonlinear dimensionality reduction methods based on minimizing Bregman divergences that correspond to particular members of the exponential family of distributions, including linear, PCA-like, dimensionality reduction as a particular case for Gaussian variables.) It can be viewed as discriminative learning that maximizes the conditional likelihood of the class given the hidden variables, while using as a regularizer the conditional likelihood of the features given the low-dimensional hidden-variable "predictive" representation.

The main advantage of our approach, besides its generality, is that it uses simple closed-form update rules in its alternating minimization procedure. The method amounts to a short Matlab code, runs fast, and is guaranteed to converge. The convergence property, as well as the closed-form update rules, result from using appropriate auxiliary functions that bound each part of the objective function (i.e., the reconstruction and prediction losses); we exploit the additivity of auxiliary functions in order to combine bounds on multiple loss functions.

We perform a variety of experiments on both simulated and real-life problems. Results on simulated datasets convincingly demonstrate that our SDR approach can discover underlying low-dimensional structure in high-dimensional noisy data, while outperforming SVM and SVDM, often by far, and practically always beating unsupervised DR followed by learning a predictor. On real-life datasets, the SDR approaches continue to beat unsupervised DR by far, while often matching or somewhat outperforming SVM and SVDM.
2. SDR-GLM: Hidden-Variable Model

Let $X$ be an $N \times D$ data matrix with entries $X_{nd}$, where $N$ is the number of i.i.d. samples and the $n$-th sample is a $D$-dimensional row vector $x_n$. Let $Y$ be an $N \times K$ matrix of class labels for $K$ separate prediction problems (in general, $K \ge 1$), where the $j$-th column, $1 \le j \le K$, provides the set of class labels for the $j$-th prediction problem.

Our supervised dimensionality reduction approach relies on the assumption that each data point $x_n$, $n = 1, \ldots, N$, is a noisy version of some "true" data point $\theta_n$ which lives in a low-dimensional space, and that this hidden representation of the noisy data is actually predictive of the class label. We will also assume, following ePCA (Collins et al., 2001), (GL)^2M (Gordon, 2002), logistic PCA (Schein et al., 2003) and related extensions of PCA, that the noise in the features follows exponential-family distributions with natural parameters $\theta_n$, with possibly different members of the exponential family used for different dimensions, and that the noise is applied independently to each coordinate of $x_n$. (Note that in the case of standard, linear PCA, corresponding to Gaussian noise, the parameters $\theta_n$ live in a linear subspace of the original data space, but for other types of exponential-family noise the low-dimensional space of parameters is typically a nonlinear surface in the original data space.)

Namely, it is assumed that the $N \times D$ matrix $\Theta^X$ of natural parameters can be represented by a linear model in an $L$-dimensional ($L < D$) space:

  $\Theta^X_{nd} = \sum_{l=1}^{L} U_{nl} V_{ld} + \mu^X_d$,

where the rows of the $L \times D$ matrix $V$ correspond to the basis vectors, the rows of the $N \times L$ matrix $U$ give the coordinates of the "true points" $\theta_n$, $n = 1, \ldots, N$, in the $L$-dimensional space spanned by those basis vectors, and $\mu^X$ is the bias vector (corresponding to the empirical mean in PCA). To simplify the notation, we include $\mu^X$ as the $(L+1)$-th row of the matrix $V$ (i.e., $V_{L+1} = \mu^X$) and add an $(L+1)$-th column of 1's to $U$, so that we can write $\Theta^X$ as a product of two matrices:

  $\Theta^X_{nd} = (UV)_{nd} = \sum_{l=1}^{L+1} U_{nl} V_{ld}$.

Given the natural parameter $\Theta^X_{nd}$, an exponential-family noise distribution is defined for each $X_{nd}$ by

  $\log P(X_{nd} \mid \Theta^X_{nd}) = X_{nd}\,\Theta^X_{nd} - G(\Theta^X_{nd}) + \log P_0(X_{nd})$,

where $G(\Theta^X_{nd})$ is the cumulant (log-partition) function that ensures that $P(X_{nd} \mid \Theta^X_{nd})$ sums (or integrates) to 1 over the domain of $X_{nd}$. This function uniquely defines a particular member of the exponential family, e.g., Gaussian, multinomial, Poisson, etc.

We can now view each row $U_n$ as a "compressed" representation of the corresponding data sample $x_n$ that will be used to predict the class labels. We again assume a noisy linear model for each class label $Y_k$ (a column vector of $Y$), with the natural parameter represented by the linear combination

  $\Theta^Y_{nk} = \sum_{l=1}^{L} U_{nl} W_{lk} + \mu^Y_k$,

with linear coefficients $W_{lk}$ and a $K$-dimensional bias vector $\mu^Y$. As for $\Theta^X$, we simplify the notation by including $\mu^Y$ as the $(L+1)$-th row of the matrix $W$, and write

  $\Theta^Y_{nk} = (UW)_{nk} = \sum_{l=1}^{L+1} U_{nl} W_{lk}$.

Using an appropriate type of exponential-family noise $P(Y_{nk} \mid \Theta^Y_{nk})$, we can handle both classification and regression problems. For example, in the case of binary classification, we can model $Y_{nk}$ as a Bernoulli variable with parameter $p_{nk}$ and corresponding natural parameter $\Theta^Y_{nk} = \log \frac{p_{nk}}{1 - p_{nk}}$, and use the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$ to write the Bernoulli distribution for $Y_{nk}$ as $P(Y_{nk} \mid \Theta^Y_{nk}) = \sigma(\Theta^Y_{nk})^{Y_{nk}} \, \sigma(-\Theta^Y_{nk})^{1 - Y_{nk}}$. In the case of regression, $Y_{nk}$ will be a Gaussian variable whose expectation parameter coincides with the natural parameter $\Theta^Y_{nk}$. In other words, we use a generalized linear model (GLM) $E(X_d) = f_d^{-1}(U V_d)$ for each feature $X_d$ (the $d$-th column of $X$, $1 \le d \le D$), and yet another set of GLMs $E(Y_k) = f_k^{-1}(U W_k)$ for each class label $Y_k$ (the $k$-th column of $Y$, $1 \le k \le K$), with possibly different link functions $f_d$ and $f_k$ (e.g., the logit link $f(\mu) = \ln\frac{\mu}{1-\mu} = \sigma^{-1}(\mu)$ is used for binary classification, and the identity link $f(\mu) = \mu$ is used for real-valued regression with Gaussian output).

SDR-GLM optimization problem. We formulate SDR as a joint optimization of two objective functions corresponding to the reconstruction accuracy (data likelihood) and the prediction accuracy (class likelihood):

  $L_X(\Theta^X) = \sum_n \sum_d \log P(X_{nd} \mid \Theta^X_{nd})$,   (1)

  $L_Y(\Theta^Y) = \sum_n \sum_k \log P(Y_{nk} \mid \Theta^Y_{nk})$,   (2)

where $\Theta^X = UV$, $\Theta^Y = UW$, and the likelihoods above correspond to particular members of the exponential family. The SDR optimization problem is then

  $\max_{U,V,W} \; \alpha L_X(UV) + L_Y(UW)$,   (3)

where the data likelihood can be viewed as a regularizer added on top of the class-likelihood maximization objective, with the regularization constant $\alpha$ controlling the trade-off between the two likelihoods. (In our implementation, an L2-norm regularization with small constants (0.01) was also added on all matrices $U$, $V$, $W$, effectively corresponding to Gaussian priors on those matrices.)

Comparison with SVDM. The idea of combining loss functions for SDR was also proposed in SVDM (Pereira & Gordon, 2006), where, similarly to SVD, the quadratic loss $\|X - UV\|_2^2$ was used for the data-reconstruction part of the objective, while the hinge loss was used for the prediction part (using $U$ as the new data matrix). Herein, we generalize SVDM's sum-squared reconstruction loss to log-likelihoods of the exponential family, similarly to ePCA (Collins et al., 2001) and (GL)^2M (Gordon, 2002), replace the hinge loss by the log-likelihoods corresponding to exponential-family class variables, and provide closed-form update rules rather than performing numerical optimization at each iteration, which results in a significant speed-up compared with SVDM.
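For concreteness, the following is a minimal NumPy sketch (not the authors' Matlab implementation) of how the combined objective in eq. 3 can be evaluated for given $U$, $V$, $W$, here assuming Bernoulli features and a Bernoulli label, and handling the bias terms via the extra all-ones column of $U$ described above; all names and the toy data are illustrative only.

import numpy as np

def bernoulli_loglik(S, Theta):
    # log P(s | theta) = s*theta - log(1 + exp(theta)) for s in {0, 1}
    return np.sum(S * Theta - np.logaddexp(0.0, Theta))

def gaussian_loglik(S, Theta):
    # Gaussian log-likelihood up to a constant (unit variance): -(s - theta)^2 / 2
    return -0.5 * np.sum((S - Theta) ** 2)

def sdr_glm_objective(U, V, W, X, Y, alpha,
                      x_loglik=bernoulli_loglik, y_loglik=bernoulli_loglik):
    # eq. 3: alpha * L_X(UV) + L_Y(UW); U already carries the all-ones bias column
    Theta_X = U @ V            # N x D natural parameters for the features
    Theta_Y = U @ W            # N x K natural parameters for the labels
    return alpha * x_loglik(X, Theta_X) + y_loglik(Y, Theta_Y)

# toy usage: N=6 samples, D=5 binary features, K=1 binary label, L=2 hidden dimensions
rng = np.random.default_rng(0)
N, D, K, L = 6, 5, 1, 2
U = np.hstack([rng.normal(size=(N, L)), np.ones((N, 1))])  # last column = 1 (bias trick)
V = rng.normal(size=(L + 1, D))                            # last row = feature biases
W = rng.normal(size=(L + 1, K))                            # last row = label biases
X = rng.integers(0, 2, size=(N, D))
Y = rng.integers(0, 2, size=(N, K))
print(sdr_glm_objective(U, V, W, X, Y, alpha=0.1))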
3. Combining Auxiliary Functions

Since the above problem (eq. 3) is not jointly convex in all parameters, finding a globally optimal solution is nontrivial. Instead, we can employ the auxiliary-function approach commonly used in EM-style algorithms and, by using auxiliary functions of a particular form, derive closed-form iterative update rules that are guaranteed to converge to a local optimum. We show that an auxiliary function for the objective in eq. 3 can be easily derived for an arbitrary pair of $L_X$ and $L_Y$, provided that we know their corresponding auxiliary functions.

Auxiliary functions. Given a function $L(\theta)$ to be maximized, a function $Q(\theta, \hat\theta)$ is called an auxiliary function for $L(\theta)$ if $L(\theta) = Q(\theta, \theta)$ and $L(\theta) \ge Q(\theta, \hat\theta)$ for all $\hat\theta$. It is easy to see that $L(\theta)$ is non-decreasing under the update

  $\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t)$,

i.e., $L(\theta^{t+1}) \ge L(\theta^t)$, and thus an iterative application of such updates is guaranteed to converge to a local maximum of $L$ under mild conditions on $L$ and $Q$. We will make use of the following property of auxiliary functions.

Lemma 1. Let $Q_1(\theta, \hat\theta)$ and $Q_2(\theta, \hat\theta)$ be auxiliary functions for $L_1(\theta)$ and $L_2(\theta)$, respectively. Then

  $Q(\theta, \hat\theta) = \alpha_1 Q_1(\theta, \hat\theta) + \alpha_2 Q_2(\theta, \hat\theta)$   (4)

is an auxiliary function for $L(\theta) = \alpha_1 L_1(\theta) + \alpha_2 L_2(\theta)$, where $\alpha_i > 0$ for $i = 1, 2$.

Proof. $Q(\theta, \hat\theta) = \alpha_1 Q_1(\theta, \hat\theta) + \alpha_2 Q_2(\theta, \hat\theta) \le \alpha_1 L_1(\theta) + \alpha_2 L_2(\theta) = L(\theta)$, and $Q(\theta, \theta) = \alpha_1 Q_1(\theta, \theta) + \alpha_2 Q_2(\theta, \theta) = \alpha_1 L_1(\theta) + \alpha_2 L_2(\theta) = L(\theta)$.

Also, a function is obviously an auxiliary function for itself, i.e., $Q(\theta, \hat\theta) = L(\theta)$ is an auxiliary function for $L(\theta)$. These observations allow us to combine various dimensionality reduction approaches with appropriate predictive loss functions, given appropriate auxiliary functions for both (the next section discusses two such combinations). When optimization of the auxiliary function yields an analytical solution (e.g., for quadratic functions), it is easy to obtain closed-form update rules for the alternating minimization.
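As a sanity check (not part of the paper), the two defining properties of an auxiliary function and the additive combination of Lemma 1 can be verified numerically, here using the Bernoulli variational bound that appears as eq. 6 in the next section together with the Gaussian case (eq. 7), which is its own auxiliary function, combined with arbitrary positive weights as in Lemma 1.

import numpy as np

def bern_loglik(theta, s):
    # L(theta) = log P(s | theta) for s in {0, 1} with natural (log-odds) parameter theta
    return (2 * s - 1) * theta / 2 - np.log(2) - np.log(np.cosh(theta / 2))

def bern_aux(theta, theta_hat, s):
    # quadratic variational bound (eq. 6 below), with T = tanh(theta_hat/2)/theta_hat
    T = np.tanh(theta_hat / 2) / theta_hat
    return (-np.log(2) - np.log(np.cosh(theta_hat / 2)) + T * theta_hat**2 / 4
            + (2 * s - 1) * theta / 2 - T * theta**2 / 4)

def gauss_loglik(theta, s):
    # Gaussian log-likelihood up to a constant (eq. 7); it is its own auxiliary function
    return -0.5 * (s - theta) ** 2

rng = np.random.default_rng(1)
theta, theta_hat = rng.normal(size=1000), rng.normal(size=1000)
s_bern, s_gauss = rng.integers(0, 2, size=1000), rng.normal(size=1000)
alpha = 0.3  # any positive weight, as in Lemma 1

L_comb = alpha * bern_loglik(theta, s_bern) + gauss_loglik(theta, s_gauss)
Q_comb = alpha * bern_aux(theta, theta_hat, s_bern) + gauss_loglik(theta, s_gauss)
Q_diag = alpha * bern_aux(theta, theta, s_bern) + gauss_loglik(theta, s_gauss)

assert np.all(Q_comb <= L_comb + 1e-12)  # minorization: Q(theta, theta_hat) <= L(theta)
assert np.allclose(Q_diag, L_comb)       # tightness:    Q(theta, theta)      = L(theta)
print("auxiliary-function properties hold on", theta.size, "random points")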
4. Iterative Update Rules

Recall that the natural parameters for the $X_{nd}$ and $Y_{nk}$ variables are represented by $\Theta^X_{nd} = \sum_{l=1}^{L+1} U_{nl} V_{ld}$ and $\Theta^Y_{nk} = \sum_{l=1}^{L+1} U_{nl} W_{lk}$. Let $Q_X(\Theta^X, \hat\Theta^X)$ and $Q_Y(\Theta^Y, \hat\Theta^Y)$ be the auxiliary functions for the corresponding log-likelihoods in eqs. 1 and 2; then

  $Q(\Theta^X, \hat\Theta^X, \Theta^Y, \hat\Theta^Y) = \alpha\, Q_X(\Theta^X, \hat\Theta^X) + Q_Y(\Theta^Y, \hat\Theta^Y)$   (5)

is an auxiliary function for the combined log-likelihood in eq. 3 when we fix two out of the three parameter matrices and optimize over the remaining one. The update rules for $U_{nl}$, $V_{ld}$ and $W_{lk}$ are obtained by solving $\frac{\partial Q}{\partial U_{nl}} = 0$, $\frac{\partial Q_X}{\partial V_{ld}} = 0$ and $\frac{\partial Q_Y}{\partial W_{lk}} = 0$, where

  $\frac{\partial Q}{\partial U_{nl}} = \alpha \sum_d \frac{\partial Q_X}{\partial \Theta^X_{nd}} \frac{\partial \Theta^X_{nd}}{\partial U_{nl}} + \sum_k \frac{\partial Q_Y}{\partial \Theta^Y_{nk}} \frac{\partial \Theta^Y_{nk}}{\partial U_{nl}}$.

Note that we get simpler expressions for $V$ and $W$, since they appear only in the $Q_X$ or only in the $Q_Y$ part of eq. 5, respectively.

Auxiliary functions for Bernoulli and Gaussian log-likelihoods. For a Bernoulli variable $s$ with natural (log-odds) parameter $\theta$ we use the variational bound on the log-likelihood $L(\theta) = \log P(s \mid \theta)$ proposed by (Jaakkola & Jordan, 1997) and used later by (Schein et al., 2003):

  $L(\theta) \ge Q(\theta, \hat\theta) = -\log 2 - \log\cosh(\hat\theta/2) + \frac{T \hat\theta^2}{4} + \frac{(2s-1)\theta}{2} - \frac{T \theta^2}{4}$,   (6)

where $T = \tanh(\hat\theta/2)/\hat\theta$. Note that this auxiliary function is quadratic in $\theta$, and taking its derivatives leads to simple closed-form iterative update rules for $U$, $V$ and $W$. For multinomial variables, one can use a recent extension of the above bound to multinomial logistic regression proposed by (Bouchard, 2007).

For a Gaussian variable $s$ with a natural parameter $\theta$ that coincides with the mean parameter (identity link function), we do not really have to construct an auxiliary function, since we can simply use the negative squared loss, proportional to the Gaussian log-likelihood, as an auxiliary function itself, i.e.,

  $L(\theta) = Q(\theta, \hat\theta) = -c(s - \theta)^2$,   (7)

where $c = (2\sigma^2)^{-1}$ is a constant, assuming a fixed standard deviation $\sigma$ that is not part of our estimation procedure here, similarly to the approach of (Collins et al., 2001) and related work; $c$ can be ignored since it will be subsumed by the parameter $\alpha$.

Using the above auxiliary functions, we can combine them into joint auxiliary functions as in eq. 5 for various combinations of Bernoulli and Gaussian variables $X_{nd}$ and $Y_{nk}$. Namely, assuming all $X_{nd}$ are Bernoulli, we get (Schein et al., 2003)

  $Q^{Ber}_X(\Theta^X, \hat\Theta^X) = \sum_n \sum_d \Big[ -\log 2 - \log\cosh(\hat\Theta^X_{nd}/2) + \frac{T_{nd} (\hat\Theta^X_{nd})^2}{4} + \frac{(2X_{nd}-1)\Theta^X_{nd}}{2} - \frac{T_{nd} (\Theta^X_{nd})^2}{4} \Big]$,   (8)

while assuming all $X_{nd}$ are Gaussian, we get

  $Q^{Gauss}_X(\Theta^X, \hat\Theta^X) = -\sum_n \sum_d (X_{nd} - \Theta^X_{nd})^2$.   (9)

Note that $Q_Y(\Theta^Y, \hat\Theta^Y)$ for all-Bernoulli or all-Gaussian $Y_{nk}$ is obtained by replacing $X$ with $Y$, $V$ with $W$, and $d$ with $k$ in eqs. 8 and 9, respectively.

Due to lack of space, we omit the derivations of the iterative update rules for the four versions of SDR-GLM that we experimented with (for more detail, see (Rish et al., 2008)): Gaussian-SDR, which assumes Gaussian $X_{nd}$ and Bernoulli $Y_{nk}$; Bernoulli-SDR, for Bernoulli $X_{nd}$ and Bernoulli $Y_{nk}$; and Gaussian-SDR-Regression and Bernoulli-SDR-Regression, for Gaussian $Y_{nk}$ (appropriate for real-valued labels, i.e., for regression problems).

Prediction step. Once we have learned the parameters $U$, $V$ and $W$, and thus $\Theta^X$ and $\Theta^Y$, we can use them to predict labels for previously unseen data. Given the test data $X^{test}$, we only need to find the corresponding low-dimensional representation $U^{test}$ by performing the corresponding iterative updates only with respect to $U$, keeping $V$ and $W$ fixed. Once $U^{test}$ is found, we predict the class labels as $Y^{test} = U^{test} W$.
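Since the explicit update rules are omitted above (they are spelled out in the technical report), here is a purely illustrative NumPy sketch of the alternating scheme for the simplest, all-Gaussian case, where by eqs. 7 and 9 the auxiliary functions coincide with the (negative squared-loss) log-likelihoods and every step is an exact least-squares solve; it also includes the fold-in prediction step described above. This is a sketch under those assumptions, not the authors' implementation; the Bernoulli cases would instead maximize the quadratic bounds of eqs. 6 and 8.

import numpy as np

def fit_gaussian_sdr(X, Y, L, alpha=0.1, lam=0.01, n_iter=100, seed=0):
    # Alternating closed-form updates for Gaussian features and Gaussian labels:
    # maximize alpha*L_X(UV) + L_Y(UW) (eq. 3) with small L2 penalties on the
    # matrices; U = [Z, 1] so that the last rows of V and W act as bias terms.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = Y.shape[1]
    Z = rng.normal(scale=0.1, size=(N, L))
    V = rng.normal(scale=0.1, size=(L + 1, D))
    W = rng.normal(scale=0.1, size=(L + 1, K))
    for _ in range(n_iter):
        U = np.hstack([Z, np.ones((N, 1))])
        G = U.T @ U + lam * np.eye(L + 1)
        V = np.linalg.solve(G, U.T @ X)          # ridge least squares for V (U fixed)
        W = np.linalg.solve(G, U.T @ Y)          # ridge least squares for W (U fixed)
        Vf, vb = V[:L], V[L]                     # free part and bias row of V
        Wf, wb = W[:L], W[L]
        A = alpha * (Vf @ Vf.T) + Wf @ Wf.T + lam * np.eye(L)
        B = alpha * ((X - vb) @ Vf.T) + (Y - wb) @ Wf.T
        Z = np.linalg.solve(A, B.T).T            # row-wise closed-form update of U
    return Z, V, W

def predict(X_test, V, W, L, lam=0.01):
    # Fold-in: recover U_test from X_test with V, W fixed, then Theta_Y = U_test @ W
    Vf, vb = V[:L], V[L]
    A = Vf @ Vf.T + lam * np.eye(L)
    Z = np.linalg.solve(A, ((X_test - vb) @ Vf.T).T).T
    return np.hstack([Z, np.ones((len(X_test), 1))]) @ W

# toy usage with random placeholder data (not one of the paper's datasets)
rng = np.random.default_rng(2)
X_train, Y_train = rng.normal(size=(80, 30)), rng.normal(size=(80, 1))
Z, V, W = fit_gaussian_sdr(X_train, Y_train, L=3)
print(predict(X_train, V, W, L=3)[:3].ravel())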
5. Empirical Evaluation

We evaluated our SDR-GLM methods on both simulated and real-life datasets, in both classification and regression settings. We varied the low-dimensionality parameter $L$ from 2 to 10, and evaluated a range of regularization parameters $\alpha = 0.0001, 0.001, 0.01, 0.1, 1, 10, 100$, selecting the value giving the best average error over several dimensions. (Selecting the best $\alpha$ separately for each dimension can only improve the results, but would be more computationally intensive.)

5.1. Classification Problems

In the classification setting, we compared Bernoulli-SDR and Gaussian-SDR against linear SVM (we used the SVM code by A. Schwaighofer, available at http://ida.first.fraunhofer.de/~anton/software.html) and against unsupervised dimensionality reduction followed by SVM and by logistic regression, referred to as SVM-UDR and Logistic-UDR, respectively. In both cases, unsupervised dimensionality reduction is performed first using the data-appropriate DR method (i.e., PCA for real-valued data and logistic PCA for binary data; this is equivalent to removing the prediction loss in eq. 3); then SVM or logistic regression is trained on the low-dimensional data representation $U$. For datasets with real-valued features, we also compared the above methods to SVDM (Pereira & Gordon, 2006). We performed $k$-fold cross-validation with $k = 10$ on the datasets with fewer than 1000 dimensions, and with $k = 5$ otherwise.

Simulated data. In order to first test our methods on data with a controllable low-dimensional structure, we simulated high-dimensional datasets that were indeed noisy versions of underlying, easily separable low-dimensional data. Specifically, a set of $N = 100$ samples from two classes was generated randomly in a two-dimensional space so that the samples were linearly separable with a large margin. Next, we simulated two sets of exponential-family random variables $X_{nd}$, a Bernoulli set and a Gaussian set, using the coordinates of the above points for the natural parameters $\Theta_{nd}$, where the number of samples was $N = 100$ and the dimensionality of the "noisy" dataset varied from $D = 100$ to $D = 1000$. We then combined the data with the labels generated in the low-dimensional space and provided them as input to our algorithms.
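The exact generation procedure for these simulated benchmarks is not fully spelled out above; the following is a hypothetical re-creation, assuming that the separable 2-D coordinates are mapped to the $D$ natural parameters by a random linear map (the map, the margins, and all names are illustrative assumptions, not the paper's protocol).

import numpy as np

def simulate(N=100, D=300, margin=2.0, kind="bernoulli", seed=0):
    # Two linearly separable classes in 2-D; their coordinates define the natural
    # parameters of D-dimensional exponential-family noise (Bernoulli or Gaussian).
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=N)                              # class labels
    x1 = (2 * y - 1) * rng.uniform(margin, 2 * margin, size=N)  # separated coordinate
    x2 = rng.uniform(-1.0, 1.0, size=N)                         # uninformative coordinate
    U_true = np.column_stack([x1, x2])                          # "true" 2-D points
    V_map = rng.normal(size=(2, D)) / np.sqrt(2)                # random 2 -> D linear map
    Theta = U_true @ V_map                                      # natural parameters
    if kind == "bernoulli":
        X = rng.binomial(1, 1.0 / (1.0 + np.exp(-Theta)))       # logistic link
    else:
        X = Theta + rng.normal(size=Theta.shape)                # identity link, unit noise
    return X, y

X, y = simulate(D=500, kind="bernoulli")
print(X.shape, y[:10])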
Simulated data: Bernoulli noise. Figure 1a summarizes the results for the Bernoulli-noise dataset, comparing the supervised DR methods (Bernoulli-SDR and Gaussian-SDR) against SVM and against unsupervised DR followed by learning a predictor (Logistic-UDR and SVM-UDR); the reduced dimensionality parameter is set to $L = 2$ and the regularization constant to $\alpha = 0.0001$ (the choice of those parameters is discussed below). We can see that Bernoulli-SDR significantly outperforms all other methods, including SVM, and that both Bernoulli-SDR and Gaussian-SDR also outperform unsupervised DR followed by either logistic regression or SVM. Apparently, Bernoulli-SDR is able to correctly reconstruct the underlying separable two-dimensional dataset and is robust to noise, as its error remains zero for up to 700 dimensions and increases only slightly, up to 0.05, at 1000 dimensions. On the other hand, SVM has zero prediction error only in the lowest-dimensional case ($D = 100$) and is much more sensitive to noise as the dimensionality of the data increases, making incorrect predictions in up to 14% to 21% of cases once the dimensionality increases above $D = 300$. Apparently, SVM was not able to exploit the underlying separable low-dimensional structure disturbed by high-dimensional noise, while supervised dimensionality reduction easily detected this structure.

Also, using the Bernoulli model instead of the Gaussian one when the features are binary is clearly beneficial, and thus, as noted previously, proper extensions of PCA to exponential-family data must be used (Collins et al., 2001; Schein et al., 2003). However, previous work on logistic PCA by (Schein et al., 2003) demonstrated the advantages of the Bernoulli versus the Gaussian assumption only for reconstruction of the original data, while this paper investigates the impact of such assumptions on the generalization error in the supervised learning case. This is less obvious, since a good fit to the training data does not necessarily imply a good generalization ability, as shown by our experiments with unsupervised dimensionality reduction followed by learning a classifier.

Regarding the choice of the regularization parameter $\alpha$, we experimented with a range of values from 0.0001 to 10 and concluded that the smallest value was the most beneficial for both SDR algorithms; this is intuitive, since it effectively puts most of the weight on the predictive loss. There is a clear trend (Figure 1b) of the error decreasing as the parameter decreases: sufficiently low values $\alpha \le 0.1$ yield quite similar low errors, but increasing $\alpha$ further, especially above 1, leads to a drastic increase in the classification error, particularly in the higher-dimensional cases. Note, however, that this tendency is not present in the other datasets we experimented with, where the effect of the regularization constant can be non-monotonic, which underscores the importance of using cross-validation or other parameter-tuning approaches. (A Bayesian approach to selecting the regularization parameter may prove beneficial, as shown, for example, in (Lin & Lee, 2006).)

Regarding the choice of the reduced-dimensionality parameter $L$, for low values of $\alpha$ we did not observe any significant variation in the results when increasing this parameter up to $L = 10$; e.g., the results for different $L$ were practically identical for $\alpha \le 0.1$, while for higher values of $\alpha$ the variance was more significant.

[Figure 1. (a) Results for Bernoulli noise: classification error vs. data dimensionality D for Bernoulli-SDR, Gaussian-SDR, Logistic-UDR, SVM-UDR and SVM. (b) Effects of the regularization parameter (the weight on the data reconstruction loss), Bernoulli noise. (c) Results for Gaussian noise.]

Simulated data: Gaussian noise. Figure 1c shows the results for Gaussian noise: supervised dimensionality reduction provides a clear advantage over the unsupervised ones, although the performance of SDR versus SVM is somewhat less impressive, as Gaussian-SDR is comparable to SVM. However, Gaussian-SDR considerably outperforms another supervised dimensionality reduction method, SVDM, proposed by (Pereira & Gordon, 2006). SVDM was used with its regularization constant set to 100, since this value provided the best SVDM performance among the same range of values as before. Interestingly, the performance of SVDM does not show any monotonic dependence on this parameter.
Real-life datasets. First, we considered several datasets with binary features, such as a 41-dimensional Sensor network dataset in which the data represent the connectivity (0 or 1) between all pairs of sensor nodes in a network of 41 light sensors (see Table 1). Bernoulli-SDR with $L = 10$ and regularization parameter $\alpha = 0.1$ outperformed SVM; Gaussian-SDR with $L = 8, 10$ and the same $\alpha = 0.1$ matched SVM's performance; and both unsupervised dimensionality reduction methods followed by SVM or logistic regression (SVM-UDR and Logistic-UDR, respectively) performed worse. Also, using the Bernoulli model for binary data instead of the Gaussian one seems to pay off: Bernoulli-SDR performs somewhat better than Gaussian-SDR. It is interesting to note that for the really low dimensionality $L = 2$, all of the above methods have the same error of 0.2, while increasing the dimensionality allows for much better performance, although this effect is non-monotonic.

Table 1. Results for the Sensor network dataset ($N = 41$, $D = 41$): classification error of each method for different values of the reduced dimension $L$.

  method \ L       2      4      6      8      10
  Bernoulli-SDR    0.20   0.24   0.27   0.15   0.12
  Gaussian-SDR     0.20   0.22   0.22   0.17   0.17
  Logistic-UDR     0.20   0.22   0.20   0.20   0.24
  SVM-UDR          0.20   0.27   0.22   0.27   0.24
  SVM              0.17 (independent of L)

Another dataset related to network performance management, of somewhat larger dimensionality ($N = 169$, $D = 168$), contains pairwise end-to-end network latencies (ping round-trip times) collected by the PlanetLab measurement project (http://www.pdos.lcs.mit.edu/~strib/pl_app), discretized by applying a threshold: above-average latency is considered "bad" (1), while below-average latency is considered "good" (0). We selected the first column (latencies of all 169 nodes towards node 1) as the label, and predicted it given the remaining data. The results are shown in Table 2. The regularization parameters selected by cross-validation differed here across the SDR methods: for Bernoulli-SDR, $\alpha = 100$ turned out to be the best, while Gaussian-SDR performed better with $\alpha = 1$. Overall, the results for the different methods and varying dimensions $L$ were surprisingly similar to each other, with the two SDR methods achieving the lowest error of 0.07 at $L = 4$ and $L = 8$, respectively, which was also achieved by SVM-UDR at $L = 10$, slightly outperforming SVM (0.10 error). Very similar results (not shown here due to lack of space) were also obtained on the Advertisement dataset from the UCI ML repository.

Table 2. Results for the PlanetLab dataset ($N = 169$, $D = 168$): classification error of each method for different values of the reduced dimension $L$.

  method \ L       2      4      6      8      10
  Bernoulli-SDR    0.10   0.07   0.10   0.12   0.15
  Gaussian-SDR     0.10   0.11   0.08   0.07   0.08
  Logistic-UDR     0.11   0.10   0.08   0.09   0.10
  SVM-UDR          0.10   0.09   0.08   0.08   0.07
  SVM              0.10 (independent of L)

We next experimented with several extremely high-dimensional datasets from biological experiments that have real-valued features. The first dataset, called here the Proteomics data, contains mass-spectrometry measurements for $D = 38573$ proteins (features), showing their "expression levels" in $N = 50$ female subjects (examples), 25 of whom were pregnant (class 1) while the others were not (class 0). The results are shown in Table 3, comparing Gaussian-SDR with $\alpha = 0.001$ (the value yielding the lowest average error across dimensions on this dataset) against SVM, Logistic-UDR, SVM-UDR (both using the Gaussian assumption, i.e., PCA, for dimensionality reduction), and SVDM with its best-performing parameter 0.01. Despite its very high dimensionality, this dataset turned out to be easy: both SVM and Gaussian-SDR achieved an error of 0.02, and Gaussian-SDR needed only $L = 4$ dimensions to achieve it. On the contrary, unsupervised DR followed by predictor learning (Logistic-UDR and SVM-UDR) suffered from very high errors at low dimensions, and only reached the same low error at $L = 10$. SVDM was able to reach its lowest error (the same 0.02) a bit earlier ($L = 6$), although at $L = 2$ it incurred a huge error of 0.42, while Gaussian-SDR had a 0.04 error at that same level of reduced dimensionality.

Table 3. Results for the Mass-spectrometry (Proteomics) dataset ($N = 50$, $D = 38573$): classification error of each method for different values of the reduced dimension $L$.

  method \ L       2      4      6      8      10
  Gaussian-SDR     0.04   0.02   0.02   0.02   0.02
  Logistic-UDR     0.5    0.18   0.08   0.04   0.02
  SVM-UDR          0.54   0.2    0.02   0.06   0.02
  SVDM             0.42   0.04   0.02   0.06   0.04
  SVM              0.02 (independent of L)
Another truly high-dimensional dataset we used contains fMRI recordings of a subject's brain activity (measured via changes in brain oxygenation levels) while the subject was observing, on a screen, words of two types, representing tools or buildings (see (Pereira & Gordon, 2006) for details). The task was to learn a "mind-reading" classifier that would predict, given the fMRI data, which type of word the subject was looking at. The features here correspond to fMRI voxels ($D = 14043$ voxels were selected after some preprocessing of the data, as described in (Pereira & Gordon, 2006)), and there are $N = 84$ samples (i.e., word instances presented to the subject). This dataset was clearly more challenging than the previous one (both SVM's and SDR's errors were around 0.2). The results for all methods are shown in Table 4; for Gaussian-SDR we used $\alpha = 0.0001$, while SVDM was best at 0.001 (as before, we used the average error over all dimensions to select the best parameter). For those parameter values, Gaussian-SDR matches SVM's error of 0.21 using just five dimensions ($L = 5$), while SVDM reaches the same error at $L = 15$ dimensions. Again, learning a predictor on "compressed" data obtained via unsupervised dimensionality reduction was consistently worse than both supervised methods.

Table 4. Results for the fMRI dataset ($N = 84$, $D = 14043$): classification error of each method for different values of the reduced dimension $L$.

  method \ L       5      10     15     20     25
  Gaussian-SDR     0.21   0.26   0.23   0.20   0.23
  Logistic-UDR     0.44   0.42   0.29   0.30   0.26
  SVM-UDR          0.49   0.52   0.56   0.57   0.55
  SVDM             0.32   0.25   0.21   0.23   0.23
  SVM              0.21 (independent of L)

5.2. Regression Problems

Finally, we performed some preliminary experiments with the regression version of our SDR approach, which makes a Gaussian assumption about the label (response variable) $Y$ and thus uses a sum-squared predictive loss $L_Y(UW)$ in equation 3, comparing it with the state-of-the-art sparse regression technique called the Elastic Net (EN), which was shown to improve over both the Lasso and ridge regression by using a convex combination of L1- and L2-norm regularization terms (Zou & Hastie, 2005). We used the fMRI data from the 2007 Pittsburgh Brain Activity Interpretation Competition (PBAIC) (Pittsburgh EBC Group, 2007), where the fMRI data were recorded while subjects were playing a videogame, and the task was to predict several real-valued response variables, such as the level of anxiety the subject was experiencing. We used the data for the two runs (games) of the first subject and for the Instructions response variable, learning on run 1 and predicting on run 2. The dataset contained $N = 704$ samples (measurements over time) and approximately $D = 33,000$ features (voxels). Figure 2 compares the performance of the Elastic Net and Gaussian-SDR-Regression, where, for EN, the $L$ parameter denotes the number of active variables (voxels) selected by the sparse regression. We can see that SDR regression is comparable with the state-of-the-art EN (or even slightly better) when the number of hidden dimensions is not too low.
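For reference, the score reported in Figure 2 is the correlation between the predicted and the actual response on the test run; a minimal sketch of computing such a score follows, assuming a Pearson correlation (the figure only says "correlation") and using placeholder arrays standing in for the PBAIC Instructions regressor rather than the actual data.

import numpy as np

def correlation_with_response(y_true, y_pred):
    # Pearson correlation between the actual and the predicted response time-courses
    return np.corrcoef(y_true, y_pred)[0, 1]

# placeholder arrays of the same length as run 2 (N = 704 time points)
rng = np.random.default_rng(0)
y_true = rng.normal(size=704)
y_pred = 0.7 * y_true + rng.normal(scale=0.7, size=704)
print(round(correlation_with_response(y_true, y_pred), 2))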
[Figure 2. Results for the fMRI dataset from PBAIC. Performance is measured by the correlation between the actual and predicted response variable (Instructions, in this case), plotted against the reduced dimension L for ElasticNet and Gaussian-SDR-Reg.]

6. Conclusions and Future Work

This paper proposes a family of SDR algorithms that use generalized linear models to handle various types of features and labels, thus generalizing previous approaches within a unifying framework.
Our SDR-GLM approach is fast and simple: it uses closed-form update rules at each iteration of an alternating minimization procedure, and it is always guaranteed to converge. Experiments on a variety of datasets show that this approach is clearly promising, although more empirical investigation is needed in the case of SDR regression.

Although we only tested our approach with Gaussian and Bernoulli variables, it can easily be extended to multinomial variables (and thus to multiclass classification) using a recent extension of the variational bound of (Jaakkola & Jordan, 1997) to multinomial logistic regression (soft-max) (Bouchard, 2007). Deriving closed-form update rules for other members of the exponential family (e.g., Poisson) remains a direction for future work. Another possible extension is to obtain closed-form SDR update rules for alternative DR methods, such as non-negative matrix factorization (NMF) (Lee & Seung, 2000), by simply plugging in NMF's auxiliary function instead of the (unconstrained) PCA-like quadratic loss. Other potential applications of our approach include dimensionality reduction on mixed-type data; weighted dimensionality reduction schemes (e.g., assigning different weights to the reconstruction errors of different coordinates in PCA and similar DR techniques); multitask learning; and semi-supervised learning, where only the reconstruction-loss part of the objective is used for unlabeled data, while both the reconstruction and prediction losses are kept for the labeled data. Finally, a more principled selection of the regularization constant $\alpha$ (e.g., using Bayesian approaches) is another open research direction.

References

Bouchard, G. (2007). Efficient bounds for the softmax and applications to approximate inference in hybrid models. NIPS Workshop on Approximate Inference in Hybrid Models.

Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. NIPS 2001.

Gordon, G. (2002). Generalized^2 Linear^2 Models. NIPS 2002 (pp. 577-584).

Jaakkola, T., & Jordan, M. (1997). A variational approach to Bayesian logistic regression problems and their extensions. Sixth International Workshop on Artificial Intelligence and Statistics.

Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. NIPS (pp. 556-562).

Lin, Y., & Lee, D. (2006). Bayesian L1-norm sparse learning. International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006).

Pereira, F., & Gordon, G. (2006). The Support Vector Decomposition Machine. ICML 2006 (pp. 689-696).

Pittsburgh EBC Group (2007). PBAIC Homepage: http://www.ebc.pitt.edu/2007/competition.html.

Rish, I., Grabarnik, G., Cecchi, G., Pereira, F., & Gordon, G. (2008). Closed-form supervised dimensionality reduction with generalized linear models (Technical Report). IBM T.J. Watson Research Center.

Sajama, & Orlitsky, A. (2005). Supervised dimensionality reduction using mixture models. ICML 2005 (pp. 768-775). New York, NY, USA: ACM.

Schein, A., Saul, L., & Ungar, L. (2003). A generalized linear model for principal component analysis of binary data. Ninth International Workshop on Artificial Intelligence and Statistics.

Weinberger, K., Blitzer, J., & Saul, L. (2005). Distance metric learning for large margin nearest neighbor classification. NIPS 2005.

Weinberger, K., & Tesauro, G. (2007). Metric learning for kernel regression. Eleventh International Conference on Artificial Intelligence and Statistics (pp. 608-615). Puerto Rico: Omnipress.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67, 301-320.