Nonparametric Regression and Classification with Joint Sparsity Constraints

Han Liu    John Lafferty    Larry Wasserman
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We propose new families of models and algorithms for high-dimensional nonparametric learning with joint sparsity constraints. Our approach is based on a regularization method that enforces common sparsity patterns across different function components in a nonparametric additive model. The algorithms employ a coordinate descent approach that is based on a functional soft-thresholding operator. The framework yields several new models, including multi-task sparse additive models, multi-response sparse additive models, and sparse additive multi-category logistic regression. The methods are illustrated with experiments on synthetic data and gene microarray data.

1 Introduction

Many learning problems can be naturally formulated in terms of multi-category classification or multi-task regression. In a multi-category classification problem, the goal is to discriminate between different categories using a set of high-dimensional feature vectors; for instance, classifying the type of tumor in a cancer patient from gene expression data. In a multi-task problem, the goal is to form several regression or classification estimators for related data sets that share common variables; for instance, predicting test scores across different school districts. In other areas, such as multi-channel signal processing, it is of interest to simultaneously decompose multiple signals in terms of a large overcomplete dictionary. In each case, while the details of the estimators vary from instance to instance, across categories, or across signals, they may share a common sparsity pattern of relevant variables selected from a high-dimensional feature vector.

In the parametric setting, progress has recently been made on such problems using regularization based on the sum of supremum norms (Turlach et al., 2005; Tropp et al., 2006; Zhang, 2006). For example, consider the K-task linear regression problem
$$ y_i^{(k)} = \beta_0^{(k)} + \sum_{j=1}^{p} \beta_j^{(k)} x_{ij}^{(k)} + \epsilon_i^{(k)}, $$
where the superscript $k$ indexes the tasks and the subscript $i = 1, \ldots, n_k$ indexes the instances within a task. Using quadratic loss, Zhang (2006) suggests the estimator
$$ \hat\beta = \arg\min_{\beta}\ \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i=1}^{n_k} \Big( y_i^{(k)} - \beta_0^{(k)} - \sum_{j=1}^{p} \beta_j^{(k)} x_{ij}^{(k)} \Big)^2 + \lambda \sum_{j=1}^{p} \max_k \big|\beta_j^{(k)}\big| \qquad (1) $$
where $\max_k |\beta_j^{(k)}| = \|\beta_j\|_\infty$ is the sup-norm of the vector $\beta_j = (\beta_j^{(1)}, \ldots, \beta_j^{(K)})^T$ of coefficients for the $j$-th feature across the different tasks. The sum of sup-norms regularization has the effect of "grouping" the elements in $\beta_j$ so that they can be shrunk towards zero simultaneously.

The problems of multi-response (or multivariate) regression and multi-category classification can be viewed as special cases of the multi-task regression problem in which the tasks share the same design matrix. Turlach et al. (2005) and Fornasier and Rauhut (2008) propose the same sum of sup-norms regularization as in (1) for such problems in the linear model setting. In related work, Zhang et al. (2008) propose the sup-norm support vector machine, demonstrating its effectiveness on gene data.
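To make the effect of the sum of sup-norms penalty in (1) concrete, the following small sketch (Python/NumPy, not from the original paper) evaluates the penalty for a coefficient matrix with one row per task; the array layout and the function name are our own illustration.

```python
import numpy as np

def sup_norm_penalty(beta, lam):
    """Sum of sup-norms penalty  lam * sum_j max_k |beta[k, j]|  as in (1).

    beta : (K, p) array, one row of regression coefficients per task.
    lam  : regularization parameter lambda.
    """
    return lam * np.sum(np.max(np.abs(beta), axis=0))

# Each feature is penalized once, at the size of its largest coefficient
# across tasks, which encourages whole columns of beta to be zero together.
beta = np.array([[1.0, 0.0,  0.3],
                 [0.5, 0.0, -2.0]])
print(sup_norm_penalty(beta, lam=0.1))   # 0.1 * (1.0 + 0.0 + 2.0) = 0.3
```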
In this paper we develop new methods for nonparametric estimation in such multi-task and multi-category regression and classification problems. Rather than fitting a linear model, we estimate smooth functions of the data, and formulate a regularization framework that encourages joint functional sparsity, where the component functions can differ across tasks while sharing a common sparsity pattern. Building on a recently proposed method called sparse additive models, or "SpAM" (Ravikumar et al., 2007), we propose a convex regularization functional that can be viewed as a nonparametric analog of the sum of sup-norms regularization for linear models. Based on this regularization functional, we develop new models for nonparametric multi-task regression and classification, including multi-task sparse additive models (MT-SpAM), multi-response sparse additive models (MR-SpAM), and sparse multi-category additive logistic regression (SMALR).

The contributions of this work include (1) an efficient iterative algorithm based on a functional soft-thresholding operator derived from subdifferential calculus, leading to the multi-task and multi-response SpAM procedures, (2) a penalized local scoring algorithm that corresponds to fitting a sequence of multi-response SpAM estimates for sparse multi-category additive logistic regression, and (3) the successful application of this methodology to multi-category tumor classification and biomarker discovery from gene microarray data.

2 Nonparametric Models for Joint Functional Sparsity

We begin by introducing some notation. If $X$ has distribution $P_X$ and $f$ is a function of $x$, its $L_2(P_X)$ norm is denoted by $\|f\|^2 = \int f(x)^2 \, dP_X = \mathbb{E}(f^2)$. If $v = (v_1, \ldots, v_n)^T$ is a vector, define $\|v\|_n^2 = \frac{1}{n}\sum_{j=1}^{n} v_j^2$ and $\|v\|_\infty = \max_j |v_j|$. For a $p$-dimensional random vector $(X_1, \ldots, X_p)$, let $\mathcal{H}_j$ denote the Hilbert subspace of $L_2(P_{X_j})$ of $P_{X_j}$-measurable functions $f_j(x_j)$ of the single scalar variable $X_j$ with zero mean, i.e., $\mathbb{E}[f_j(X_j)] = 0$. The inner product on this space is defined as $\langle f_j, g_j \rangle = \mathbb{E}[f_j(X_j) g_j(X_j)]$. In this paper, we mainly study multivariate functions $f(x_1, \ldots, x_p)$ that have an additive form, i.e., $f(x_1, \ldots, x_p) = \alpha + \sum_j f_j(x_j)$, with $f_j \in \mathcal{H}_j$ for $j = 1, \ldots, p$. With $\mathcal{H} \equiv \mathcal{H}_{\{1\}} \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \cdots \oplus \mathcal{H}_p$ denoting the direct sum Hilbert space, we have that $f \in \mathcal{H}$.

2.1 Multi-task/Multi-response Sparse Additive Models

In a K-task regression problem, we have observations $\{(x_i^{(k)}, y_i^{(k)}),\ i = 1, \ldots, n_k,\ k = 1, \ldots, K\}$, where $x_i^{(k)} = (x_{i1}^{(k)}, \ldots, x_{ip}^{(k)})^T$ is a $p$-dimensional covariate vector, the superscript $k$ indexes tasks, and $i$ indexes the i.i.d. samples within each task. In the following, for notational simplicity, we assume that $n_1 = \cdots = n_K = n$. We also assume different tasks are comparable and that each $Y^{(k)}$ and $X_j^{(k)}$ has been standardized to have mean zero and variance one. This is not really a restriction of the model, since a straightforward weighting scheme can be adopted to extend our approach to handle noncomparable tasks. We assume the true model is $\mathbb{E}\big(Y^{(k)} \mid X^{(k)} = x^{(k)}\big) = f^{(k)}(x^{(k)}) = \sum_{j=1}^{p} f_j^{(k)}(x_j^{(k)})$ for $k = 1, \ldots, K$, where, for simplicity, we take all intercepts to be zero. Let $Q_{f^{(k)}}(x, y) = (y - f^{(k)}(x))^2$ denote the quadratic loss. To encourage common sparsity patterns across different function components, we define the regularization functional $\Omega_K(f)$ by
$$ \Omega_K(f) = \sum_{j=1}^{p} \max_{k=1,\ldots,K} \big\| f_j^{(k)} \big\|. \qquad (2) $$
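As an illustration of how $\Omega_K(f)$ can be evaluated in practice, the sketch below computes its empirical analogue, replacing $\|\cdot\|$ by the empirical norm $\|\cdot\|_n$ defined above, for fitted component-function values stored in an array; the (K, p, n) layout is an assumption made only for this example.

```python
import numpy as np

def omega_K(F):
    """Empirical analogue of the regularization functional in (2):
       sum_j max_k ||f_j^{(k)}||_n,  with ||v||_n = sqrt(mean(v^2)).

    F : (K, p, n) array of fitted component values f_j^{(k)}(x_ij)
        (task k, feature j, observation i); the layout is an assumption.
    """
    norms = np.sqrt(np.mean(F ** 2, axis=2))   # (K, p) matrix of ||f_j^{(k)}||_n
    return np.sum(np.max(norms, axis=0))       # sup over tasks, then sum over features
```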
The regularization functional $\Omega_K(f)$ naturally combines the sum of sup-norms penalty for parametric joint sparsity with the regularization idea of SpAM for nonparametric functional sparsity. If $K = 1$, then $\Omega_1(f)$ is just the regularization term introduced for (single-task) sparse additive models by Ravikumar et al. (2007). If each $f_j^{(k)}$ is a linear function, then $\Omega_K(f)$ reduces to the sum of sup-norms regularization term in (1). We shall employ $\Omega_K(f)$ to induce joint functional sparsity in nonparametric multi-task inference. Using this regularization functional, the multi-task sparse additive model (MT-SpAM) is formulated as a penalized M-estimator, by solving the optimization problem
$$ \big(\hat f^{(1)}, \ldots, \hat f^{(K)}\big) = \arg\min_{f^{(1)}, \ldots, f^{(K)}}\ \frac{1}{2n} \sum_{i=1}^{n} \sum_{k=1}^{K} Q_{f^{(k)}}\big(x_i^{(k)}, y_i^{(k)}\big) + \lambda\, \Omega_K(f) \qquad (3) $$
where $f_j^{(k)} \in \mathcal{H}_j$ for $j = 1, \ldots, p$ and $k = 1, \ldots, K$, and $\lambda > 0$ is a regularization parameter. The multi-response sparse additive model (MR-SpAM) has exactly the same formulation as (3), except that a common design matrix is used across the $K$ different tasks.

2.2 Sparse Multi-Category Additive Logistic Regression

In a K-category classification problem, we are given $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ is a $p$-dimensional predictor vector and $y_i = (y_i^{(1)}, \ldots, y_i^{(K-1)})^T$ is a $(K-1)$-dimensional response vector in which at most one element can be one, with all the others being zero. Here, we adopt the common "1-of-K" labeling convention, where $y_i^{(k)} = 1$ if $x_i$ has category $k$ and $y_i^{(k)} = 0$ otherwise; if all elements of $y_i$ are zero, then $x_i$ is assigned the $K$-th category. The multi-category additive logistic regression model is
$$ \mathbb{P}\big(Y^{(k)} = 1 \mid X = x\big) = \frac{\exp\big(f^{(k)}(x)\big)}{1 + \sum_{k'=1}^{K-1} \exp\big(f^{(k')}(x)\big)}, \qquad k = 1, \ldots, K-1 \qquad (4) $$
where $f^{(k)}(x) = \alpha^{(k)} + \sum_{j=1}^{p} f_j^{(k)}(x_j)$ has an additive form. We define $f = (f^{(1)}, \ldots, f^{(K-1)})$ to be a discriminant function and $p_f^{(k)}(x) = \mathbb{P}(Y^{(k)} = 1 \mid X = x)$ to be the conditional probability of category $k$ given $X = x$. The logistic regression classifier $h_f(\cdot)$ induced by $f$, which is a mapping from the sample space to the category labels, is simply given by $h_f(x) = \arg\max_{k=1,\ldots,K} p_f^{(k)}(x)$. If a variable $X_j$ is irrelevant, then all of the component functions $f_j^{(k)}$ are identically zero, for each $k = 1, 2, \ldots, K-1$. This motivates the use of the regularization functional $\Omega_{K-1}(f)$ to zero out entire vectors $f_j = (f_j^{(1)}, \ldots, f_j^{(K-1)})$. Denoting by
$$ \ell_f(x, y) = \sum_{k=1}^{K-1} y^{(k)} f^{(k)}(x) - \log\Big( 1 + \sum_{k=1}^{K-1} \exp\big(f^{(k)}(x)\big) \Big) $$
the multinomial log-likelihood, the sparse multi-category additive logistic regression estimator (SMALR) is formulated as the solution to the optimization problem
$$ \big(\hat f^{(1)}, \ldots, \hat f^{(K-1)}\big) = \arg\min_{f^{(1)}, \ldots, f^{(K-1)}}\ -\frac{1}{n} \sum_{i=1}^{n} \ell_f(x_i, y_i) + \lambda\, \Omega_{K-1}(f) \qquad (5) $$
where $f_j^{(k)} \in \mathcal{H}_j$ for $j = 1, \ldots, p$ and $k = 1, \ldots, K-1$.

3 Simultaneous Sparse Backfitting

We use a blockwise coordinate descent algorithm to minimize the functional defined in (3). We first formulate the population version of the problem by replacing sample averages by their expectations. We then derive stationary conditions for the optimum and obtain a population version of the algorithm that computes the solution through a series of soft-thresholded univariate conditional expectations. Finally, a finite sample version of the algorithm is obtained by plugging in nonparametric smoothers for these conditional expectations.
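As a concrete choice of such a smoother, the experiments in Section 4 use local linear smoothing with a Gaussian kernel. The sketch below constructs the corresponding smoother matrix for a single covariate; the implementation details (and the bandwidth handling) are ours, not the authors', and are included only to make the plug-in step tangible.

```python
import numpy as np

def local_linear_smoother(x, h):
    """Build the n x n local linear smoother matrix S for one covariate,
    using a Gaussian kernel with bandwidth h, so that P_hat = S @ R
    estimates E[R | X_j = x_i] at the observed points (a sketch)."""
    n = len(x)
    S = np.zeros((n, n))
    for i, x0 in enumerate(x):
        d = x - x0                               # centered design for the local fit
        w = np.exp(-0.5 * (d / h) ** 2)          # Gaussian kernel weights
        X0 = np.column_stack([np.ones(n), d])    # local linear basis [1, x - x0]
        A = X0.T * w                             # equals X0^T W
        # smoother row: e_1^T (X0^T W X0)^{-1} X0^T W
        S[i] = np.linalg.solve(A @ X0, A)[0]
    return S
```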
For the $j$-th block of component functions $f_j^{(1)}, \ldots, f_j^{(K)}$, let $R_j^{(k)} = Y^{(k)} - \sum_{l \neq j} f_l^{(k)}(X_l^{(k)})$ denote the partial residual. Holding all but the functions in the $j$-th block fixed, the optimization problem reduces to
$$ \big(\hat f_j^{(1)}, \ldots, \hat f_j^{(K)}\big) = \arg\min_{f_j^{(1)}, \ldots, f_j^{(K)}}\ \frac{1}{2} \sum_{k=1}^{K} \mathbb{E}\Big( R_j^{(k)} - f_j^{(k)}(X_j^{(k)}) \Big)^2 + \lambda \max_{k=1,\ldots,K} \big\| f_j^{(k)} \big\|. \qquad (6) $$
The following result characterizes the solution to (6).

Theorem 1. Let $P_j^{(k)} = \mathbb{E}\big[R_j^{(k)} \mid X_j^{(k)}\big]$ and $s_j^{(k)} = \|P_j^{(k)}\|$, and order the indices so that $s_j^{(k_1)} \ge s_j^{(k_2)} \ge \cdots \ge s_j^{(k_K)}$. Then the solution to (6) is given by
$$ \hat f_j^{(k_i)} = \begin{cases} P_j^{(k_i)} & \text{for } i > m^{\ast} \\[4pt] \Big[ \dfrac{1}{m^{\ast}} \Big( \sum_{i'=1}^{m^{\ast}} s_j^{(k_{i'})} - \lambda \Big) \Big]_+ \dfrac{P_j^{(k_i)}}{s_j^{(k_i)}} & \text{for } i \le m^{\ast} \end{cases} \qquad (7) $$
where $m^{\ast} = \arg\max_m \frac{1}{m}\big( \sum_{i'=1}^{m} s_j^{(k_{i'})} - \lambda \big)$ and $[\cdot]_+$ denotes the positive part.

Therefore, the optimization problem in (6) is solved by a soft-thresholding operator, given in equation (7), which we denote as
$$ \big(\hat f_j^{(1)}, \ldots, \hat f_j^{(K)}\big) = \mathrm{Soft}^{(\lambda)}\big[R_j^{(1)}, \ldots, R_j^{(K)}\big]. \qquad (8) $$
While the proof of this result is lengthy, we sketch the key steps below; they are a functional extension of the subdifferential calculus approach of Fornasier and Rauhut (2008) in the linear setting. First, we formulate an optimality condition in terms of the Gateaux derivative.

Lemma 2. The functions $f_j^{(k)}$ are solutions to (6) if and only if $f_j^{(k)} - P_j^{(k)} + \lambda u_k v_k = 0$ (almost surely) for $k = 1, \ldots, K$, where the $u_k$ are scalars and the $v_k$ are measurable functions of $X_j^{(k)}$, with
$$ (u_1, \ldots, u_K)^T \in \partial\|\cdot\|_\infty \big( \|f_j^{(1)}\|, \ldots, \|f_j^{(K)}\| \big)^T \quad \text{and} \quad v_k \in \partial \big\| f_j^{(k)} \big\|, \quad k = 1, \ldots, K. $$
Here the former is the subdifferential of the convex functional $\|\cdot\|_\infty$ evaluated at $(\|f_j^{(1)}\|, \ldots, \|f_j^{(K)}\|)^T$; it lies in a $K$-dimensional Euclidean space. The latter is the subdifferential of $\|f_j^{(k)}\|$, which is a set of functions. Next, the following proposition from Rockafellar and Wets (1998) is used to characterize the subdifferential of the sup-norm.

Lemma 3. The subdifferential of $\|\cdot\|_\infty$ on $\mathbb{R}^K$ is
$$ \partial\|\cdot\|_\infty(x) = \begin{cases} \mathrm{conv}\{\mathrm{sign}(x_k)\, e_k : |x_k| = \|x\|_\infty\} & \text{if } x \neq 0 \\ B_1(1) & \text{otherwise} \end{cases} $$
where $B_1(1)$ denotes the $\ell_1$ ball of radius one, $\mathrm{conv}(A)$ denotes the convex hull of $A$, and $e_k$ is the $k$-th canonical unit vector in $\mathbb{R}^K$.

Using Lemma 2 and Lemma 3, the proof of Theorem 1 proceeds by considering three cases for the sup-norm subdifferential evaluated at $(\|f_j^{(1)}\|, \ldots, \|f_j^{(K)}\|)^T$: (1) $\|f_j^{(k)}\| = 0$ for $k = 1, \ldots, K$; (2) there exists a unique $k$ such that $\|f_j^{(k)}\| = \max_{k'=1,\ldots,K} \|f_j^{(k')}\| \neq 0$; (3) there exist at least two indices $k \neq k'$ such that $\|f_j^{(k)}\| = \|f_j^{(k')}\| = \max_{m=1,\ldots,K} \|f_j^{(m)}\| \neq 0$. The derivations for cases (1) and (2) are relatively straightforward, but for case (3) we prove the following.

Lemma 4. The sup-norm is attained at precisely $m^{\ast}$ entries only if $s_j^{(k_1)}, \ldots, s_j^{(k_{m^{\ast}})}$ are the $m^{\ast}$ largest entries and $s_j^{(k_{m^{\ast}+1})} < \frac{1}{m^{\ast}}\big( \sum_{i=1}^{m^{\ast}} s_j^{(k_i)} - \lambda \big)$.

The proof of Theorem 1 then follows from the above lemmas and some calculus. Based on this result, the data version of the soft-thresholding operator is obtained by replacing the conditional expectation $P_j^{(k)} = \mathbb{E}\big(R_j^{(k)} \mid X_j^{(k)}\big)$ by $\mathcal{S}_j^{(k)} R_j^{(k)}$, where $\mathcal{S}_j^{(k)}$ is a nonparametric smoother for variable $X_j^{(k)}$, e.g., a local linear or spline smoother; see Figure 1. The resulting simultaneous sparse backfitting algorithm for multi-task and multi-response sparse additive models (MT-SpAM and MR-SpAM) is shown in Figure 2. The algorithm for the multi-response case (MR-SpAM) has $\mathcal{S}_j^{(1)} = \cdots = \mathcal{S}_j^{(K)}$, since there is a common design matrix.
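Before turning to the full algorithms, the following sketch implements the data version of the soft-thresholding operator of Theorem 1, i.e., steps (2)-(4) of Figure 1 once the residuals have already been smoothed; the vectorized layout and the small guard against division by zero are our own choices.

```python
import numpy as np

def soft_threshold_sup(P, lam):
    """Sup-norm soft-thresholding of Theorem 1 / Figure 1 (a sketch).

    P   : (K, n) array; row k holds the smoothed residual P_j^{(k)} = S_j^{(k)} R_j^{(k)}
          evaluated at the n sample points.
    lam : regularization parameter lambda.
    Returns the thresholded block (f_j^{(1)}, ..., f_j^{(K)}) as a (K, n) array.
    """
    s = np.sqrt(np.mean(P ** 2, axis=1))            # empirical norms s_j^{(k)} = ||P_j^{(k)}||_n
    order = np.argsort(-s)                          # indices k_1, ..., k_K in decreasing norm
    m_grid = np.arange(1, len(s) + 1)
    crit = (np.cumsum(s[order]) - lam) / m_grid     # (1/m)(sum_{i'<=m} s^{(k_i')} - lam)
    m_star = int(np.argmax(crit)) + 1               # m* maximizing the criterion
    level = max(crit[m_star - 1], 0.0)              # common shrunken norm, [.]_+
    F = P.copy()
    top = order[:m_star]                            # the m* largest-norm components
    denom = np.maximum(s[top], 1e-12)[:, None]      # guard against division by zero
    F[top] = level * P[top] / denom                 # shrink them to the common level
    F -= F.mean(axis=1, keepdims=True)              # center each component function
    return F
```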
SOFT-THRESHOLDING OPERATOR $\mathrm{Soft}^{(\lambda)}\big[R_j^{(1)}, \ldots, R_j^{(K)};\ \mathcal{S}_j^{(1)}, \ldots, \mathcal{S}_j^{(K)}\big]$: DATA VERSION

Input: Smoothing matrices $\mathcal{S}_j^{(k)}$ and residuals $R_j^{(k)}$ for $k = 1, \ldots, K$; regularization parameter $\lambda$.
(1) Estimate $P_j^{(k)} = \mathbb{E}\big[R_j^{(k)} \mid X_j^{(k)}\big]$ by smoothing: $\hat P_j^{(k)} = \mathcal{S}_j^{(k)} R_j^{(k)}$;
(2) Estimate the norms: $\hat s_j^{(k)} = \|\hat P_j^{(k)}\|_n$, and order the indices so that $\hat s_j^{(k_1)} \ge \hat s_j^{(k_2)} \ge \cdots \ge \hat s_j^{(k_K)}$;
(3) Find $\hat m = \arg\max_m \frac{1}{m}\big( \sum_{i'=1}^{m} \hat s_j^{(k_{i'})} - \lambda \big)$ and calculate
$$ \hat f_j^{(k_i)} = \begin{cases} \hat P_j^{(k_i)} & \text{for } i > \hat m \\[4pt] \Big[ \dfrac{1}{\hat m} \Big( \sum_{i'=1}^{\hat m} \hat s_j^{(k_{i'})} - \lambda \Big) \Big]_+ \dfrac{\hat P_j^{(k_i)}}{\hat s_j^{(k_i)}} & \text{for } i \le \hat m; \end{cases} $$
(4) Center: $\hat f_j^{(k)} \leftarrow \hat f_j^{(k)} - \mathrm{mean}\big(\hat f_j^{(k)}\big)$ for $k = 1, \ldots, K$.
Output: Functions $\hat f_j^{(k)}$ for $k = 1, \ldots, K$.

Figure 1: Data version of the soft-thresholding operator.

MULTI-TASK AND MULTI-RESPONSE SPAM

Input: Data $(x_i^{(k)}, y_i^{(k)})$, $i = 1, \ldots, n$, $k = 1, \ldots, K$, and regularization parameter $\lambda$.
Initialize: Set $\hat f_j^{(k)} = 0$ and compute the smoothers $\mathcal{S}_j^{(k)}$ for $j = 1, \ldots, p$ and $k = 1, \ldots, K$;
Iterate until convergence: for each $j = 1, \ldots, p$:
(1) Compute residuals: $R_j^{(k)} = y^{(k)} - \sum_{l \neq j} \hat f_l^{(k)}$ for $k = 1, \ldots, K$;
(2) Threshold: $\big(\hat f_j^{(1)}, \ldots, \hat f_j^{(K)}\big) \leftarrow \mathrm{Soft}^{(\lambda)}\big[R_j^{(1)}, \ldots, R_j^{(K)};\ \mathcal{S}_j^{(1)}, \ldots, \mathcal{S}_j^{(K)}\big]$.
Output: Functions $\hat f^{(k)}$ for $k = 1, \ldots, K$.

Figure 2: The simultaneous sparse backfitting algorithm for MT-SpAM or MR-SpAM. For the multi-response case, the same smoothing matrices are used for each $k$.

3.1 Penalized Local Scoring Algorithm for SMALR

We now derive a penalized local scoring algorithm for sparse multi-category additive logistic regression (SMALR), which can be viewed as a variant of Newton's method in function space. At each iteration, a quadratic approximation to the loss is used as a surrogate functional, with the regularization term added to induce joint functional sparsity. A technical difficulty, however, is that the approximate quadratic problem in each iteration is weighted by a non-diagonal matrix in function space, so a direct extension of the algorithm in Ravikumar et al. (2007) for sparse binary nonparametric logistic regression does not apply. To tackle this problem, we use an auxiliary function to lower bound the log-likelihood, as in Krishnapuram et al. (2005).

The population version of the log-likelihood is $L(f) = \mathbb{E}[\ell_f(X, Y)]$ with $f = (f^{(1)}, \ldots, f^{(K-1)})$. A second-order Lagrange-form Taylor expansion of $L(f)$ at $\tilde f$ is
$$ L(f) = L(\tilde f) + \mathbb{E}\big[\nabla \ell_{\tilde f}(X, Y)^T (f - \tilde f)\big] + \frac{1}{2}\, \mathbb{E}\big[(f - \tilde f)^T H(\check f)(f - \tilde f)\big] \qquad (9) $$
for some intermediate function $\check f$, where the gradient is $\nabla \ell_f(X, Y) = Y - p_f(X)$ with $p_f(X) = \big(\mathbb{P}_f(Y^{(1)} = 1 \mid X), \ldots, \mathbb{P}_f(Y^{(K-1)} = 1 \mid X)\big)^T$, and the Hessian is $H(f) = -\mathrm{diag}\big(p_f(X)\big) + p_f(X)\, p_f(X)^T$. Defining $B = -(1/4)\, I_{K-1}$, it is straightforward to show that $B \preceq H(\tilde f)$, i.e., $H(\tilde f) - B$ is positive semidefinite. Therefore, we have
$$ L(f) \ \ge\ L(\tilde f) + \mathbb{E}\big[\nabla \ell_{\tilde f}(X, Y)^T (f - \tilde f)\big] + \frac{1}{2}\, \mathbb{E}\big[(f - \tilde f)^T B (f - \tilde f)\big]. \qquad (10) $$

The following lemma results from a straightforward calculation.

Lemma 5. The solution $f$ that maximizes the right-hand side of (10) is the solution that minimizes $\frac{1}{2}\, \mathbb{E}\|Z - A f\|^2$, where $A = (-B)^{1/2}$ and $Z = A^{-1}\big(Y - p_{\tilde f}(X)\big) + A \tilde f$.

Recalling that $f^{(k)} = \alpha^{(k)} + \sum_{j=1}^{p} f_j^{(k)}$, equation (9) and Lemma 5 then justify the use of the auxiliary functional
$$ \frac{1}{2} \sum_{k=1}^{K-1} \mathbb{E}\Big( Z^{(k)} - \sum_{j=1}^{p} f_j^{(k)}(X_j) \Big)^2 + \gamma\, \Omega_{K-1}(f) \qquad (11) $$
where $Z^{(k)} = 4\big( Y^{(k)} - \mathbb{P}_{\tilde f}(Y^{(k)} = 1 \mid X) \big) + \tilde\alpha^{(k)} + \sum_{j=1}^{p} \tilde f_j^{(k)}(X_j)$ and $\gamma = 2\lambda$. This is precisely in the form of the multi-response SpAM optimization problem in equation (3). The resulting algorithm, in the finite sample case, is shown in Figure 3.
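As a small illustration of steps (1)-(2) of the local scoring iteration in Figure 3 below, the sketch computes the class probabilities in (4) and the transformed responses $Z$ from the current fit; the array shapes are assumptions, and the subsequent MR-SpAM fit with parameter $2\lambda$ is not shown.

```python
import numpy as np

def smalr_working_responses(F, alpha, Y):
    """Steps (1)-(2) of the SMALR local scoring iteration (a sketch).

    F     : (K-1, n) array of current additive fits  sum_j f_j^{(k)}(x_ij).
    alpha : (K-1,) array of current intercepts alpha^{(k)}.
    Y     : (K-1, n) array of 1-of-K indicator responses (category K is the baseline).
    Returns the class probabilities p (eq. 4) and the transformed responses Z,
    each of shape (K-1, n); (x_i, Z_i) would then be passed to MR-SpAM with 2*lambda.
    """
    eta = F + alpha[:, None]                               # f^{(k)}(x_i)
    p = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=0))      # probabilities as in (4)
    Z = 4.0 * (Y - p) + eta                                # transformed responses
    return p, Z
```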
SMALR: SPARSE MULTI-CATEGORY ADDITIVE LOGISTIC REGRESSION

Input: Data $(x_i, y_i)$, $i = 1, \ldots, n$, and regularization parameter $\lambda$.
Initialize: $\hat f_j^{(k)} = 0$ and $\hat\alpha^{(k)} = \log\Big( \sum_{i=1}^{n} y_i^{(k)} \Big/ \big( n - \sum_{i=1}^{n} \sum_{k'=1}^{K-1} y_i^{(k')} \big) \Big)$, $k = 1, \ldots, K-1$;
Iterate until convergence:
(1) Compute $p_{\hat f}^{(k)}(x_i) \equiv \mathbb{P}(Y^{(k)} = 1 \mid X = x_i)$ as in (4) for $k = 1, \ldots, K-1$;
(2) Calculate the transformed responses $Z_i^{(k)} = 4\big( y_i^{(k)} - p_{\hat f}^{(k)}(x_i) \big) + \hat\alpha^{(k)} + \sum_{j=1}^{p} \hat f_j^{(k)}(x_{ij})$ for $k = 1, \ldots, K-1$ and $i = 1, \ldots, n$;
(3) Call the subroutine $\big(\hat f^{(1)}, \ldots, \hat f^{(K-1)}\big) \leftarrow \text{MR-SpAM}\big( (x_i, Z_i)_{i=1}^{n},\ 2\lambda \big)$;
(4) Adjust the intercepts: $\hat\alpha^{(k)} \leftarrow \frac{1}{n} \sum_{i=1}^{n} Z_i^{(k)}$.
Output: Functions $\hat f^{(k)}$ and intercepts $\hat\alpha^{(k)}$ for $k = 1, \ldots, K-1$.

Figure 3: The penalized local scoring algorithm for SMALR.

4 Experiments

In this section, we first use simulated data to investigate the performance of the MT-SpAM simultaneous sparse backfitting algorithm. We then apply SMALR to a tumor classification and biomarker identification problem. In all experiments, the data are rescaled to lie in the $p$-dimensional cube $[0, 1]^p$, and we use local linear smoothing with a Gaussian kernel. To choose the regularization parameter $\lambda$, we simply use J-fold cross-validation or the GCV score of Ravikumar et al. (2007), extended to the multi-task setting:
$$ \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} Q_{\hat f^{(k)}}\big(x_i^{(k)}, y_i^{(k)}\big)}{\big( nK - \mathrm{df}(\lambda) \big)^2}, $$
where $\mathrm{df}(\lambda) = \sum_{k=1}^{K} \sum_{j=1}^{p} \nu_j^{(k)}\, \mathbb{I}\big( \|\hat f_j^{(k)}\|_n \neq 0 \big)$ and $\nu_j^{(k)} = \mathrm{trace}\big(\mathcal{S}_j^{(k)}\big)$ is the effective degrees of freedom of the univariate local linear smoother on the $j$-th variable.

4.1 Synthetic Data

We generated $n = 100$ observations from a 10-dimensional three-task additive model with four relevant variables: $y_i^{(k)} = \sum_{j=1}^{4} f_j^{(k)}(x_{ij}^{(k)}) + \epsilon_i^{(k)}$, $k = 1, 2, 3$, where $\epsilon_i^{(k)} \sim N(0, 1)$; the component functions $f_j^{(k)}$ are plotted in Figure 4. The 10-dimensional covariates are generated as $X_j^{(k)} = (W_j^{(k)} + t U^{(k)})/(1 + t)$, $j = 1, \ldots, 10$, where $W_1^{(k)}, \ldots, W_{10}^{(k)}$ and $U^{(k)}$ are i.i.d. sampled from Uniform(-2.5, 2.5). Thus, the correlation between $X_j^{(k)}$ and $X_{j'}^{(k)}$ is $t^2/(1 + t^2)$ for $j \neq j'$. The results of applying MT-SpAM with bandwidths $h = (0.08, \ldots, 0.08)$ and regularization parameter $\lambda = 0.25$ are summarized in Figure 4.
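For reference, a possible implementation of the synthetic design just described is sketched below; the four component functions used here are hypothetical stand-ins, since the true functions are only shown graphically in Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(f_list, n=100, p=10, t=0.0):
    """One task of the synthetic design of Section 4.1 (a sketch).

    X_j = (W_j + t*U) / (1 + t) with W_j, U i.i.d. Uniform(-2.5, 2.5),
    so that cor(X_j, X_j') = t^2 / (1 + t^2) for j != j'; the response adds
    the first four component functions plus N(0, 1) noise.
    """
    W = rng.uniform(-2.5, 2.5, size=(n, p))
    U = rng.uniform(-2.5, 2.5, size=(n, 1))
    X = (W + t * U) / (1.0 + t)
    y = sum(f(X[:, j]) for j, f in enumerate(f_list)) + rng.normal(size=n)
    return X, y

# Hypothetical component functions standing in for those plotted in Figure 4.
f_list = [np.sin, np.cos, lambda x: x ** 2 - 1.0, lambda x: np.tanh(x)]
X1, y1 = make_task(f_list, t=2.0)   # one of the three tasks
```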
Figure 4: (Top) Estimated versus true component functions from MT-SpAM; (Middle) regularization paths using MT-SpAM for t = 0, 2, 4; (Bottom) quantitative comparison between MR-SpAM and MARS, reproduced in the table below.

                      Variable selection accuracy        Estimation accuracy: MSE (sd)
    Model        t=0   t=1   t=2   t=3      t=0          t=1          t=2          t=3
    MR-SpAM       89    80    47    37   7.43 (0.71)  5.82 (0.60)  3.83 (0.37)  3.07 (0.30)
    MARS           0     0     0     0   8.66 (0.78)  7.52 (0.61)  5.36 (0.40)  4.64 (0.35)

The upper twelve panels of Figure 4 show the twelve relevant component functions for the three tasks; the estimated function components are plotted as solid black lines and the true function components as dashed red lines. For all the other variables (dimensions 5 through 10), both the true and estimated components are zero. The middle three panels show regularization paths as the parameter $\lambda$ varies; each curve plots the maximum empirical L1 norm of the component functions for one variable, with the red vertical line marking the model selected by the GCV score. As the correlation increases (t increases), the separation between the relevant and the irrelevant dimensions becomes smaller.

Using the same setup but with a common design matrix, we also compare the quantitative performance of MR-SpAM with MARS (Friedman, 1991), a popular method for multi-response additive regression. Over 100 simulations, the table in Figure 4 reports the number of times the correct model is identified and the mean squared error, with standard deviations in parentheses. (The MARS simulations are carried out in R, using the default options of the mars function in the mda library.)

4.2 Gene Microarray Data

We now apply the sparse multi-category additive logistic regression model to a microarray dataset for small round blue cell tumors (SRBCT) (Khan et al., 2001). The data consist of expression profiles of 2,308 genes, with tumors classified into four categories: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS). The dataset includes a training set of size 63 and a test set of size 20. These data have been analyzed by several groups; the main goal is to identify important biomarkers, a small set of genes that can accurately predict the type of tumor of a patient. To achieve 100% accuracy on the test data, Khan et al. (2001) use an artificial neural network approach and identify 96 genes. Tibshirani et al. (2002) identify a set of only 43 genes, using a method called nearest shrunken centroids. Zhang et al. (2008) identify 53 genes using the sup-norm support vector machine.
In our experiment, SMALR achieves 100% prediction accuracy on the test data with only 20 genes, a much smaller set of predictors than identified by the previous approaches. We follow the same procedure as in Zhang et al. (2008), and use a very simple screening step based on the marginal correlation to first reduce the number of genes to 500. The SMALR model is then trained using a plugin bandwidth $h_0 = 0.08$, and the regularization parameter is tuned using 4-fold cross-validation. The results are summarized in Figure 5.

Figure 5: SMALR results on gene data: heat map (left), marginal fits (center), and CV score (right). The 20 selected genes have cDNA image ids 770394, 377461, 1435862, 486110, 383188, 134748, 841620, 325182, 812105, 308231, 377048, 784224, 244618, 796258, 207274, 296448, 814526, 80649, 236282, and 701751.

In the left panel of Figure 5, we show a "heat map" of the selected variables on the training set. The rows represent the selected genes, labeled by their cDNA chip image ids, and the patients are grouped into the four tumor categories, as illustrated in the vertical groupings. The genes are ordered by hierarchical clustering of their expression profiles. The heat map clearly shows four block structures corresponding to the four tumor categories, which suggests visually that the 20 selected genes are highly informative of the tumor type. In the middle panel, we plot the fitted discriminant functions for several genes, with their image ids listed on the plot; the value k = 1, 2, or 3 under each subfigure indicates which discriminant function the plot represents. We see that the fitted functions are nonlinear. The right panel shows the total number of misclassified samples under 4-fold cross-validation as a function of $\lambda$; the values at $\lambda = 0.3$ and $\lambda = 0.4$ are both zero, and for the purpose of a sparser biomarker set we choose $\lambda = 0.4$. Interestingly, only 10 of the 20 genes identified by our method are among the 43 genes selected by the shrunken centroids approach of Tibshirani et al. (2002), and 16 of them are among the 96 genes selected by the neural network approach of Khan et al. (2001). This partial overlap may warrant further investigation.

5 Discussion and Acknowledgements

We have presented new approaches to fitting sparse nonparametric multi-task regression models and sparse multi-category classification models. Due to space constraints, we have not discussed results on the statistical properties of these methods, such as oracle inequalities and risk consistency; these theoretical results will be reported elsewhere. This research was supported in part by NSF grant CCF-0625879.
References

Fornasier, M. and Rauhut, H. (2008). Recovery algorithms for vector valued data with joint sparsity constraints. SIAM Journal on Numerical Analysis 46 577-613.

Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19 1-67.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7 673-679.

Krishnapuram, B., Carin, L., Figueiredo, M. and Hartemink, A. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 957-968.

Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2007). SpAM: Sparse additive models. In Advances in Neural Information Processing Systems 20. MIT Press.

Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis. Springer-Verlag.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences U.S.A. 99 6567-6572.

Tropp, J., Gilbert, A. C. and Strauss, M. J. (2006). Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing 86 572-588.

Turlach, B., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349-363.

Zhang, H. H., Liu, Y., Wu, Y. and Zhu, J. (2008). Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2 149-167.

Zhang, J. (2006). A probabilistic framework for multitask learning. Tech. Rep. CMU-LTI-06-006, Ph.D. thesis, Carnegie Mellon University.