SpAM: Sparse Additive Models

Pradeep Ravikumar, Han Liu, John Lafferty, Larry Wasserman
Machine Learning Department, Department of Statistics, and Computer Science Department
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data.

1 Introduction

Substantial progress has been made recently on the problem of fitting high dimensional linear regression models of the form $Y_i = X_i^\top \beta + \epsilon_i$, for $i = 1, \ldots, n$. Here $Y_i$ is a real-valued response, $X_i$ is a $p$-dimensional predictor and $\epsilon_i$ is a mean zero error term. Finding an estimate of $\beta$ when $p > n$ that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani, 1996) has been remarkably successful. The lasso estimator $\hat\beta$ minimizes the $\ell_1$-penalized sum of squares

$$\sum_i (Y_i - X_i^\top \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \qquad (1)$$

with the $\ell_1$ penalty $\|\beta\|_1$ encouraging sparse solutions, in which many components $\beta_j$ are zero. The good empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006).

The nonparametric regression model $Y_i = m(X_i) + \epsilon_i$, where $m$ is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the form

$$Y_i = \sum_{j=1}^p m_j(X_{ij}) + \epsilon_i \qquad (2)$$

which is less general, but can be more interpretable and easier to fit; in particular, an additive model can be estimated using a coordinate descent Gauss-Seidel procedure called backfitting. An extension of the additive model is the functional ANOVA model

$$Y_i = \sum_{1 \le j \le p} m_j(X_{ij}) + \sum_{j<k} m_{j,k}(X_{ij}, X_{ik}) + \sum_{j<k<\ell} m_{j,k,\ell}(X_{ij}, X_{ik}, X_{i\ell}) + \cdots + \epsilon_i \qquad (3)$$

which allows interactions among the variables. In this paper we introduce sparse additive models (SpAM), which combine ideas from sparse linear modeling and additive nonparametric regression, and we derive a backfitting procedure for fitting them that remains effective when $p > n$.

3 A Backfitting Algorithm for SpAM

The Lagrangian for the SpAM optimization problem (Q) is

$$\mathcal{L}(f, \lambda, \mu) = \tfrac{1}{2}\,\mathbb{E}\Bigl(Y - \sum_{j=1}^p f_j(X_j)\Bigr)^2 + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j\, \mathbb{E}(f_j). \qquad (8)$$

Let $R_j = Y - \sum_{k \ne j} f_k(X_k)$ denote the $j$-th partial residual and let $P_j = \mathbb{E}[R_j \mid X_j]$ denote its projection. Setting to zero the derivative of the Lagrangian with respect to $f_j$, and letting $v_j$ be an element of the subgradient $\partial \sqrt{\mathbb{E}(f_j^2)}$ (as in the logistic case treated in Section 3.1), gives the condition

$$\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2(X_j))}}\Bigr) f_j(X_j) = P_j(X_j) \quad \text{if } \sqrt{\mathbb{E}(P_j^2)} > \lambda, \qquad (11)$$

and $f_j = 0$ otherwise. Condition (11), in turn, implies

$$\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr)\sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} \quad \text{or} \quad \sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} - \lambda. \qquad (12)$$

Thus, we arrive at the following multiplicative soft-thresholding update for $f_j$:

$$f_j = \Bigl[1 - \frac{\lambda}{\sqrt{\mathbb{E}(P_j^2)}}\Bigr]_+ P_j \qquad (13)$$

where $[\cdot]_+$ denotes the positive part. In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection $\mathbb{E}[R_j \mid X_j]$ by a smooth of the residuals:

$$\hat P_j = \mathcal{S}_j R_j \qquad (14)$$

where $\mathcal{S}_j$ is a linear smoother, such as a local linear or kernel smoother. Let $\hat s_j$ be an estimate of $\sqrt{\mathbb{E}[P_j^2]}$. A simple but biased estimate is

$$\hat s_j = \frac{1}{\sqrt{n}}\,\|\hat P_j\|_2 = \sqrt{\operatorname{mean}(\hat P_j^2)}. \qquad (15)$$

More accurate estimators are possible; an example is given in the appendix.

Input: Data $(X_i, Y_i)$, regularization parameter $\lambda$.
Initialize $f_j = f_j^{(0)}$, for $j = 1, \ldots, p$.
Iterate until convergence; for each $j = 1, \ldots, p$:
  (1) Compute the residual: $R_j = Y - \sum_{k \ne j} f_k(X_k)$;
  (2) Estimate the projection $P_j = \mathbb{E}[R_j \mid X_j]$ by smoothing: $\hat P_j = \mathcal{S}_j R_j$;
  (3) Estimate the norm $s_j = \sqrt{\mathbb{E}[P_j^2]}$ using, for example, (15) or (35);
  (4) Soft-threshold: $f_j = [1 - \lambda/\hat s_j]_+ \hat P_j$;
  (5) Center: $f_j \leftarrow f_j - \operatorname{mean}(f_j)$.
Output: Component functions $f_j$ and estimator $\hat m(X_i) = \sum_j f_j(X_{ij})$.

Figure 1: The SpAM backfitting algorithm.
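To make Figure 1 concrete, the following is a minimal sketch of the backfitting loop in Python. It is not the authors' implementation: the smoother is supplied by the caller as a generic function, the simple norm estimate (15) is used in place of the more accurate estimator mentioned below, and the Nadaraya-Watson Gaussian kernel smoother (with an arbitrary bandwidth h) is only an illustrative choice.

```python
import numpy as np

def spam_backfit(X, Y, lam, smoother, n_iter=20):
    """Sketch of the SpAM backfitting algorithm of Figure 1.

    X: (n, p) covariate matrix, Y: (n,) response, lam: regularization
    parameter, smoother(x, r): any linear smoother returning the smooth
    of r against x (playing the role of S_j).
    """
    n, p = X.shape
    f = np.zeros((n, p))          # f[:, j] holds f_j evaluated at the sample points
    Yc = Y - Y.mean()             # work with a centered response
    for _ in range(n_iter):
        for j in range(p):
            R_j = Yc - f.sum(axis=1) + f[:, j]    # residual R_j = Y - sum_{k != j} f_k
            P_j = smoother(X[:, j], R_j)          # P_j = E[R_j | X_j] by smoothing (14)
            s_j = np.sqrt(np.mean(P_j ** 2))      # simple norm estimate (15)
            shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            f[:, j] = shrink * P_j                # soft-threshold update (13)
            f[:, j] -= f[:, j].mean()             # center the component
    return f

def gaussian_kernel_smoother(h):
    """Nadaraya-Watson smoother with a Gaussian kernel; h is a bandwidth choice."""
    def smooth(x, r):
        w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
        return (w @ r) / w.sum(axis=1)
    return smooth
```

A call such as `spam_backfit(X, Y, lam=0.1, smoother=gaussian_kernel_smoother(0.1))` returns the fitted component values; components that come back identically zero are the variables screened out by the soft-thresholding step.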
We have thus derived the SpAM backfitting algorithm given in Figure 1. While the motivating optimization problem (Q) is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity through a combination of soft-thresholding and smoothing. In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions.

3.1 SpAM for Nonparametric Logistic Regression

The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. The additive logistic model is

$$\mathbb{P}(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\bigl(\sum_{j=1}^p f_j(X_j)\bigr)}{1 + \exp\bigl(\sum_{j=1}^p f_j(X_j)\bigr)} \qquad (16)$$

where $Y \in \{0, 1\}$, and the population log-likelihood is $\ell(f) = \mathbb{E}[Y f(X) - \log(1 + \exp f(X))]$. Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999) in the logistic case, one runs the backfitting procedure within Newton's method. Here one iteratively computes the transformed response for the current estimate $f_0$

$$Z_i = f_0(X_i) + \frac{Y_i - p(X_i; f_0)}{p(X_i; f_0)\bigl(1 - p(X_i; f_0)\bigr)} \qquad (17)$$

and weights $w(X_i) = p(X_i; f_0)(1 - p(X_i; f_0))$, and carries out a weighted backfitting of $(Z, X)$ with weights $w$. The weighted smooth is given by

$$\hat P_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w}. \qquad (18)$$

To incorporate the sparsity penalty, we first note that the Lagrangian is given by

$$\mathcal{L}(f, \lambda, \mu) = \mathbb{E}\bigl[\log(1 + \exp f(X)) - Y f(X)\bigr] + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j\, \mathbb{E}(f_j) \qquad (19)$$

and the stationary condition for component function $f_j$ is $\mathbb{E}(p - Y \mid X_j) + \lambda v_j = 0$, where $v_j$ is an element of the subgradient $\partial \sqrt{\mathbb{E}(f_j^2)}$. As in the unregularized case, this condition is nonlinear in $f$, and so we linearize the gradient of the log-likelihood around $f_0$. This yields the linearized condition $\mathbb{E}[w(X)(f(X) - Z) \mid X_j] + \lambda v_j = 0$. When $\mathbb{E}(f_j^2) \ne 0$, this implies the condition

$$f_j(X_j)\Bigl(\mathbb{E}(w \mid X_j) + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr) = \mathbb{E}(w R_j \mid X_j). \qquad (20)$$

In the finite sample case, in terms of the smoothing matrix $\mathcal{S}_j$, this becomes

$$f_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda/\sqrt{\mathbb{E}(f_j^2)}}. \qquad (21)$$

If $\|\mathcal{S}_j(w R_j)\|_2 < \lambda$, then $f_j = 0$. Otherwise, this implicit, nonlinear equation for $f_j$ cannot be solved explicitly, so we propose to iterate until convergence:

$$f_j \leftarrow \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda\sqrt{n}/\|f_j\|_2}. \qquad (22)$$

When $\lambda = 0$, this yields the standard local scoring update (18). An example of logistic SpAM is given in Section 5.

4 Properties of SpAM

4.1 SpAM is Persistent

The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. Let $(X, Y)$ denote a new pair (independent of the observed data) and define the predictive risk when predicting $Y$ with $f(X)$ by

$$R(f) = \mathbb{E}\bigl(Y - f(X)\bigr)^2. \qquad (23)$$

Since we consider predictors of the form $f(x) = \sum_j \beta_j g_j(x_j)$, we also write the risk as $R(\beta, g)$, where $\beta = (\beta_1, \ldots, \beta_p)$ and $g = (g_1, \ldots, g_p)$. Following Greenshtein and Ritov (2004), we say that an estimator $\hat m_n$ is persistent relative to a class of functions $\mathcal{M}_n$ if

$$R(\hat m_n) - R(m_n^*) \xrightarrow{P} 0 \qquad (24)$$

where $m_n^* = \operatorname{argmin}_{f \in \mathcal{M}_n} R(f)$ is the predictive oracle.
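To connect the definition (23) to the estimators above, the predictive risk can be approximated on an independent sample. The sketch below is illustrative only: the interpolation-based predictor and the function names are not from the paper, and persistence itself concerns the gap between $R(\hat m_n)$ and the oracle risk over $\mathcal{M}_n$, which this code only lets one estimate empirically.

```python
import numpy as np

def additive_predictor(X_train, f_hat, y_mean=0.0):
    """Turn fitted component values f_hat[:, j] (e.g. from the backfitting
    sketch above) into a predictor via 1-d interpolation in each coordinate.
    y_mean should be the training mean of Y if the components were fit to a
    centered response."""
    def predict(X_new):
        out = np.full(X_new.shape[0], y_mean, dtype=float)
        for j in range(X_train.shape[1]):
            order = np.argsort(X_train[:, j])
            out += np.interp(X_new[:, j], X_train[order, j], f_hat[order, j])
        return out
    return predict

def predictive_risk(predict, X_new, Y_new):
    """Monte Carlo estimate of the predictive risk R(f) = E(Y - f(X))^2 in (23),
    computed on a pair (X_new, Y_new) independent of the training data."""
    return np.mean((Y_new - predict(X_new)) ** 2)
```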
Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models $\mathcal{M}_n = \{f(x) = x^\top\beta : \|\beta\|_1 \le L_n\}$ if $L_n = o\bigl((n/\log n)^{1/4}\bigr)$. We show a similar result for SpAM.

Theorem 4.1. Suppose that $p_n \le e^{n\xi}$ for some $\xi < 1$. Then SpAM is persistent relative to the class of additive models $\mathcal{M}_n = \bigl\{f(x) = \sum_{j=1}^p \beta_j g_j(x_j) : \|\beta\|_1 \le L_n\bigr\}$ if $L_n = o\bigl(n^{(1-\xi)/4}\bigr)$.

4.2 SpAM is Sparsistent

In the case of linear regression, with $m_j(X_j) = \beta_j^\top X_j$, Wainwright (2006) shows that under certain conditions on $n$, $p$, $s = |\operatorname{supp}(\beta)|$, and the design matrix $X$, the lasso recovers the sparsity pattern asymptotically; that is, the lasso estimator $\hat\beta_n$ is sparsistent: $\mathbb{P}\bigl(\operatorname{supp}(\beta) = \operatorname{supp}(\hat\beta_n)\bigr) \to 1$. We show a similar result for SpAM with the sparse backfitting procedure.

For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. For each $j = 1, \ldots, p$ let $\psi_j$ be an orthogonal basis for $\mathcal{H}_j$. We truncate the basis to finite dimension $d_n$, and let $d_n \to \infty$ such that $d_n/n \to 0$. Let $\Psi_j$ denote the $n \times d_n$ matrix $\Psi_j(i, k) = \psi_{jk}(X_{ij})$. If $A \subseteq \{1, \ldots, p\}$, we denote by $\Psi_A$ the $n \times d_n|A|$ matrix in which, for each $j \in A$, $\Psi_j$ appears as a submatrix in the natural way. The SpAM optimization problem can then be written as

$$\min_{\beta}\; \frac{1}{2n}\Bigl\|Y - \sum_{j=1}^p \Psi_j \beta_j\Bigr\|_2^2 + \lambda_n \sum_{j=1}^p \sqrt{\frac{1}{n}\,\beta_j^\top \Psi_j^\top \Psi_j \beta_j} \qquad (25)$$

where each $\beta_j$ is a $d_n$-dimensional vector. Let $S$ denote the true set of variables $\{j : m_j \ne 0\}$, with $s = |S|$, and let $S^c$ denote its complement. Let $\hat S_n = \{j : \hat\beta_j \ne 0\}$ denote the estimated set of variables from the minimizer $\hat\beta_n$ of (25).

Theorem 4.2. Suppose that $\Psi$ satisfies the conditions

$$\Lambda_{\max}\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr) \le C_{\max} < \infty \quad \text{and} \quad \Lambda_{\min}\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr) \ge C_{\min} > 0 \qquad (26)$$

$$\Bigl\|\Bigl(\frac{1}{n}\Psi_{S^c}^\top \Psi_S\Bigr)\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr)^{-1}\Bigr\| \le \sqrt{\frac{C_{\min}}{C_{\max}}}\,\frac{1-\delta}{\sqrt{s}}, \quad \text{for some } 0 < \delta \le 1. \qquad (27)$$

Let the regularization parameter $\lambda_n \to 0$ be chosen to satisfy

$$\lambda_n \sqrt{s} \to 0, \qquad \frac{s\, d_n}{n} \to 0, \qquad \text{and} \qquad \frac{d_n\bigl(\log d_n + \log(p-s)\bigr)}{n\,\lambda_n^2} \to 0.$$

Then SpAM is sparsistent: $\mathbb{P}\bigl(\hat S_n = S\bigr) \to 1$. \qquad (28)

5 Experiments

In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm under various conditions. We first use simulated data, where the true sparsity pattern is known, to investigate the performance of the SpAM backfitting algorithm. We then apply SpAM to some real data. Unless explicitly stated otherwise, the data are rescaled to lie in the $d$-dimensional cube $[0,1]^d$, and a kernel smoother with Gaussian kernel is used. To tune the penalization parameter $\lambda$, we use a $C_p$ statistic, defined as

$$C_p(\hat f) = \frac{1}{n}\sum_{i=1}^n \Bigl(Y_i - \sum_{j=1}^p \hat f_j(X_{ij})\Bigr)^2 + \frac{2\hat\sigma^2}{n}\sum_{j=1}^p \operatorname{trace}(\mathcal{S}_j)\, 1[\hat f_j \ne 0] \qquad (29)$$

where $\mathcal{S}_j$ is the smoothing matrix for the $j$-th dimension and $\hat\sigma^2$ is the estimated variance.

5.1 Simulations

We first apply SpAM to an example from (Härdle et al., 2004). A dataset with sample size $n = 150$ is generated from the following 200-dimensional additive model:

$$Y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + \epsilon_i \qquad (30)$$

$$f_1(x) = -2\sin(2x), \quad f_2(x) = x^2 - \tfrac{1}{3}, \quad f_3(x) = x - \tfrac{1}{2}, \quad f_4(x) = e^{-x} + e^{-1} - 1, \qquad (31)$$

and $f_j(x) = 0$ for $j \ge 5$, with noise $\epsilon_i \sim N(0, 1)$. These data therefore have 196 irrelevant dimensions.
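For concreteness, the simulated data of (30)-(31) can be generated as follows. This is a sketch of the stated model; the paper does not specify the design distribution beyond rescaling the data to $[0,1]^d$, so Uniform(0,1) covariates are assumed here.

```python
import numpy as np

def generate_simulation(n=150, p=200, seed=0):
    """Synthetic data from the additive model (30)-(31): four relevant
    components, p - 4 irrelevant dimensions, and N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))   # assumed design; the paper rescales to [0, 1]^d
    f1 = -2.0 * np.sin(2.0 * X[:, 0])
    f2 = X[:, 1] ** 2 - 1.0 / 3.0
    f3 = X[:, 2] - 0.5
    f4 = np.exp(-X[:, 3]) + np.exp(-1.0) - 1.0
    Y = f1 + f2 + f3 + f4 + rng.standard_normal(n)
    return X, Y
```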
The results of applying SpAM with plug-in bandwidths are summarized in Figure 2.

[Figure 2 appears here; the upper right panel compares p = 128 and p = 256.]

Figure 2: (Simulated data) Upper left: The empirical $\ell_2$ norm of the estimated components plotted against the tuning parameter $\lambda$; the value on the x-axis is proportional to $\sum_j \|\hat f_j\|_2$. Upper center: The $C_p$ scores against the tuning parameter $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ with the smallest $C_p$ score. Upper right: The proportion of 200 trials in which the correct relevant variables are selected, as a function of sample size $n$. Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero.

5.2 Boston Housing

The Boston housing data were collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. The dataset has been studied by many other authors (Härdle et al., 2004; Lin and Zhang, 2006), with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we add 20 irrelevant variables: ten of them are drawn randomly from Uniform(0, 1), and the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities. The full model (containing all 10 chosen covariates) for the Boston housing data is

$$\mathrm{medv} = \alpha + f_1(\mathrm{crim}) + f_2(\mathrm{indus}) + f_3(\mathrm{nox}) + f_4(\mathrm{rm}) + f_5(\mathrm{age}) + f_6(\mathrm{dis}) + f_7(\mathrm{tax}) + f_8(\mathrm{ptratio}) + f_9(\mathrm{b}) + f_{10}(\mathrm{lstat}). \qquad (32)$$

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 3. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim; the importance of the variables nox and b is borderline. These results are basically consistent with those obtained by other authors (Härdle et al., 2004). However, using $C_p$ as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

Figure 3: (Boston housing) Left: The empirical $\ell_2$ norm of the estimated components versus the regularization parameter $\lambda$. Center: The $C_p$ scores against $\lambda$; the dashed vertical line corresponds to the best $C_p$ score. Right: Additive fits for four relevant variables.

5.3 SpAM for Spam

Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3.1. This dataset has been studied by Hastie et al. (2001), using a set of 3,065 emails as a training set and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with $p = 57$ attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters. To demonstrate how SpAM performs with limited data, we sample only $n = 300$ emails as the training set, with the remaining 4,301 data points used as the test set. We also use the test data as the hold-out set to tune the penalization parameter $\lambda$.
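The hold-out tuning of $\lambda$ described above amounts to a simple grid search. The sketch below assumes a hypothetical fitting routine `fit(X, y, lam)` for logistic SpAM that returns an object with a `predict_proba(X)` method giving $\mathbb{P}(Y = 1 \mid X)$; neither name comes from the paper.

```python
import numpy as np

def tune_lambda(lams, fit, X_train, y_train, X_holdout, y_holdout):
    """Choose the penalization parameter by minimizing 0/1 error on a hold-out set."""
    best_lam, best_err = None, np.inf
    for lam in lams:
        model = fit(X_train, y_train, lam)
        y_pred = model.predict_proba(X_holdout) > 0.5    # threshold P(Y = 1 | X)
        err = np.mean(y_pred != y_holdout)               # hold-out misclassification rate
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# For example, lams = np.arange(2.0, 5.6, 0.5) * 1e-3 matches the grid of lambda
# values reported in Figure 4.
```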
λ (×10⁻³) | Error  | # zeros | Selected variables
5.5       | 0.2009 | 55      | {8, 54}
5.0       | 0.1725 | 51      | {8, 9, 27, 53, 54, 57}
4.5       | 0.1354 | 46      | {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
4.0       | 0.1083 | 20      | {4, 6–10, 14–22, 26, 27, 38, 53–58}
3.5       | 0.1117 | 0       | ALL
3.0       | 0.1174 | 0       | ALL
2.5       | 0.1251 | 0       | ALL
2.0       | 0.1259 | 0       | ALL

[Plot: empirical prediction error versus the penalization parameter λ.]

Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM.

The results of a typical run of logistic SpAM, using plug-in bandwidths, are summarized in Figure 4. It is interesting to note that even with this relatively small sample size, logistic SpAM recovers a sparsity pattern that is consistent with previous authors' results. For example, in the best model chosen by logistic SpAM according to error rate, the 33 selected variables cover 80% of the significant predictors as determined by Hastie et al. (2001).

References

Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971–988.

Härdle, W., Müller, M., Sperlich, S. and Werwatz, A. (2004). Nonparametric and Semiparametric Models. Springer-Verlag.

Hastie, T. and Tibshirani, R. (1999). Generalized Additive Models. Chapman & Hall.

Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.

Meinshausen, N. and Yu, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.

Wainwright, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Tech. Rep. 709, Department of Statistics, UC Berkeley.

Yuan, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Proceedings of AI and Statistics (AISTATS).

Zhao, P. and Yu, B. (2007). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2567.