SpAM: Sparse Additive Models

Pradeep Ravikumar, Han Liu, John Lafferty, Larry Wasserman
Machine Learning Department, Department of Statistics, and Computer Science Department
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data.

1 Introduction

Substantial progress has been made recently on the problem of fitting high dimensional linear regression models of the form $Y_i = X_i^\top \beta + \epsilon_i$, for $i = 1, \ldots, n$. Here $Y_i$ is a real-valued response, $X_i$ is a $p$-dimensional predictor and $\epsilon_i$ is a mean zero error term. Finding an estimate of $\beta$ when $p > n$ that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani, 1996) has been remarkably successful. The lasso estimator $\hat\beta$ minimizes the $\ell_1$-penalized sum of squares

$$\sum_i (Y_i - X_i^\top \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \qquad (1)$$

with the $\ell_1$ penalty $\|\beta\|_1$ encouraging sparse solutions, in which many components $\beta_j$ are zero. The good empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006).

The nonparametric regression model $Y_i = m(X_i) + \epsilon_i$, where $m$ is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the form

$$Y_i = \sum_{j=1}^p m_j(X_{ij}) + \epsilon_i \qquad (2)$$

which is less general, but can be more interpretable and easier to fit; in particular, an additive model can be estimated using a coordinate descent Gauss-Seidel procedure called backfitting. An extension of the additive model is the functional ANOVA model

$$Y_i = \sum_{1 \le j \le p} m_j(X_{ij}) + \sum_{j<k} m_{j,k}(X_{ij}, X_{ik}) + \sum_{j<k<\ell} m_{j,k,\ell}(X_{ij}, X_{ik}, X_{i\ell}) + \cdots + \epsilon_i \qquad (3)$$

which allows interactions among the variables. In this paper we introduce sparse additive models (SpAM), which combine ideas from sparse linear modeling and additive nonparametric regression, and we derive a backfitting procedure for fitting them that remains effective when $p > n$.

3 A Backfitting Algorithm for SpAM

The Lagrangian for the SpAM optimization problem (Q) is

$$\mathcal{L}(f, \lambda, \mu) = \tfrac{1}{2}\,\mathbb{E}\Bigl(Y - \sum_{j=1}^p f_j(X_j)\Bigr)^2 + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j\, \mathbb{E}(f_j). \qquad (8)$$

Let $R_j = Y - \sum_{k \ne j} f_k(X_k)$ denote the $j$-th partial residual and let $P_j = \mathbb{E}[R_j \mid X_j]$ denote its projection. Setting to zero the derivative of the Lagrangian with respect to $f_j$, and letting $v_j$ be an element of the subgradient $\partial \sqrt{\mathbb{E}(f_j^2)}$ (as in the logistic case treated in Section 3.1), gives the condition

$$\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2(X_j))}}\Bigr) f_j(X_j) = P_j(X_j) \quad \text{if } \sqrt{\mathbb{E}(P_j^2)} > \lambda, \qquad (11)$$

and $f_j = 0$ otherwise. Condition (11), in turn, implies

$$\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr)\sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} \quad \text{or} \quad \sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} - \lambda. \qquad (12)$$

Thus, we arrive at the following multiplicative soft-thresholding update for $f_j$:

$$f_j = \Bigl[1 - \frac{\lambda}{\sqrt{\mathbb{E}(P_j^2)}}\Bigr]_+ P_j \qquad (13)$$

where $[\cdot]_+$ denotes the positive part. In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection $\mathbb{E}[R_j \mid X_j]$ by a smooth of the residuals:

$$\hat P_j = \mathcal{S}_j R_j \qquad (14)$$

where $\mathcal{S}_j$ is a linear smoother, such as a local linear or kernel smoother. Let $\hat s_j$ be an estimate of $\sqrt{\mathbb{E}[P_j^2]}$. A simple but biased estimate is

$$\hat s_j = \frac{1}{\sqrt{n}}\,\|\hat P_j\|_2 = \sqrt{\operatorname{mean}(\hat P_j^2)}. \qquad (15)$$

More accurate estimators are possible; an example is given in the appendix.

Input: Data $(X_i, Y_i)$, regularization parameter $\lambda$.
Initialize $f_j = f_j^{(0)}$, for $j = 1, \ldots, p$.
Iterate until convergence; for each $j = 1, \ldots, p$:
  (1) Compute the residual: $R_j = Y - \sum_{k \ne j} f_k(X_k)$;
  (2) Estimate the projection $P_j = \mathbb{E}[R_j \mid X_j]$ by smoothing: $\hat P_j = \mathcal{S}_j R_j$;
  (3) Estimate the norm $s_j = \sqrt{\mathbb{E}[P_j^2]}$ using, for example, (15) or (35);
  (4) Soft-threshold: $f_j = [1 - \lambda/\hat s_j]_+ \hat P_j$;
  (5) Center: $f_j \leftarrow f_j - \operatorname{mean}(f_j)$.
Output: Component functions $f_j$ and estimator $\hat m(X_i) = \sum_j f_j(X_{ij})$.

Figure 1: The SpAM backfitting algorithm.
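To make Figure 1 concrete, the following is a minimal sketch of the backfitting loop in Python. It is not the authors' implementation: the smoother is supplied by the caller as a generic function, the simple norm estimate (15) is used in place of the more accurate estimator mentioned below, and the Nadaraya-Watson Gaussian kernel smoother (with an arbitrary bandwidth h) is only an illustrative choice.

```python
import numpy as np

def spam_backfit(X, Y, lam, smoother, n_iter=20):
    """Sketch of the SpAM backfitting algorithm of Figure 1.

    X: (n, p) covariate matrix, Y: (n,) response, lam: regularization
    parameter, smoother(x, r): any linear smoother returning the smooth
    of r against x (playing the role of S_j).
    """
    n, p = X.shape
    f = np.zeros((n, p))          # f[:, j] holds f_j evaluated at the sample points
    Yc = Y - Y.mean()             # work with a centered response
    for _ in range(n_iter):
        for j in range(p):
            R_j = Yc - f.sum(axis=1) + f[:, j]    # residual R_j = Y - sum_{k != j} f_k
            P_j = smoother(X[:, j], R_j)          # P_j = E[R_j | X_j] by smoothing (14)
            s_j = np.sqrt(np.mean(P_j ** 2))      # simple norm estimate (15)
            shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            f[:, j] = shrink * P_j                # soft-threshold update (13)
            f[:, j] -= f[:, j].mean()             # center the component
    return f

def gaussian_kernel_smoother(h):
    """Nadaraya-Watson smoother with a Gaussian kernel; h is a bandwidth choice."""
    def smooth(x, r):
        w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
        return (w @ r) / w.sum(axis=1)
    return smooth
```

A call such as `spam_backfit(X, Y, lam=0.1, smoother=gaussian_kernel_smoother(0.1))` returns the fitted component values; components that come back identically zero are the variables screened out by the soft-thresholding step.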
We have thus derived the SpAM backfitting algorithm given in Figure 1. While the motivating optimization problem (Q) is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity through a combination of soft-thresholding and smoothing. In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions.

3.1 SpAM for Nonparametric Logistic Regression

The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. The additive logistic model is

$$\mathbb{P}(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\bigl(\sum_{j=1}^p f_j(X_j)\bigr)}{1 + \exp\bigl(\sum_{j=1}^p f_j(X_j)\bigr)} \qquad (16)$$

where $Y \in \{0, 1\}$, and the population log-likelihood is $\ell(f) = \mathbb{E}[Y f(X) - \log(1 + \exp f(X))]$. Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999) in the logistic case, one runs the backfitting procedure within Newton's method. Here one iteratively computes the transformed response for the current estimate $f_0$

$$Z_i = f_0(X_i) + \frac{Y_i - p(X_i; f_0)}{p(X_i; f_0)\bigl(1 - p(X_i; f_0)\bigr)} \qquad (17)$$

and weights $w(X_i) = p(X_i; f_0)(1 - p(X_i; f_0))$, and carries out a weighted backfitting of $(Z, X)$ with weights $w$. The weighted smooth is given by

$$\hat P_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w}. \qquad (18)$$

To incorporate the sparsity penalty, we first note that the Lagrangian is given by

$$\mathcal{L}(f, \lambda, \mu) = \mathbb{E}\bigl[\log(1 + \exp f(X)) - Y f(X)\bigr] + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j\, \mathbb{E}(f_j) \qquad (19)$$

and the stationary condition for component function $f_j$ is $\mathbb{E}(p - Y \mid X_j) + \lambda v_j = 0$, where $v_j$ is an element of the subgradient $\partial \sqrt{\mathbb{E}(f_j^2)}$. As in the unregularized case, this condition is nonlinear in $f$, and so we linearize the gradient of the log-likelihood around $f_0$. This yields the linearized condition $\mathbb{E}[w(X)(f(X) - Z) \mid X_j] + \lambda v_j = 0$. When $\mathbb{E}(f_j^2) \ne 0$, this implies the condition

$$f_j(X_j)\Bigl(\mathbb{E}(w \mid X_j) + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr) = \mathbb{E}(w R_j \mid X_j). \qquad (20)$$

In the finite sample case, in terms of the smoothing matrix $\mathcal{S}_j$, this becomes

$$f_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda/\sqrt{\mathbb{E}(f_j^2)}}. \qquad (21)$$

If $\|\mathcal{S}_j(w R_j)\|_2 < \lambda$, then $f_j = 0$. Otherwise, this implicit, nonlinear equation for $f_j$ cannot be solved explicitly, so we propose to iterate until convergence:

$$f_j \leftarrow \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda\sqrt{n}/\|f_j\|_2}. \qquad (22)$$

When $\lambda = 0$, this yields the standard local scoring update (18). An example of logistic SpAM is given in Section 5.

4 Properties of SpAM

4.1 SpAM is Persistent

The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. Let $(X, Y)$ denote a new pair (independent of the observed data) and define the predictive risk when predicting $Y$ with $f(X)$ by

$$R(f) = \mathbb{E}\bigl(Y - f(X)\bigr)^2. \qquad (23)$$

Since we consider predictors of the form $f(x) = \sum_j \beta_j g_j(x_j)$, we also write the risk as $R(\beta, g)$, where $\beta = (\beta_1, \ldots, \beta_p)$ and $g = (g_1, \ldots, g_p)$. Following Greenshtein and Ritov (2004), we say that an estimator $\hat m_n$ is persistent relative to a class of functions $\mathcal{M}_n$ if

$$R(\hat m_n) - R(m_n^*) \xrightarrow{P} 0 \qquad (24)$$

where $m_n^* = \operatorname{argmin}_{f \in \mathcal{M}_n} R(f)$ is the predictive oracle.
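To connect the definition (23) to the estimators above, the predictive risk can be approximated on an independent sample. The sketch below is illustrative only: the interpolation-based predictor and the function names are not from the paper, and persistence itself concerns the gap between $R(\hat m_n)$ and the oracle risk over $\mathcal{M}_n$, which this code only lets one estimate empirically.

```python
import numpy as np

def additive_predictor(X_train, f_hat, y_mean=0.0):
    """Turn fitted component values f_hat[:, j] (e.g. from the backfitting
    sketch above) into a predictor via 1-d interpolation in each coordinate.
    y_mean should be the training mean of Y if the components were fit to a
    centered response."""
    def predict(X_new):
        out = np.full(X_new.shape[0], y_mean, dtype=float)
        for j in range(X_train.shape[1]):
            order = np.argsort(X_train[:, j])
            out += np.interp(X_new[:, j], X_train[order, j], f_hat[order, j])
        return out
    return predict

def predictive_risk(predict, X_new, Y_new):
    """Monte Carlo estimate of the predictive risk R(f) = E(Y - f(X))^2 in (23),
    computed on a pair (X_new, Y_new) independent of the training data."""
    return np.mean((Y_new - predict(X_new)) ** 2)
```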
Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models $\mathcal{M}_n = \{f(x) = x^\top\beta : \|\beta\|_1 \le L_n\}$ if $L_n = o\bigl((n/\log n)^{1/4}\bigr)$. We show a similar result for SpAM.

Theorem 4.1. Suppose that $p_n \le e^{n\xi}$ for some $\xi < 1$. Then SpAM is persistent relative to the class of additive models $\mathcal{M}_n = \bigl\{f(x) = \sum_{j=1}^p \beta_j g_j(x_j) : \|\beta\|_1 \le L_n\bigr\}$ if $L_n = o\bigl(n^{(1-\xi)/4}\bigr)$.

4.2 SpAM is Sparsistent

In the case of linear regression, with $m_j(X_j) = \beta_j^\top X_j$, Wainwright (2006) shows that under certain conditions on $n$, $p$, $s = |\operatorname{supp}(\beta)|$, and the design matrix $X$, the lasso recovers the sparsity pattern asymptotically; that is, the lasso estimator $\hat\beta_n$ is sparsistent: $\mathbb{P}\bigl(\operatorname{supp}(\beta) = \operatorname{supp}(\hat\beta_n)\bigr) \to 1$. We show a similar result for SpAM with the sparse backfitting procedure.

For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. For each $j = 1, \ldots, p$ let $\psi_j$ be an orthogonal basis for $\mathcal{H}_j$. We truncate the basis to finite dimension $d_n$, and let $d_n \to \infty$ such that $d_n/n \to 0$. Let $\Psi_j$ denote the $n \times d_n$ matrix $\Psi_j(i, k) = \psi_{jk}(X_{ij})$. If $A \subseteq \{1, \ldots, p\}$, we denote by $\Psi_A$ the $n \times d_n|A|$ matrix in which, for each $j \in A$, $\Psi_j$ appears as a submatrix in the natural way. The SpAM optimization problem can then be written as

$$\min_{\beta}\; \frac{1}{2n}\Bigl\|Y - \sum_{j=1}^p \Psi_j \beta_j\Bigr\|_2^2 + \lambda_n \sum_{j=1}^p \sqrt{\frac{1}{n}\,\beta_j^\top \Psi_j^\top \Psi_j \beta_j} \qquad (25)$$

where each $\beta_j$ is a $d_n$-dimensional vector. Let $S$ denote the true set of variables $\{j : m_j \ne 0\}$, with $s = |S|$, and let $S^c$ denote its complement. Let $\hat S_n = \{j : \hat\beta_j \ne 0\}$ denote the estimated set of variables from the minimizer $\hat\beta_n$ of (25).

Theorem 4.2. Suppose that $\Psi$ satisfies the conditions

$$\Lambda_{\max}\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr) \le C_{\max} < \infty \quad \text{and} \quad \Lambda_{\min}\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr) \ge C_{\min} > 0 \qquad (26)$$

$$\Bigl\|\Bigl(\frac{1}{n}\Psi_{S^c}^\top \Psi_S\Bigr)\Bigl(\frac{1}{n}\Psi_S^\top \Psi_S\Bigr)^{-1}\Bigr\| \le \sqrt{\frac{C_{\min}}{C_{\max}}}\,\frac{1-\delta}{\sqrt{s}}, \quad \text{for some } 0 < \delta \le 1. \qquad (27)$$

Let the regularization parameter $\lambda_n \to 0$ be chosen to satisfy

$$\lambda_n \sqrt{s} \to 0, \qquad \frac{s\, d_n}{n} \to 0, \qquad \text{and} \qquad \frac{d_n\bigl(\log d_n + \log(p-s)\bigr)}{n\,\lambda_n^2} \to 0.$$

Then SpAM is sparsistent: $\mathbb{P}\bigl(\hat S_n = S\bigr) \to 1$. \qquad (28)

5 Experiments

In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm under various conditions. We first use simulated data, where the true sparsity pattern is known, to investigate the performance of the SpAM backfitting algorithm. We then apply SpAM to some real data. Unless explicitly stated otherwise, the data are rescaled to lie in the $d$-dimensional cube $[0,1]^d$, and a kernel smoother with Gaussian kernel is used. To tune the penalization parameter $\lambda$, we use a $C_p$ statistic, defined as

$$C_p(\hat f) = \frac{1}{n}\sum_{i=1}^n \Bigl(Y_i - \sum_{j=1}^p \hat f_j(X_{ij})\Bigr)^2 + \frac{2\hat\sigma^2}{n}\sum_{j=1}^p \operatorname{trace}(\mathcal{S}_j)\, 1[\hat f_j \ne 0] \qquad (29)$$

where $\mathcal{S}_j$ is the smoothing matrix for the $j$-th dimension and $\hat\sigma^2$ is the estimated variance.

5.1 Simulations

We first apply SpAM to an example from (Härdle et al., 2004). A dataset with sample size $n = 150$ is generated from the following 200-dimensional additive model:

$$Y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + \epsilon_i \qquad (30)$$

$$f_1(x) = -2\sin(2x), \quad f_2(x) = x^2 - \tfrac{1}{3}, \quad f_3(x) = x - \tfrac{1}{2}, \quad f_4(x) = e^{-x} + e^{-1} - 1, \qquad (31)$$

and $f_j(x) = 0$ for $j \ge 5$, with noise $\epsilon_i \sim N(0, 1)$. These data therefore have 196 irrelevant dimensions.
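For concreteness, the simulated data of (30)-(31) can be generated as follows. This is a sketch of the stated model; the paper does not specify the design distribution beyond rescaling the data to $[0,1]^d$, so Uniform(0,1) covariates are assumed here.

```python
import numpy as np

def generate_simulation(n=150, p=200, seed=0):
    """Synthetic data from the additive model (30)-(31): four relevant
    components, p - 4 irrelevant dimensions, and N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))   # assumed design; the paper rescales to [0, 1]^d
    f1 = -2.0 * np.sin(2.0 * X[:, 0])
    f2 = X[:, 1] ** 2 - 1.0 / 3.0
    f3 = X[:, 2] - 0.5
    f4 = np.exp(-X[:, 3]) + np.exp(-1.0) - 1.0
    Y = f1 + f2 + f3 + f4 + rng.standard_normal(n)
    return X, Y
```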
The results of applying SpAM with plug-in bandwidths are summarized in Figure 2.

[Figure 2 appears here; the upper right panel compares p = 128 and p = 256.]

Figure 2: (Simulated data) Upper left: The empirical $\ell_2$ norm of the estimated components plotted against the tuning parameter $\lambda$; the value on the x-axis is proportional to $\sum_j \|\hat f_j\|_2$. Upper center: The $C_p$ scores against the tuning parameter $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ with the smallest $C_p$ score. Upper right: The proportion of 200 trials in which the correct relevant variables are selected, as a function of sample size $n$. Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero.

5.2 Boston Housing

The Boston housing data were collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. The dataset has been studied by many other authors (Härdle et al., 2004; Lin and Zhang, 2006), with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we add 20 irrelevant variables: ten of them are drawn randomly from Uniform(0, 1), and the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities. The full model (containing all 10 chosen covariates) for the Boston housing data is

$$\mathrm{medv} = \alpha + f_1(\mathrm{crim}) + f_2(\mathrm{indus}) + f_3(\mathrm{nox}) + f_4(\mathrm{rm}) + f_5(\mathrm{age}) + f_6(\mathrm{dis}) + f_7(\mathrm{tax}) + f_8(\mathrm{ptratio}) + f_9(\mathrm{b}) + f_{10}(\mathrm{lstat}). \qquad (32)$$

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 3. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim; the importance of the variables nox and b is borderline. These results are basically consistent with those obtained by other authors (Härdle et al., 2004). However, using $C_p$ as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

Figure 3: (Boston housing) Left: The empirical $\ell_2$ norm of the estimated components versus the regularization parameter $\lambda$. Center: The $C_p$ scores against $\lambda$; the dashed vertical line corresponds to the best $C_p$ score. Right: Additive fits for four relevant variables.

5.3 SpAM for Spam

Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3.1. This dataset has been studied by Hastie et al. (2001), using a set of 3,065 emails as a training set and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with $p = 57$ attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters. To demonstrate how SpAM performs with limited data, we sample only $n = 300$ emails as the training set, with the remaining 4,301 data points used as the test set. We also use the test data as the hold-out set to tune the penalization parameter $\lambda$.
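The hold-out tuning of $\lambda$ described above amounts to a simple grid search. The sketch below assumes a hypothetical fitting routine `fit(X, y, lam)` for logistic SpAM that returns an object with a `predict_proba(X)` method giving $\mathbb{P}(Y = 1 \mid X)$; neither name comes from the paper.

```python
import numpy as np

def tune_lambda(lams, fit, X_train, y_train, X_holdout, y_holdout):
    """Choose the penalization parameter by minimizing 0/1 error on a hold-out set."""
    best_lam, best_err = None, np.inf
    for lam in lams:
        model = fit(X_train, y_train, lam)
        y_pred = model.predict_proba(X_holdout) > 0.5    # threshold P(Y = 1 | X)
        err = np.mean(y_pred != y_holdout)               # hold-out misclassification rate
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# For example, lams = np.arange(2.0, 5.6, 0.5) * 1e-3 matches the grid of lambda
# values reported in Figure 4.
```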
λ (×10⁻³) | Error  | # zeros | Selected variables
5.5       | 0.2009 | 55      | {8, 54}
5.0       | 0.1725 | 51      | {8, 9, 27, 53, 54, 57}
4.5       | 0.1354 | 46      | {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
4.0       | 0.1083 | 20      | {4, 6–10, 14–22, 26, 27, 38, 53–58}
3.5       | 0.1117 | 0       | ALL
3.0       | 0.1174 | 0       | ALL
2.5       | 0.1251 | 0       | ALL
2.0       | 0.1259 | 0       | ALL

[Plot: empirical prediction error versus the penalization parameter λ.]

Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM.

The results of a typical run of logistic SpAM, using plug-in bandwidths, are summarized in Figure 4. It is interesting to note that even with this relatively small sample size, logistic SpAM recovers a sparsity pattern that is consistent with previous authors' results. For example, in the best model chosen by logistic SpAM according to error rate, the 33 selected variables cover 80% of the significant predictors as determined by Hastie et al. (2001).

References

Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971–988.

Härdle, W., Müller, M., Sperlich, S. and Werwatz, A. (2004). Nonparametric and Semiparametric Models. Springer-Verlag.

Hastie, T. and Tibshirani, R. (1999). Generalized Additive Models. Chapman & Hall.

Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.

Meinshausen, N. and Yu, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.

Wainwright, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Tech. Rep. 709, Department of Statistics, UC Berkeley.

Yuan, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Proceedings of AI and Statistics (AISTATS).

Zhao, P. and Yu, B. (2007). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2567.