Boosting with Structural Sparsity

John Duchi, Department of EECS, University of California, Berkeley, CA 94720 (JDUCHI@CS.BERKELEY.EDU)
Yoram Singer, Google, Mountain View, CA 94043 (SINGER@GOOGLE.COM)

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization and give an automatic stopping criterion for feature induction. We study penalties based on the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms of the predictor and introduce mixed-norm penalties that build upon the initial penalties. The mixed-norm regularizers facilitate structural sparsity in parameter space, which is a useful property in multiclass prediction and other related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.

1. Introduction and problem setting

Boosting is a highly effective and popular method for obtaining an accurate classifier from a set of inaccurate predictors. Boosting algorithms construct high-precision classifiers by taking a weighted combination of base predictors, known as weak hypotheses (see Meir and Rätsch (2003) and the numerous references therein). Many boosting algorithms can also be viewed as forward-greedy feature induction procedures. In this view, the weak learner provides new predictors which perform well either in terms of their error rate with respect to the distribution that boosting maintains or in terms of their potential for reducing the empirical loss. Once a feature is chosen, typically in a greedy manner, it is associated with a weight which remains intact through the remainder of the boosting process. The aesthetics and simplicity of AdaBoost and other forward-greedy algorithms, such as LogitBoost (Friedman et al., 2000), also facilitate a tacit defense against overfitting, especially when combined with early stopping of the boosting process (Zhang & Yu, 2005). The empirical success of boosting algorithms helped popularize the view that boosting algorithms are resilient to overfitting. However, several researchers have noted the deficiencies of forward-greedy boosting algorithms and suggested alternative coordinate descent algorithms, such as totally-corrective boosting (Warmuth et al., 2006) or Zhang's forward/backward algorithm (2008).

The algorithms that we present in this paper build on existing boosting and other coordinate descent procedures while incorporating, throughout their run, regularization of the weights of the selected features. The added regularization terms influence both the selection of new features and the weight assignment steps. Moreover, as we discuss below, the regularization may eliminate (by assigning a weight of zero) previously selected features. The result is a simple procedure that includes forward induction, weight estimation, and backward pruning, entertains convergence guarantees, and yields sparse models. We are also able to group features and impose structural sparsity on the learned weights, which is a focus and one of the main contributions of this paper.
Our starting point is a simple yet effective modification to AdaBoost that incorporates an $\ell_1$ penalty for the norm of the weight vector it constructs. The update we devise can be used both for weight optimization and for the induction of new accurate hypotheses while taking the resulting $\ell_1$-norm into account. A closely related approach was suggested by Dudík et al. (2007) in the context of maximum entropy, though our analysis for classification is a special case of an abstract saddle-point theorem that we prove. This general theorem is also applicable to other norms and losses, in particular the $\ell_\infty$ norm, which serves as a building block for imposing structural sparsity.

For simplicity, we assume that the class of weak hypotheses is finite and contains $n$ different hypotheses. We thus map each instance $x$ to an $n$-dimensional vector $(h_1(x), \ldots, h_n(x))$, and we overload notation and simply denote the vector as $x \in \mathbb{R}^n$. Though omitted due to lack of space, our framework can be used in the presence of countably infinite features, also known as the task of feature induction. Each instance $x_i$ is associated with a label $y_i \in \{-1, +1\}$. The goal of learning is then to find a weight vector $w$ such that the sign of $w \cdot x_i$ is equal to $y_i$. Moreover, we would like to attain large inner products so long as the predictions are correct. Formally, given a sample $S = \{(x_i, y_i)\}_{i=1}^m$, the algorithm focuses on finding $w$ for which the empirical logistic loss, $L(w) = \sum_{i=1}^m \log\left(1 + \exp(-y_i (w \cdot x_i))\right)$, is small. Our derivation also applies to the exp-loss.

We first adapt AdaBoost to incorporate the $\ell_1$-norm of the weight vector into the empirical loss,
\[
\sum_{i=1}^m \log\left(1 + \exp(-y_i (w \cdot x_i))\right) + \lambda \|w\|_1 .
\]
This problem is by no means new. It is often referred to as $\ell_1$-regularized logistic regression, and several advanced optimization methods have been designed for it (Koh et al., 2007). $\ell_1$-regularization has many advantages, including its ability to yield sparse weight vectors $w$ and, under certain conditions, to recover the true sparsity pattern of $w$ (Zhao & Yu, 2006). We extend this by replacing the $\ell_1$-norm with a mixed-norm regularizer (denoted $\ell_1/\ell_p$) to achieve structural sparsity. Mixed-norm regularization is used when there is a partition or structure over the weights that separates them into disjoint groups of parameters, and an $\ell_p$-norm ties each group. For concreteness, and in order to leverage existing boosting algorithms, we specifically focus on settings in which we have a matrix $W = [w_1 \cdots w_k] \in \mathbb{R}^{n \times k}$ of weight vectors, and we regularize the weights in each row of $W$ (denoted $\bar{w}_j$) together through an $\ell_p$-norm.

We derive updates for two important settings that generalize binary logistic regression. The first is multitask learning (Obozinski et al., 2007), in which we have a set of tasks $\{1, \ldots, k\}$ and a weight vector $w_r$ for each task. Our goal is to learn a matrix $W$ minimizing
\[
\sum_{i=1}^m \sum_{r=1}^k \log\left(1 + e^{-y_{i,r}(w_r \cdot x_i)}\right) + \lambda \sum_{j=1}^n \|\bar{w}_j\|_p . \qquad (1)
\]
The other generalization we describe is the multiclass logistic loss. We again assume there are $k$ weight vectors that operate on each instance. Given an example $x_i$, the classifier's prediction is a vector $[w_1 \cdot x_i, \ldots, w_k \cdot x_i]$, and the predicted class is the index of the inner product attaining the largest of the $k$ values, $\arg\max_r w_r \cdot x_i$. In this case, the regularized loss $Q(W)$ is
\[
\sum_{i=1}^m \log\Big(1 + \sum_{r \neq y_i} e^{w_r \cdot x_i - w_{y_i} \cdot x_i}\Big) + \lambda \sum_{j=1}^n \|\bar{w}_j\|_p . \qquad (2)
\]
We also give a new upper bound for the multiclass loss. Previous multiclass constructions for boosting assume that each base hypothesis provides a different prediction per class, so they are not directly applicable to the multiclass setting discussed in this paper, which allocates a dedicated predictor per class. Our result is an efficient multiclass and multitask boosting-based procedure with $\ell_1/\ell_\infty$ regularization. We then shift our focus to an alternative apparatus for coordinate descent with the log-loss that does not stem from the AdaBoost algorithm. In this approach we upper bound the log-loss by a quadratic function, terming the resulting update GradBoost as it updates coordinates in a fashion that follows the gradient. Similar to the generalization of AdaBoost, we study $\ell_1$ and $\ell_\infty$ penalties by reusing the fixed-point theorem, and we derive GradBoost updates for both $\ell_1/\ell_\infty$ and $\ell_1/\ell_2$ mixed-norm regularization.
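To make the notation concrete, here is a minimal sketch of ours (not code from the paper; the array shapes and names are illustrative assumptions) that evaluates the $\ell_1$-regularized empirical logistic loss defined above.

```python
import numpy as np

def l1_logistic_loss(w, X, y, lam):
    """Empirical logistic loss sum_i log(1 + exp(-y_i w.x_i)) plus lam * ||w||_1.

    X: (m, n) matrix whose columns are the weak-hypothesis outputs h_j(x_i),
    y: (m,) labels in {-1, +1}, w: (n,) weight vector, lam: regularization.
    """
    margins = y * (X @ w)
    # log(1 + exp(-z)) computed stably via logaddexp(0, -z)
    empirical = np.logaddexp(0.0, -margins).sum()
    return empirical + lam * np.abs(w).sum()
```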
It is clearly impossible to cover related work in depth, so we give here only a very brief overview. Our derivation follows the template-based algorithm of Collins et al. (2002), incorporating regularization in a way analogous to Dudík et al.'s maximum-entropy framework (2007). The base GradBoost algorithm we derive shares similarity with LogitBoost (Friedman et al., 2000), while our bounding technique was first suggested by Dekel et al. (2005). Learning sparse models through $\ell_1$ regularization is the focus of a voluminous amount of work in different research areas, from statistics to information theory (Zhao & Yu, 2006; Koh et al., 2007; Zhang, 2008). Multiple authors have studied the setting of mixed-norm regularization, which is of great interest in the statistical estimation literature, though the focus is typically on consistency for linear regression rather than efficient algorithms. Negahban and Wainwright (2008) recently analyzed sparsity through $\ell_1/\ell_\infty$ mixed norms, and Obozinski et al. (2007) analyze $\ell_1/\ell_2$.

2. AdaBoost with $\ell_1$ regularization

We now outline our $\ell_1$-infused modification to AdaBoost, providing only a sketch, since the algorithm can be obtained as a special case of the analysis presented in Sec. 3. We build on existing analyses of AdaBoost, which all derive upper bounds on the loss which the booster then minimizes. We can then rely on the fact that each round of boosting is guaranteed to decrease the penalized loss. In the generalized version of boosting (Collins et al., 2002), the booster selects a vector $a$ from a set of templates $\mathcal{A}$ on each round of boosting. The template selects the set of base hypotheses whose weights we update. Moreover, the template vector can specify a different budget for each feature update so long as the vector $a$ satisfies the condition $\sum_j a_j |x_{i,j}| \le 1$. Classical boosting sets a single coordinate in the vector $a$ to a non-zero value.

We start by recalling the progress bound for AdaBoost with the log-loss when using a template vector. Define importance weights $q^t(i) = 1/(1 + e^{y_i w^t \cdot x_i})$, which are the probability the current classifier assigns to the incorrect label for example $i$, and weighted correlations
\[
\mu_j^+ = \sum_{i : y_i x_{i,j} > 0} q^t(i)\,|x_{i,j}| \,, \qquad
\mu_j^- = \sum_{i : y_i x_{i,j} < 0} q^t(i)\,|x_{i,j}| \,.
\]
Let $w^{t+1} = w^t + \delta^t$ with $\delta_j^t = a_j d_j^t$, and let $a$ satisfy $\sum_j a_j |x_{i,j}| \le 1$. Then the change in the log-loss, $\Delta^t = L(w^t) - L(w^{t+1})$, between two iterations of boosting is lower bounded by (Collins et al., 2002)
\[
\sum_{j=1}^n a_j \left[ \mu_j^+ \big(1 - e^{-d_j^t}\big) + \mu_j^- \big(1 - e^{d_j^t}\big) \right] .
\]
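As a quick illustration of these quantities, the following sketch of ours (not the authors' code) computes the importance weights, the correlations $\mu_j^\pm$, and the resulting progress lower bound for a candidate step $d$; the template $a$ and the step $d$ are assumed to be supplied by the caller.

```python
import numpy as np

def correlations(w, X, y):
    """Importance weights q(i) and weighted correlations mu_j^+, mu_j^- (binary log-loss)."""
    q = 1.0 / (1.0 + np.exp(y * (X @ w)))          # probability of the wrong label
    signed = y[:, None] * X                         # y_i * x_{i,j}
    mu_plus = (q[:, None] * np.abs(X) * (signed > 0)).sum(axis=0)
    mu_minus = (q[:, None] * np.abs(X) * (signed < 0)).sum(axis=0)
    return q, mu_plus, mu_minus

def progress_lower_bound(a, d, mu_plus, mu_minus):
    """Lower bound on L(w^t) - L(w^{t+1}) for the step delta_j = a_j * d_j."""
    return np.sum(a * (mu_plus * (1 - np.exp(-d)) + mu_minus * (1 - np.exp(d))))
```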
Since the $\ell_1$ penalty is an additive term, we incorporate the change in the $\ell_1$-norm of $w$ to bound the overall decrease in the regularized loss when updating $w^t$ to $w^t + \delta^t$: the decrease is at least $\Delta^t - \lambda\|w^t + \delta^t\|_1 + \lambda\|w^t\|_1$. By construction, this bound is additive in $\delta_j$. Thus, we omit the index $j$, eliminate constants, and are left with the following minimization problem in $\delta$:
\[
\min_{\delta} \; a\mu^+ e^{-\delta/a} + a\mu^- e^{\delta/a} + \lambda|\delta + w| . \qquad (3)
\]
We state two lemmas that aid us in finding the optimum of Eq. (3). The lemmas are special cases of Thm. 1 in the sequel.

Lemma 2.1. If $\mu^+ e^{w/a} - \mu^- e^{-w/a} > 0$, then the minimizing $\delta$ of Eq. (3) satisfies $\delta + w \ge 0$. Likewise, if $\mu^+ e^{w/a} - \mu^- e^{-w/a} < 0$, then $\delta + w \le 0$.

Lemma 2.2. The solution of Eq. (3) with respect to $\delta$ is $\delta = -w$ if and only if $\left|\mu^+ e^{w/a} - \mu^- e^{-w/a}\right| \le \lambda$.

Equipped with the above lemmas, the update to $w_j^{t+1}$ is straightforward to derive. Let us assume without loss of generality that $\mu_j^+ e^{w_j^t/a_j} - \mu_j^- e^{-w_j^t/a_j} > \lambda$, so that $\delta_j \neq -w_j^t$ and $\delta_j + w_j^t > 0$. We need to solve the equation $-\mu_j^+ e^{-\delta_j/a_j} + \mu_j^- e^{\delta_j/a_j} + \lambda = 0$, or $\mu_j^- \alpha^2 + \lambda\alpha - \mu_j^+ = 0$, where $\alpha = e^{\delta_j/a_j}$. Since $\alpha$ is strictly positive, it is equal to the positive root of the above quadratic equation, yielding
\[
\delta_j = a_j \log\!\left( \frac{-\lambda + \sqrt{\lambda^2 + 4\mu_j^+\mu_j^-}}{2\mu_j^-} \right) . \qquad (4)
\]
In the symmetric case, when $\delta_j + w_j^t < 0$, we get $\delta_j = a_j \log\!\big( (\lambda + \sqrt{\lambda^2 + 4\mu_j^+\mu_j^-}) / (2\mu_j^-) \big)$. Finally, when $|\mu_j^+ e^{w_j^t/a_j} - \mu_j^- e^{-w_j^t/a_j}| \le \lambda$, Lemma 2.2 implies that $\delta_j = -w_j^t$. When $\mu_j^-$ is zero and $\mu_j^+ e^{w_j^t/a_j} > \lambda$, the solution is $\delta_j = a_j \log(\mu_j^+/\lambda)$, and analogously for $\mu_j^+ = 0$.
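Putting the pieces together, a single-coordinate version of the $\ell_1$-regularized AdaBoost update can be sketched as below. This is our illustrative code (it assumes the correlations helper from the previous snippet), and the handling of the degenerate $\mu^\pm = 0$ cases follows the text.

```python
import numpy as np

def l1_adaboost_coordinate_step(w_j, a_j, mu_plus, mu_minus, lam):
    """Return delta_j minimizing a*mu+*exp(-d/a) + a*mu-*exp(d/a) + lam*|d + w_j| (Eq. 3)."""
    score = mu_plus * np.exp(w_j / a_j) - mu_minus * np.exp(-w_j / a_j)
    if abs(score) <= lam:                      # Lemma 2.2: zero out the coordinate
        return -w_j
    if score > lam:                            # solution has delta_j + w_j > 0
        if mu_minus == 0.0:
            return a_j * np.log(mu_plus / lam)
        root = (-lam + np.sqrt(lam**2 + 4 * mu_plus * mu_minus)) / (2 * mu_minus)
    else:                                      # symmetric case, delta_j + w_j < 0
        if mu_plus == 0.0:
            return -a_j * np.log(mu_minus / lam)
        root = (lam + np.sqrt(lam**2 + 4 * mu_plus * mu_minus)) / (2 * mu_minus)
    return a_j * np.log(root)                  # Eq. (4) and its symmetric counterpart
```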
3. Incorporating $\ell_\infty$ regularization

We now begin to lay the framework for multitask and multiclass boosting with mixed-norm regularizers. The main theorem in this section can also be used to derive and analyze the algorithm of the previous section. Before deriving updates for boosting, we consider a more general framework of minimizing a sum of one-dimensional, convex, bounded below, and differentiable functions $f_j(d)$ plus an $\ell_\infty$-regularizer. We want to solve
\[
\min_{d} \; \sum_{j=1}^k f_j(d_j) + \lambda \|d\|_\infty . \qquad (5)
\]
The following theorem characterizes the solution $d^\star$ of Eq. (5) and leads to efficient algorithms for specific functions $f_j$. We use $[k]$ as a shorthand for the set $\{1, 2, \ldots, k\}$.

Theorem 1. Let $\tilde{d}_j$ satisfy $f_j'(\tilde{d}_j) = 0$. The optimal solution $d^\star$ of Eq. (5) satisfies the following properties:
(i) $d^\star = 0$ if and only if $\sum_{j=1}^k |f_j'(0)| \le \lambda$.
(ii) For all $j$, $f_j'(0)\, d_j^\star \le 0$ and $f_j'(d_j^\star)\, d_j^\star \le 0$.
(iii) Let $B = \{ j : |d_j^\star| = \|d^\star\|_\infty \}$ and $U = [k] \setminus B$. Then:
  (a) for all $j \in U$, $d_j^\star = \tilde{d}_j$ and $f_j'(d_j^\star) = 0$;
  (b) for all $j \in B$, $|\tilde{d}_j| \ge |d_j^\star| = \|d^\star\|_\infty$;
  (c) when $d^\star \neq 0$, $\sum_{j \in B} |f_j'(d_j^\star)| = \lambda$.

Sketch of proof. We sketch a proof of the theorem using subgradient calculus (Bertsekas, 1999). For point (i), we note that the subgradient set of $\lambda\|d\|_\infty$ evaluated at $d = 0$ is the set of vectors $\{z : \|z\|_1 \le \lambda\}$. Thus, we look at the $\ell_1$-norm of the gradient of the sum of functions at $d = 0$; if $\sum_{j=1}^k |f_j'(0)| \le \lambda$ then $d^\star = 0$, and vice versa. Point (ii) is a consequence of the monotonicity of the derivatives (or subgradient sets) of convex functions (the derivatives are non-decreasing). For point (iii), we note that if $d^\star = 0$, then (a), (b), and (c) are trivially satisfied. If not, then consider the subgradient set of the $\ell_\infty$ norm:
\[
\partial \|d^\star\|_\infty = \mathrm{Co}\left\{ \mathrm{sign}(d_i^\star)\, e_i : |d_i^\star| = \|d^\star\|_\infty \right\} .
\]
For part (a), any $j \in U$ must have derivative $f_j'(d_j^\star) = 0$ to satisfy the subgradient conditions for optimality. For part (b), we note that if $j \in B$ and $|\tilde{d}_j| < |d_j^\star|$, we can move $d_j^\star$ toward $\tilde{d}_j$ while decreasing or not changing $\|d^\star\|_\infty$, and we decrease $f_j(d_j^\star)$. Part (c) is another consequence of the subgradient conditions for optimality.

In Fig. 1, we present a general algorithm for minimizing $\sum_j f_j(d_j) + \lambda\|d\|_\infty$ that builds directly on Thm. 1. The algorithm flips signs so that all $d_j \ge 0$ (see part (ii) of the theorem). It then iteratively adds points to the bounded set $B$, starting from the point with the largest unregularized solution $\tilde{d}_{(1)}$ (see part (iii)). When the algorithm finds a set $B$ and bound $\theta = \|d\|_\infty$ such that part (iii) of the theorem is satisfied (termination is guaranteed by (iii.b)), it terminates. The algorithm naïvely has runtime complexity $O(k^2)$, which we bring down to $O(k \log k)$ in the sequel by exploiting the specific structure of $f_j$.

  INPUT: Functions $\{f_r\}_{r=1}^k$, regularization $\lambda$
  IF $\sum_{r=1}^k |f_r'(0)| \le \lambda$ RETURN $d^\star = 0$
  // Find sign of optimal solutions
  SET $s_r = -\mathrm{sign}(f_r'(0))$
  // Get ordering of unregularized solutions
  SOLVE $\tilde{d}_r = \arg\min_d f_r(s_r d)$
  // We have $\tilde{d}_{(1)} \ge \tilde{d}_{(2)} \ge \ldots \ge \tilde{d}_{(k)}$
  SORT $\{\tilde{d}_r\}$ (descending) into $\{\tilde{d}_{(r)}\}$; set $\tilde{d}_{(k+1)} = 0$
  FOR $l = 1$ to $k$
    SOLVE for $\theta$ such that $\sum_{i=1}^{l} s_{(i)}\, f_{(i)}'(s_{(i)}\theta) = -\lambda$
    IF $\theta \ge \tilde{d}_{(l+1)}$ BREAK
  RETURN $d^\star$ such that $d_r^\star = s_r \min\{\tilde{d}_r, \theta\}$

  Figure 1. Algorithm for minimizing $\sum_r f_r(d_r) + \lambda\|d\|_\infty$.

Revisiting AdaBoost with $\ell_1$-regularization. Lemmas 2.1 and 2.2 are special cases of the theorem. Recall the $\ell_1$-regularized minimization problem we faced for the exponential loss in Eq. (3): we had to minimize a function of the form $a\mu^+ e^{-\delta/a} + a\mu^- e^{\delta/a} + \lambda|\delta + w|$. Replacing $\delta + w$ with $\delta$, this minimization problem is equivalent to minimizing $a\mu^+ e^{w/a} e^{-\delta/a} + a\mu^- e^{-w/a} e^{\delta/a} + \lambda|\delta|$, a one-dimensional version of the problem considered in Thm. 1.
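The following Python sketch (ours; not code from the paper) implements the procedure of Fig. 1 for caller-supplied derivatives $f_r'$ and unregularized minimizer magnitudes $\tilde d_r$, using simple bisection for the inner one-dimensional solve. It is meant only to make the control flow of the algorithm concrete; for the exponential functions arising in boosting, the inner solve has a closed form that is exploited in the next section.

```python
import numpy as np

def linf_regularized_argmin(f_prime, d_tilde, lam, tol=1e-10):
    """Minimize sum_r f_r(d_r) + lam * max_r |d_r| following Fig. 1.

    f_prime: list of callables, f_prime[r](x) = f_r'(x)
    d_tilde: finite magnitudes of the unregularized minimizers, d_tilde[r] >= 0
    """
    k = len(f_prime)
    grad0 = np.array([fp(0.0) for fp in f_prime])
    if np.abs(grad0).sum() <= lam:                      # Theorem 1(i)
        return np.zeros(k)
    s = -np.sign(grad0)                                 # sign of each optimal coordinate
    d_tilde = np.asarray(d_tilde, dtype=float)
    order = np.argsort(-d_tilde)                        # descending unregularized magnitudes
    d_sorted = np.append(d_tilde[order], 0.0)

    def balance(theta, active):
        # sum over the bounded set of s_r * f_r'(s_r * theta), shifted by lam
        return sum(s[r] * f_prime[r](s[r] * theta) for r in active) + lam

    theta = 0.0
    for l in range(k):
        active = order[: l + 1]
        if balance(0.0, active) >= 0.0:                 # root is at or below zero; clamp
            theta = 0.0
        else:                                           # bisection on [0, largest magnitude]
            lo, hi = 0.0, d_sorted[0]
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if balance(mid, active) < 0.0:
                    lo = mid
                else:
                    hi = mid
            theta = 0.5 * (lo + hi)
        if theta >= d_sorted[l + 1]:                    # bound reaches the next unregularized solution
            break
    return s * np.minimum(d_tilde, theta)
```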
4. AdaBoost with $\ell_1/\ell_\infty$ regularization

In this section we present generalizations of AdaBoost to the multitask and multiclass problems with mixed-norm regularization given in Eq. (1) and Eq. (2). We start by deriving boosting-style updates for the mixed-norm multitask loss of Eq. (1) with $p = \infty$. The multitask loss is decomposable into sums of losses, one per task. Hence, for each separate task we obtain the same bound as in the binary classification case. However, we now need to update rows $\bar{w}_j$ of the matrix $W$ while taking into account the mixed-norm penalty. Given a row $j$, we calculate importance weights $q^t(i, r)$ for each example $i$ and task $r$ as the probability the current weights assign the incorrect label, $q^t(i, r) = 1/(1 + \exp(y_{i,r}\, w_r \cdot x_i))$, and the correlations $\mu^\pm_{r,j}$ for each task $r$ as $\mu^+_{r,j} = \sum_{i: y_{i,r} x_{i,j} > 0} q^t(i, r)|x_{i,j}|$ and $\mu^-_{r,j} = \sum_{i: y_{i,r} x_{i,j} < 0} q^t(i, r)|x_{i,j}|$. Defining $\delta_j = [\delta_{j,1} \cdots \delta_{j,k}]$ and applying the progress bound for binary classification, we see that when we perform the update $W^{t+1} = W^t + [\delta^t_1 \cdots \delta^t_n]^\top$, we can lower bound the change in the loss, $\Delta^t = L(W^t) - L(W^{t+1})$, with
\[
\sum_{j,r} a_j \left[ \mu^+_{r,j} \big(1 - e^{-\delta_{j,r}/a_j}\big) + \mu^-_{r,j} \big(1 - e^{\delta_{j,r}/a_j}\big) \right] .
\]
As before, the template vector should satisfy the constraint that $\sum_j a_j |x_{i,j}| \le 1$ for all $i$.

For the multiclass objective from Eq. (2), we change the definition of the importance weights and correlations $\mu^\pm_{r,j}$, which gives us a new bound on the change in the multiclass loss. The following lemma extends the boosting bounds of Collins et al. (2002).

Lemma 4.1 (Multiclass boosting progress bound). Let $q^t(i, r)$ denote the importance weight for each example index $i$ and class index $r \in \{1, \ldots, k\}$, where
\[
q^t(i, r) = \frac{\exp(w^t_r \cdot x_i)}{\sum_l \exp(w^t_l \cdot x_i)} .
\]
Define the importance-weighted correlations as
\[
\mu^+_{r,j} = \sum_{i: y_i = r,\, x_{i,j} > 0} \big(1 - q^t(i, y_i)\big)|x_{i,j}| + \sum_{i: y_i \neq r,\, x_{i,j} < 0} q^t(i, r)|x_{i,j}| \,,
\]
\[
\mu^-_{r,j} = \sum_{i: y_i = r,\, x_{i,j} < 0} \big(1 - q^t(i, y_i)\big)|x_{i,j}| + \sum_{i: y_i \neq r,\, x_{i,j} > 0} q^t(i, r)|x_{i,j}| \,.
\]
Let the update to row $j$ of $W^t$ be $\bar{w}^{t+1}_j = \bar{w}^t_j + \delta^t_j$, and let $a$ satisfy $\max_i \sum_{j=1}^n a_j |x_{i,j}| \le \frac{1}{2}$. Then the change in the multiclass loss, $\Delta^t = L(W^t) - L(W^{t+1})$, is lower bounded by
\[
\sum_{j=1}^n \sum_{r=1}^k a_j \left[ \mu^+_{r,j} \big(1 - e^{-\delta_{j,r}/a_j}\big) + \mu^-_{r,j} \big(1 - e^{\delta_{j,r}/a_j}\big) \right] .
\]

The boosting bounds we derived for the multitask and multiclass losses are syntactically identical, differing only in their computation of the importance weights and correlations. These similarities allow us to attack the update for both problems together, deriving one efficient algorithm for $\ell_1/\ell_\infty$-regularized boosting based on a corollary to Thm. 1.

Adding the $\ell_\infty$-regularization terms to the multiclass and multitask losses, we have that the change in the objective is $\Delta^t - \lambda \sum_{j=1}^n \|\bar{w}^t_j + \delta_j\|_\infty + \lambda \sum_{j=1}^n \|\bar{w}^t_j\|_\infty$. For simplicity of our derivation, we focus on updating a single row $j$ of $W$, and we temporarily assume that $\bar{w}^t_j = 0$. We make the substitution $a_j d_r = \delta_{j,r}$. The update to $\bar{w}_j$ is now given by the solution of the following minimization problem:
\[
\min_{d} \; \sum_{r=1}^k \left[ \mu^+_r e^{-d_r} + \mu^-_r e^{d_r} \right] + \lambda \|d\|_\infty . \qquad (6)
\]
First, we note that the objective function of Eq. (6) is separable in $d$ apart from the $\ell_\infty$-regularization term. Second, the derivative of each of the terms, dropping the regularizer, is $-\mu^+_r e^{-d_r} + \mu^-_r e^{d_r}$. Third, the unregularized minimizers are $\tilde{d}_r = \frac{1}{2}\log(\mu^+_r/\mu^-_r)$ (where we allow $\tilde{d}_r = \pm\infty$). We immediately have the following corollary to Thm. 1.

Corollary 4.1. The minimizer $d^\star$ of Eq. (6) is $d^\star = 0$ if and only if $\sum_{r=1}^k |\mu^+_r - \mu^-_r| \le \lambda$. Assume w.l.o.g. that $\mu^+_r \ge \mu^-_r$. When $d^\star \neq 0$, there are sets $B = \{r : |d^\star_r| = \|d^\star\|_\infty\}$ and $U = [k] \setminus B$ such that
(a) for all $r \in U$, $\mu^+_r e^{-d^\star_r} - \mu^-_r e^{d^\star_r} = 0$;
(b) for all $r \in B$, $|\tilde{d}_r| \ge |d^\star_r| = \|d^\star\|_\infty$;
(c) $\sum_{r \in B} \big( \mu^+_r e^{-d^\star_r} - \mu^-_r e^{d^\star_r} \big) = \lambda$.

Based on the corollary, we can derive an efficient procedure that first sorts the indices in $[k]$ by the magnitude of the unregularized solution $\tilde{d}_r$ (we can assume that $\mu^+_r \ge \mu^-_r$ and flip signs at the end, as in Fig. 1), then solves the following equation for each $\ell \in [k]$:
\[
\sum_{r: \tilde{d}_r \ge \tilde{d}_{(\ell)}} \mu^+_r\, e^{-d} \;-\; \sum_{r: \tilde{d}_r \ge \tilde{d}_{(\ell)}} \mu^-_r\, e^{d} \;-\; \lambda \;=\; 0 . \qquad (7)
\]
This process continues until we find an index $\ell$ such that the solution $d^\star$ of Eq. (7) satisfies $d^\star \ge \tilde{d}_{(\ell+1)}$, where $\tilde{d}_{(\ell)}$ is the $\ell$-th largest unregularized solution. To develop an efficient algorithm, define $M^\pm_\ell = \sum_{r: \tilde{d}_r \ge \tilde{d}_{(\ell)}} \mu^\pm_r$. To solve Eq. (7) for each $\ell$, applying the reasoning that led to Eq. (4) gives
\[
d^\star = \log\!\left( \frac{-\lambda + \sqrt{\lambda^2 + 4 M^+_\ell M^-_\ell}}{2 M^-_\ell} \right) . \qquad (8)
\]
When $M^-_\ell = 0$, we get $d^\star = \log(M^+_\ell/\lambda)$. We can use Eq. (8) successively in the algorithm of Fig. 1 by setting $M^\pm_{\ell+1} = M^\pm_\ell + \mu^\pm_{(\ell+1)}$. To recap, by sorting $\tilde{d}_r = \frac{1}{2}\log(\mu^+_r/\mu^-_r)$ and incrementally updating $M^\pm_\ell$, we can use the algorithm of Fig. 1 to solve the extensions of AdaBoost with $\ell_1/\ell_\infty$-regularization in time $O(k \log k)$.
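A compact sketch of this $O(k \log k)$ row update, assuming $\bar w^t_j = 0$ and $\mu^+_r \ge \mu^-_r$ after sign-flipping as in the text, is given below. The code is our illustration rather than the authors' implementation.

```python
import numpy as np

def linf_row_update(mu_plus, mu_minus, lam, eps=1e-12):
    """Solve min_d sum_r [mu+_r e^{-d_r} + mu-_r e^{d_r}] + lam*||d||_inf (Eq. 6) for w_j = 0.

    Assumes mu_plus[r] >= mu_minus[r] >= 0 for every r (flip signs beforehand otherwise).
    """
    k = len(mu_plus)
    if np.sum(np.abs(mu_plus - mu_minus)) <= lam:        # Corollary 4.1: zero update
        return np.zeros(k)
    # Unregularized solutions d~_r = 0.5*log(mu+_r/mu-_r), sorted in decreasing order.
    d_tilde = 0.5 * np.log(np.maximum(mu_plus, eps) / np.maximum(mu_minus, eps))
    order = np.argsort(-d_tilde)
    d_sorted = np.append(d_tilde[order], 0.0)
    M_plus = M_minus = 0.0
    d_star = 0.0
    for l in range(k):                                   # incrementally grow the bounded set
        M_plus += mu_plus[order[l]]
        M_minus += mu_minus[order[l]]
        if M_minus <= eps:                               # degenerate case of Eq. (8)
            d_star = np.log(M_plus / lam)
        else:                                            # Eq. (8)
            d_star = np.log((-lam + np.sqrt(lam**2 + 4 * M_plus * M_minus)) / (2 * M_minus))
        if d_star >= d_sorted[l + 1]:                    # bound reaches the next unregularized solution
            break
    return np.minimum(d_tilde, d_star)
```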
It remains to show how to solve the more general update when $\bar{w}_j \neq 0$. In particular, we would like to find the minimum of
\[
\sum_{r=1}^k a \left[ \mu^+_r e^{-d_r} + \mu^-_r e^{d_r} \right] + \lambda \|\bar{w} + a\, d\|_\infty . \qquad (9)
\]
We can make the transformation $\nu_r = w_r/a + d_r$, which reduces our problem to finding the minimizer of $\sum_{r=1}^k \left[ \mu^+_r e^{w_r/a} e^{-\nu_r} + \mu^-_r e^{-w_r/a} e^{\nu_r} \right] + \lambda\|\nu\|_\infty$ with respect to $\nu$. This minimization problem can be solved by using the same sorting-based approach, then recovering $d = \nu - \bar{w}/a$. Combining our reasoning for the multiclass and multitask losses, we obtain an algorithm that solves both problems by appropriately setting $\mu^\pm_{r,j}$ and using the algorithm of Fig. 1 to update rows of $W$. As the two cases are so similar, we present the algorithm only for the multiclass loss, in Fig. 2.

  INPUT: Training set $S = \{(x_i, y_i)\}_{i=1}^m$, regularization $\lambda$, number of rounds $T$,
         update templates $\mathcal{A} \subset \mathbb{R}^n_+$ s.t. $\forall a \in \mathcal{A}$, $\max_i \sum_{j=1}^n a_j |x_{i,j}| \le \frac{1}{2}$
  FOR $t = 1$ to $T$
    CHOOSE $j \in \{1, \ldots, n\}$
    FOR $i = 1$ to $m$ and $r = 1$ to $k$
      SET $q^t(i, r) = \exp(w_r \cdot x_i) / \sum_l \exp(w_l \cdot x_i)$
    FOR $r = 1$ to $k$
      SET $\mu^+_{r,j} = \sum_{i: y_i = r, x_{i,j} > 0} (1 - q^t(i, y_i))|x_{i,j}| + \sum_{i: y_i \neq r, x_{i,j} < 0} q^t(i, r)|x_{i,j}|$
      SET $\mu^-_{r,j} = \sum_{i: y_i = r, x_{i,j} < 0} (1 - q^t(i, y_i))|x_{i,j}| + \sum_{i: y_i \neq r, x_{i,j} > 0} q^t(i, r)|x_{i,j}|$
    MINIMIZE for $\delta_j \in \mathbb{R}^k$ (use the algorithm of Fig. 1)
      $\sum_r a_j \left[ \mu^+_{r,j} e^{-\delta_{j,r}/a_j} + \mu^-_{r,j} e^{\delta_{j,r}/a_j} \right] + \lambda \|\bar{w}^t_j + \delta_j\|_\infty$
    UPDATE $W^{t+1} = W^t + [\delta_1 \cdots \delta_n]^\top$

  Figure 2. AdaBoost for the $\ell_1/\ell_\infty$-regularized multiclass loss.

5. GradBoost with $\ell_1$ and $\ell_1/\ell_2$ regularization

In this section we shift our attention to a lesser-used approach and derive additive updates for the logistic loss with quadratic upper bounds based on those of Dekel et al. (2005). Concretely, we use bounds of the form
\[
L(w + \delta e_j) \le L(w) + \nabla L(w) \cdot \delta e_j + \tfrac{1}{2}\, \delta e_j \cdot D\, \delta e_j ,
\]
where $D$ upper bounds $\nabla^2 L(w)$ (in the binary case, $D = \mathrm{diag}\big(\tfrac{1}{4}\sum_{i=1}^m x_{i,j}^2\big)$). We term these methods GradBoost for their use of gradients and bounds on the Hessian. We make no claim about whether the resulting algorithms entertain the weak-to-strong learnability properties of AdaBoost. The quadratic bounds allow us to perform boosting-style steps with $\ell_2$ and $\ell_2^2$ regularization in addition to the regularizers studied above.

We start with the binary logistic loss. GradBoost, similar to AdaBoost, can use a template vector $a$ to parameterize updates. For cleanliness, however, we focus on single-coordinate updates. The following lemma gives a progress bound for GradBoost.

Lemma 5.1 (GradBoost progress bound). Let $g$ denote the gradient of the logistic loss, $g_j = -\mu^+_j + \mu^-_j$. Let $a_j = 1/\sum_i x_{i,j}^2$ and $w^{t+1} = w^t + \delta^t e_j$ with $\delta^t = a_j d^t$. Then the change, $\Delta^t = L(w^t) - L(w^{t+1})$, in the logistic loss between iterations of GradBoost is lower bounded by
\[
\Delta^t \ge -g_j \delta^t - \frac{(\delta^t)^2}{8 a_j} = -a_j\left( g_j d^t + \frac{(d^t)^2}{8} \right) .
\]

To derive a usable bound for GradBoost with $\ell_1$ regularization, we replace the progress bound in Lemma 5.1 with a bound that depends on $w^{t+1}$ and $w^t$ by substituting $w^{t+1} - w^t$ for $\delta^t$. Incorporating $\ell_1$-regularization, we get that $Q(w^{t+1}) - Q(w^t)$ is upper bounded by
\[
g_j \big(w^{t+1}_j - w^t_j\big) + \frac{1}{8 a_j}\big(w^{t+1}_j - w^t_j\big)^2 + \lambda\big|w^{t+1}_j\big| - \lambda\big|w^t_j\big| . \qquad (10)
\]
The bound above is separable, so Thm. 1 gives the update
\[
w^{t+1}_j = \mathrm{sign}\big(w^t_j - 4 a_j g_j\big)\, \Big[ \big|w^t_j - 4 a_j g_j\big| - 4 a_j \lambda \Big]_+ . \qquad (11)
\]
We can use Eq. (11) to derive a GradBoost algorithm for the $\ell_1$-regularized logistic loss. The algorithm is omitted as we give a general multiclass version in the sequel. It is also possible to use Thm. 1 to obtain new coordinate descent methods for losses with $\ell_1/\ell_\infty$ regularization. Due to lack of space we omit the details but report the performance of these variants in Sec. 6.
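The $\ell_1$ GradBoost step of Eq. (11) is a soft-thresholding of the unregularized quadratic-bound minimizer; a minimal sketch of ours, with the gradient $g_j = -\mu^+_j + \mu^-_j$ and the normalization $a_j = 1/\sum_i x_{i,j}^2$ supplied by the caller as in Lemma 5.1, looks as follows.

```python
import numpy as np

def gradboost_l1_step(w_j, g_j, a_j, lam):
    """Eq. (11): soft-threshold the quadratic-bound minimizer w_j - 4*a_j*g_j by 4*a_j*lam."""
    u = w_j - 4.0 * a_j * g_j
    return np.sign(u) * max(abs(u) - 4.0 * a_j * lam, 0.0)
```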
One form of regularization that has not been considered in the standard boosting literature is $\ell_2$ or $\ell_2^2$ regularization. The lack thereof is a consequence of AdaBoost's exponential bounds on the loss. GradBoost, however, can straightforwardly incorporate $\ell_2$-based penalties, since it instead uses linear and quadratic bounds on the decrease in the loss. We focus on multiclass GradBoost, as modifications for multitask follow the lines of derivation discussed thus far. We focus particularly on mixed-norm $\ell_1/\ell_2$ regularization (Obozinski et al., 2007), in which rows of the matrix $W = [w_1 \cdots w_k]$ are regularized together through an $\ell_2$-norm. This leads to the following modification of the multiclass objective $Q(W)$ from Eq. (2):
\[
\sum_{i=1}^m \log\Big(1 + \sum_{r \neq y_i} e^{w_r \cdot x_i - w_{y_i} \cdot x_i}\Big) + \lambda \sum_{j=1}^n \|\bar{w}_j\|_2 . \qquad (12)
\]
Generalizing the GradBoost progress bound while using the normalization $a_j = 1/\sum_i x_{i,j}^2$ as before (omitting details), we upper bound $Q(W^{t+1}) - Q(W^t)$ by
\[
\sum_{r=1}^k \left( g_{r,j} - \frac{w^t_{j,r}}{2 a_j} \right) w^{t+1}_{j,r}
+ \frac{1}{4 a_j} \sum_{r=1}^k \big(w^{t+1}_{j,r}\big)^2
+ \sum_{r=1}^k \left( \frac{(w^t_{j,r})^2}{4 a_j} - g_{r,j}\, w^t_{j,r} \right)
+ \lambda\big( \|\bar{w}^{t+1}_j\|_2 - \|\bar{w}^t_j\|_2 \big) , \qquad (13)
\]
where $g_{r,j}$ is the derivative of the multiclass loss with respect to the $j$-th weight for class $r$. The above bound is evidently a separable quadratic function with $\ell_2$-regularization. We would like to use Eq. (13) to perform block coordinate descent on the $\ell_1/\ell_2$-regularized loss $Q$ from Eq. (12). Defining the gradient vector $\bar{g}_j = [g_{1,j} \cdots g_{k,j}]$ and using basic properties from convex analysis, we obtain that the update performed to minimize Eq. (13) with respect to row $\bar{w}_j$ is
\[
\bar{w}^{t+1}_j = \big( \bar{w}^t_j - 2 a_j \bar{g}_j \big) \left[ 1 - \frac{2 a_j \lambda}{\|\bar{w}^t_j - 2 a_j \bar{g}_j\|_2} \right]_+ . \qquad (14)
\]
To recap, we obtain an algorithm for minimizing the $\ell_1/\ell_2$-regularized multiclass loss by iteratively choosing indices $j$ and then applying the update of Eq. (14). The algorithm is given in Fig. 3.

  INPUT: Training set $S = \{(x_i, y_i)\}_{i=1}^m$, regularization $\lambda$, number of rounds $T$
  FOR $t = 1$ to $T$
    CHOOSE $j \in \{1, \ldots, n\}$
    SET $a_j = 1/\sum_i x_{i,j}^2$
    FOR $i = 1$ to $m$ and $r = 1$ to $k$
      // Compute importance weights for each class
      SET $q^t(i, r) = \exp(w_r \cdot x_i) / \sum_l \exp(w_l \cdot x_i)$
    FOR $r = 1$ to $k$
      // Compute gradient terms
      SET $g_{r,j} = \sum_{i=1}^m \big( q^t(i, r) - 1\{r = y_i\} \big) x_{i,j}$
    SET $\bar{g}_j = [g_{1,j} \cdots g_{k,j}]$
    UPDATE $\bar{w}^{t+1}_j = \big( \bar{w}^t_j - 2 a_j \bar{g}_j \big) \Big[ 1 - \frac{2 a_j \lambda}{\|\bar{w}^t_j - 2 a_j \bar{g}_j\|_2} \Big]_+$

  Figure 3. GradBoost for the $\ell_1/\ell_2$-regularized multiclass loss.
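The row update of Eq. (14) is a group (vector-valued) soft-threshold. Below is a brief sketch of ours, with the per-class gradients $g_{r,j}$ computed as in Fig. 3; labels are assumed to be integer class indices.

```python
import numpy as np

def gradboost_l1l2_row_update(w_row, g_row, a_j, lam):
    """Eq. (14): group soft-threshold of the unregularized minimizer w_row - 2*a_j*g_row."""
    u = w_row - 2.0 * a_j * g_row
    norm = np.linalg.norm(u)
    if norm <= 2.0 * a_j * lam:                      # whole row is zeroed (back-pruned)
        return np.zeros_like(u)
    return u * (1.0 - 2.0 * a_j * lam / norm)

def multiclass_gradients(W, X, y, j):
    """g_{r,j} = sum_i (q(i, r) - 1{r = y_i}) x_{i,j} for the multiclass logistic loss."""
    scores = X @ W                                   # (m, k) inner products w_r . x_i
    q = np.exp(scores - scores.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)                # softmax importance weights q(i, r)
    indicator = np.zeros_like(q)
    indicator[np.arange(len(y)), y] = 1.0
    return (q - indicator).T @ X[:, j]               # (k,) gradient for row j
```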
6. Experiments

One of the objectives of this section is to demonstrate empirically that the proposed algorithms are effective in obtaining sparse and accurate models. In our first series of experiments, we focus on boosting and feature induction, investigating the effect of $\ell_1$-regularization and its ability to automatically cease inducing new features in the presence of regularization. We examined both classification and regression tasks. For classification, we used the Reuters RCV1 corpus (Lewis et al., 2004), which consists of 804,414 news articles and around 100,000 words that constitute our features. Each article is labeled by at least one label from the set MCAT, CCAT, ECAT, and GCAT, and we train a classifier for each class. We show average logistic loss rates over a series of tests using 30,000 randomly chosen articles with a 70/30 training/test split in Fig. 4 (error rates are similar and we omit them for space). As a baseline for comparison, we used boosting regularized with a smooth-$\ell_1$ penalty (Dekel et al., 2005), an approximation to the $\ell_1$-norm. We also compared our approach to $\ell_2$-regularized logistic regression with features chosen using mutual information with the target (details omitted). The regularization parameters were chosen using 5-fold cross-validation. For both boosting algorithms, we ran the "totally corrective" variant (Warmuth et al., 2006). Concretely, we added the 30 top-scoring features on every iteration for the $\ell_1$ booster and the single top-scoring feature for the smooth-$\ell_1$ regularized booster, then reoptimized the weights of all the features previously selected.

[Figure 4. Results on the 4 top-level classes from RCV1. Panels: MCAT, CCAT, ECAT, and GCAT loss rates versus iterations; curves: Smooth-train, Smooth-test, L1-train, L1-test, L2-test.]

The graphs in Fig. 4 suggest an interesting story. In all of them, the $\ell_1$-regularized booster actually ceased adding features around iteration 30, retaining about 700 features with non-zero weights after the back-pruning/optimization steps. Therefore, the plots for $\ell_1$-AdaBoost terminate early, while the smooth-$\ell_1$ booster keeps adding features and starts overfitting to the training set as early as iteration 200.

We also performed comparisons on regression tasks. We describe here results for the Boston housing dataset from the UCI repository and an F16 aircraft control dataset, where the goal is to predict an action on the ailerons of the aircraft given its state. We used the $\epsilon$-insensitive logistic regression loss (Dekel et al., 2005) to learn a predictor $w$. In this case, our objective is $\sum_{i=1}^m \log\left(1 + e^{w \cdot x_i - y_i - \epsilon}\right) + \log\left(1 + e^{y_i - w \cdot x_i - \epsilon}\right)$, which approximates $\epsilon$-insensitive hinge regression. For $\epsilon$-insensitive regression, an analysis similar to that for standard boosting can be performed to compute $\mu^+$ and $\mu^-$ for every feature, which allows us to perform boosting as already described. For these tests, we compared unregularized "classical" (forward-greedy) sequential AdaBoost, our $\ell_1$-regularized totally-corrective boosting, $\ell_1$-regularized least squares (Friedman et al., 2007), and the $\ell_2$-regularized $\epsilon$-insensitive hinge loss. The boosters used a countably infinite set of features by examining all products of features and were started with a single bias feature. The algorithms could thus construct arbitrarily many products of raw features as base hypotheses and explore correlations between the features. For the $\ell_1$-regularized least squares and the hinge loss, we simply trained on the base regressors.

[Figure 5. Results for Boston Housing (left) and Aileron (right): test RMSE versus boosting iterations for L1-boost, Classical, Hinge, and L1-LS.]

Fig. 5 shows the results of these tests, with the Housing results on the left. Each plot contains the root-mean-square error on the test set (the absolute error on test is qualitatively similar). The $\ell_1$-regularized booster stopped after inducing an average of under 35 features, marked with a star in the graphs, after which a dotted line indicates the booster's unchanged performance. We see that even when classical AdaBoost is allowed to run for 1,000 iterations, its performance on test does not match the performance attained with the 35 features induced by the $\ell_1$-regularized model. In fact, even after 3,000 iterations, the smooth-$\ell_1$ variant was not able to outperform the 35-feature model built by the $\ell_1$-penalized version. Furthermore, the latter trains at least an order of magnitude faster than the classical forward-greedy regressor and results in a significantly simpler model. $\ell_1$-penalized AdaBoost also outperforms $\ell_1$-penalized least squares and the $\ell_2$-regularized hinge loss with respect to both the squared and absolute errors.

In the next set of experiments, we compare the different structured regularizers with multiclass logistic losses to unstructured regularization. As our datasets are single-label, we omit experiments with multitask losses. For all multiclass experiments, we focus on two error metrics. The first is the misclassification rate. The second is coverage, which measures how wrong a classifier is: given $k$ weight vectors $w_r$ and an example $x_i$ with label $y_i$, the coverage is the position of the correct weight vector in the list of inner products $w_r \cdot x_i$ sorted in descending order. For example, if $w_{y_i} \cdot x_i$ is the largest, the coverage is 0; if it is third, the coverage is 2.
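As a concrete illustration (our code, not the authors'), the coverage of a single example can be computed as:

```python
import numpy as np

def coverage(W, x, y):
    """Number of classes whose score w_r . x strictly exceeds the true class's score."""
    scores = W.T @ x              # W is (n, k); scores[r] = w_r . x
    return int(np.sum(scores > scores[y]))
```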
We begin with the StatLog LandSat dataset (Spiegelhalter & Taylor, 1994), which consists of spectral values of pixels in 3x3 neighborhoods in a satellite image. We expanded the data by taking products of all features, giving 1296 features per example. The goal is to classify a pixel (a piece of ground) as one of six ground types. In Fig. 6, we plot coverage as a function of sparsity and as a function of the number of features actually used (trained with GradBoost). The plot on the left shows the test set coverage as a function of the proportion of zeros in the learned weight matrix $W$ (we plot results for a training set of 240 examples per class, as results are similar for smaller and larger sets). On the right, we show test set coverage as a function of the actual number of features that need to be computed to classify a piece of ground, that is, the proportion of all-zero rows in $W$. From the plots, we see that for a given performance level, the $\ell_1$-regularized solution is sparser in terms of the absolute number of zeros. However, the $\ell_1$-regularized classifier requires at least 50% more features to be computed than does the $\ell_1/\ell_2$-regularized classifier for the same test accuracy. The results for misclassification rates are similar, and the variance of each point in the plots is smaller than $10^{-3}$.

[Figure 6. Left: test set coverage versus overall sparsity of $W$ (proportion nonzero). Right: test set coverage versus row sparsity of $W$ (proportion of features used); curves for $\ell_1$, $\ell_1/\ell_2$, and $\ell_1/\ell_\infty$.]

We also ran tests using the MNIST handwritten digits database. The dataset consists of 60,000 training examples with a 10,000-example test set and has 10 classes. Each image is a gray-scale 28x28 image, which we represent as $x_i \in \mathbb{R}^{784}$. Rather than directly use the input $x_i$, however, we learned weights $w_j$ using kernel-based weak hypotheses $h_j(x) = K(x_j, x)$, with $K(x, z) = e^{-\frac{1}{2}\|x - z\|^2}$, for $j \in S$, where $S$ is a 2766-element support set we generate by doing one pass through the data with the perceptron and keeping the examples on which it makes mistakes. This gives a 27,660-dimensional multiclass problem.
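For concreteness, a hypothetical sketch of how such kernel-based weak hypotheses can be materialized as a feature matrix (the function and variable names, and the array shapes, are our assumptions):

```python
import numpy as np

def rbf_weak_hypotheses(X, support):
    """Feature matrix H with H[i, j] = K(support[j], X[i]) = exp(-0.5 * ||X[i] - support[j]||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(support**2, axis=1)[None, :]
        - 2.0 * X @ support.T
    )
    return np.exp(-0.5 * sq_dists)
```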
On the left side of Fig. 7, we plot the coverage rate for each algorithm on the 10,000-example test set versus the number of training examples used per class (we choose regularization values by cross-validation). Each algorithm improves as the number of training examples grows; however, it is clear that the sparsity-inducing regularizers, specifically the structural $\ell_1/\ell_\infty$ and $\ell_1/\ell_2$ regularizers, give better performance than the others. The error rate on the test set is roughly 50% of the coverage and qualitatively similar.

[Figure 7. Left: coverage on MNIST versus the number of training examples per class ($\ell_1$, $\ell_1/\ell_2$, $\ell_2$, $\ell_1/\ell_\infty$). Right: training time, shown as the training objective $Q(W^t) - Q(W^\star)$ versus iterations for AdaBoost and GradBoost on LandSat and MNIST.]

We conclude with a direct comparison of AdaBoost and GradBoost with $\ell_1/\ell_\infty$-regularization. On the right side of Fig. 7 and in Fig. 8, we plot the training objective, test error rate, and sparsity of the classifiers as a function of training time for both AdaBoost and GradBoost on the LandSat and MNIST datasets. From Fig. 8, we see that both AdaBoost and GradBoost indeed leverage induction during the first few thousand iterations, adding many features that contribute to decreasing the loss.

[Figure 8. Left: MNIST/LandSat test error rate versus iterations. Right: sparsity (% non-zero rows) versus iterations, for AdaBoost and GradBoost.]
They then switch to a back-pruning phase in which they remove features that are not predictive enough, without increasing mistakes on the test set. We saw similar behavior across many datasets, which underscores the ability of the algorithms to perform both feature induction and backward pruning in tandem. Though this paper omits them, we note that our algorithms all enjoy convergence guarantees, provide scoring mechanisms for the induction of features in boosting, and give termination criteria for boosting based on the sparsifying regularizers. These results and full proofs are available in the long version of the paper on the first author's website: http://www.cs.berkeley.edu/~jduchi.

References

Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific.

Collins, M., Schapire, R., & Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47, 253-285.

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2005). Smooth epsilon-insensitive regression by loss symmetrization. Journal of Machine Learning Research, 6, 711-741.

Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8, 1217-1260.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 337-374.

Friedman, J., Hastie, T., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302-332.

Koh, K., Kim, S., & Boyd, S. (2007). An interior-point method for large-scale $\ell_1$-regularized logistic regression. Journal of Machine Learning Research, 8, 1519-1555.

Lewis, D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361-397.

Meir, R., & Rätsch, G. (2003). An introduction to boosting and leveraging. Advanced Lectures on Machine Learning (pp. 119-184). Springer.

Negahban, S., & Wainwright, M. (2008). Phase transitions for high-dimensional joint support recovery. Advances in Neural Information Processing Systems 22.

Obozinski, G., Taskar, B., & Jordan, M. (2007). Joint covariate selection for grouped classification (Technical Report 743). Department of Statistics, University of California, Berkeley.

Spiegelhalter, D., & Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.

Warmuth, M., Liao, J., & Rätsch, G. (2006). Totally corrective boosting algorithms that maximize the margin. Proceedings of the 23rd International Conference on Machine Learning (pp. 1001-1008).

Zhang, T. (2008). Adaptive forward-backward greedy algorithm for sparse learning with linear models. Advances in Neural Information Processing Systems 22.

Zhang, T., & Yu, B. (2005). Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33, 1538-1579.

Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2567.