Boosting Classifiers with Tightened L0-Relaxation Penalties

Noam Goldberg  noamgold@tx.technion.ac.il
Faculty of IE&M, Technion - Israel Institute of Technology, Haifa 3200, Israel

Jonathan Eckstein  jeckstei@rci.rutgers.edu
MSIS Department and RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway NJ 08854

Abstract

We propose a novel boosting algorithm which improves on current algorithms for weighted voting classification by striking a better balance between classification accuracy and the sparsity of the weight vector. In order to justify our optimization formulations, we first consider a novel integer linear program as a model for sparse classifier selection, generalizing the minimum disagreement halfspace problem whose complexity has been investigated in computational learning theory. Specifically, our mixed integer problem is that of finding a separating hyperplane with minimum empirical error subject to an L0-norm penalty. We note that common "soft margin" linear programming formulations for robust classification are equivalent to the continuous relaxation of our formulation. Since the initial continuous relaxation is weak, we suggest a tighter relaxation, using novel cutting planes, to better approximate the integer solution. To solve this relaxation, we propose a new boosting algorithm based on linear programming with dynamic generation of variables and constraints. We demonstrate the classification performance of our proposed algorithm with experimental results, and justify our selection of parameters using a minimum description length, compression interpretation of learning.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

Consider a binary classification problem with M observations, each consisting of N real-valued attributes, represented as a matrix A ∈ R^{M×N} whose rows correspond to observations and whose columns correspond to attributes. We are also given a vector of labels y ∈ {−1, 1}^M. We have a potentially large set of base classifiers h_u : R^N → {−1, 0, 1}, indexed by the set U = {1, ..., U}, and would like to train a weighted voting classifier g(x) = Σ_{u∈U} α_u h_u(x); we classify a new observation x ∈ R^N as either positive or negative based on sgn(g(x)). To determine α, the empirical risk minimization strategy minimizes the sum over the observations of the 0/1 loss

    ℓ(A_i, y_i, g(A_i)) = I(y_i ≠ sgn(g(A_i))),                         (1)

where I(·) is the 0/1 indicator function and A_i is the ith row of A. Empirical risk minimization can result in overfitting and significantly larger losses with respect to unseen test data; robust algorithms for classification mitigate this problem by considering other loss functions, such as the soft margin loss (with margin fixed to 1)

    ℓ(A_i, y_i, g(A_i)) = max{1 − y_i g(A_i), 0},                       (2)

and adding a model complexity penalty. A simple complexity penalty is the number of features used by the classifier, that is, the number of u ∈ U for which α_u ≠ 0. This quantity is denoted ‖α‖_0 and called the L0-norm (although it is not a true norm). In order to avoid hard combinatorial optimization problems, methods such as "soft margin" LPs (Graepel et al., 1999; Rätsch et al., 2001; Demiriz et al., 2002) and SVMs (Cortes & Vapnik, 1995) use the L1 or L2 norms of α, respectively, instead of L0. This approach has the computational advantage of yielding a convex optimization problem, provided one is using an appropriate loss function such as (2). Demiriz et al. (2002) solve the following linear programming (LP) problem, based on previous formulations of Graepel et al. (1999) and Rätsch et al. (2001):

    max_{ρ,α,ξ}  ρ − D Σ_{i=1}^M ξ_i
    s.t.  diag(y)Hα + ξ ≥ ρ1,  Σ_{u=1}^U α_u = 1,  α, ξ ≥ 0,           (3)

where diag(y) is the diagonal matrix with entries y_1, ..., y_M.
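To make formulation (3) concrete, the following sketch solves a tiny instance with `scipy.optimize.linprog`; the prediction matrix H, labels, and penalty D below are invented for illustration and are not taken from the paper's experiments.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (illustrative): 4 observations, 3 base classifiers.
y = np.array([1, 1, -1, -1])
H = np.array([[ 1,  1,  1],
              [ 1, -1,  1],
              [-1, -1,  1],
              [ 1, -1, -1]])
M, U = H.shape
D = 0.5  # margin-error penalty

# Variables x = (alpha_1..alpha_U, xi_1..xi_M, rho); minimize -rho + D*sum(xi).
c = np.concatenate([np.zeros(U), D * np.ones(M), [-1.0]])

# Margin constraints diag(y) H alpha + xi >= rho * 1, written as A_ub x <= 0.
A_ub = np.hstack([-(y[:, None] * H), -np.eye(M), np.ones((M, 1))])
b_ub = np.zeros(M)

# Normalization sum(alpha) = 1.
A_eq = np.concatenate([np.ones(U), np.zeros(M), [0.0]])[None, :]
b_eq = [1.0]

bounds = [(0, None)] * (U + M) + [(None, None)]  # rho is free
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
alpha, xi, rho = res.x[:U], res.x[U:U + M], res.x[-1]
```

The recovered α lies on the simplex, and each ξ_i absorbs any shortfall of observation i's margin below ρ, exactly as the constraint in (3) requires.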
In this formulation, each observation i ∈ {1, ..., M} has a variable ξ_i which allows it to have a margin smaller than ρ. The margin of observation i is equal to y_i H_i α, where H_i is the ith row of H, an M × U matrix whose elements are H_iu = h_u(A_i). The parameter D penalizes the magnitude of each margin deviation ξ_i = max{ρ − y_i H_i α, 0}. In (3), the objective is to maximize the margin of separation, minus a misclassification penalty; however, this formulation is known to be equivalent, for some appropriate choice of the constant D′, to

    min_{α,ξ}  Σ_{u=1}^U α_u + D′ Σ_{i=1}^M ξ_i
    s.t.  diag(y)Hα + ξ ≥ 1,  α, ξ ≥ 0;                                 (4)

see for example Bennett & Bredensteiner (2000). Here, the margin is fixed to 1, and one minimizes the L1 norm of α, plus a misclassification penalty. The matrix H, in which each column corresponds to the vector of labels assigned by a base classifier u ∈ U, usually has too many columns to be written in full as part of the LP formulation. Instead, Demiriz et al. (2002) propose, in their LP-Boost algorithm, a column generation procedure that creates columns of H as they are needed; to be exact, they generate cuts for the dual formulation, which is essentially equivalent.

Here, we propose a related approach for an enhanced formulation of the learning problem. Instead of (4), which minimizes the sum of (2) over the observations, plus D′‖α‖_1, we consider the combinatorial problem which minimizes the sum of the loss (1) over the observations, plus a penalty proportional to the density of the vector α. Specifically, for any scalar C ≥ 0, we consider the problem

    min_{α≥0}  Σ_{i=1}^M I(y_i H_i α ≤ 0) + C ‖α‖_0.                    (5)

Even special cases of (5) are known to be NP-hard (see Section 2). First, we formulate the problem as a mixed integer program (MIP) and point out that the continuous relaxation of this MIP is in fact equivalent to (4). Rather than attempting to solve our MIP formulation, which is a potentially difficult combinatorial problem, or solving its weak continuous relaxation, which would be equivalent to (4) and thus to (3) and prior work, we propose a computationally tractable "middle ground": we solve a version of the continuous relaxation augmented by a large class of inequalities that are satisfied by all integer-feasible solutions of the MIP; see Section 3. We propose an algorithm which solves our relaxation formulation by dynamically generating not only variables (as in LP-Boost), but also constraints, and investigate the properties of the classifiers it constructs. Previous related work such as Weston et al. (2003) proposes methods based on finding a local minimum of a non-convex approximation of (5). Since our method's approximation of (5) is a true relaxation which may be solved to optimality, our approach, unlike this earlier work, yields a lower bound for (5), and its output is not dependent on which of potentially many local minima might be encountered.

Before exploring the details of our proposed approach, we briefly discuss some learning-theoretic motivations for starting from a formulation like (5), as well as some analysis which can suggest how to choose the penalty parameter C; these issues are explored in greater detail in Goldberg & Eckstein (2010). First, it has been known dating back to Freund & Schapire (1997) that there are bounds on the classifier prediction risk which can be expressed in terms of ‖α‖_0. Another approach to prediction risk and model selection uses the minimum description length (MDL) principle, which views learning as a problem of optimally compressing the data (Grünwald, 2007). The promise of using compression-based risk bounds is suggested, for example, in Von Luxburg et al. (2004) in the case of SVMs. In particular, we consider a two-part-code MDL approach which views the selection of the classifier function g as a problem of minimizing the length of a code used to transmit the vector of labels; the first part of the code describes the classifier g, and the second part encodes which observations are misclassified by g. In particular, Blum & Langford (2003) provide a classification risk bound proportional to the code length

    L̄[y, g] = L[g] + L[y|g],

where L[g] denotes the number of bits required to encode g, and L[y|g] is the number of bits needed to encode those labels y that are incorrectly predicted by g. One can derive an upper bound on L[g] that is proportional to ‖α‖_0, and an upper bound on L[y|g] that is proportional to Σ_{i=1}^M I(y_i H_i α ≤ 0). Therefore, with an appropriate choice of C, problem (5) minimizes a bound on L̄[y, g], which in turn corresponds to a bound on the prediction risk. In some applications, the bound on L[g] can be tightened by minimizing an objective function that applies potentially different penalties c_u to different features u. We will explore such a generalization of the penalty term C‖α‖_0 below.

2. Sparse Weighted Voting Classifier Selection (SWVC)

Finding α ∈ R^U that minimizes Σ_{i=1}^M I(y_i H_i α ≤ 0) is known as the minimum disagreement halfspace (MDH) problem, and is NP-hard (Höffgen et al., 1995). We call problem (5), which generalizes MDH by including an L0 penalty and also requires α ≥ 0, the sparse weighted voting classifier (SWVC) problem. Requiring α ≥ 0 is without loss of generality, since at the cost of U being at most twice as large, we may include in U the negative "mirror image" h_{u−} = −h_u of any base classifier h_u. Considering the special case C = 0, existing MDH complexity and inapproximability results clearly also apply to SWVC. These computational complexity results also extend to other values of C (Goldberg & Eckstein, 2010); for example, the problem remains NP-hard for C ∈ O(M^{1−ε}) for all ε > 0.
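The combinatorial character of (5) can be illustrated with a brute-force sketch. The instance below is invented, and for simplicity the weights on each candidate support are fixed to be uniform rather than optimized, so this only illustrates the exponential search over supports, not an exact SWVC solver.

```python
import itertools
import numpy as np

def swvc_objective(alpha, H, y, C):
    """Objective (5): number of 0/1 losses plus C times the support size."""
    margins = y * (H @ alpha)
    return int(np.sum(margins <= 0)) + C * int(np.count_nonzero(alpha))

# Tiny illustrative instance: 5 observations, 3 base classifiers.
y = np.array([1, 1, -1, -1, 1])
H = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1],
              [-1, -1,  1],
              [ 1, -1, -1]])
C = 0.4
M, U = H.shape

# Naive search over supports (exponential in U -- viable only at toy sizes).
best = (np.inf, None)
for k in range(U + 1):
    for S in itertools.combinations(range(U), k):
        alpha = np.zeros(U)
        alpha[list(S)] = 1.0 / max(k, 1)   # uniform weights on the support
        best = min(best, (swvc_objective(alpha, H, y, C), S))
print(best)  # best objective value and its support
```

Here the first base classifier happens to label every observation correctly, so the single-feature support wins with objective C·1; with more features or noisier labels the accuracy/sparsity trade-off governed by C becomes nontrivial.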
We now formulate SWVC as a mixed integer program (MIP), using binary variables μ_u to indicate whether feature u is used, and binary variables ξ_i to indicate whether observation i is misclassified:

    min_{α,μ,ξ}  Σ_{i=1}^M ξ_i + C Σ_{u=1}^U μ_u
    s.t.  (α, μ, ξ) ∈ Q_{H,y},  ξ ∈ {0, 1}^M,  μ ∈ {0, 1}^U,            (6)

where Q_{H,y} is the soft margin classification polytope defined as

    Q_{H,y} = { (α, μ, ξ) | diag(y)Hα + (ME + 1)ξ ≥ 1,  α ≤ Eμ,
                α, μ ∈ R^U,  ξ ∈ R^M,  α, μ, ξ ≥ 0 },

and E is a suitably large constant. We can show that formulation (6) correctly models SWVC if E ≥ M^{M/2} (Goldberg & Eckstein, 2010). We next consider the continuous relaxation of (6),

    min_{α,μ,ξ}  { Σ_{i=1}^M ξ_i + C Σ_{u=1}^U μ_u | (α, μ, ξ) ∈ Q_{H,y} }.   (7)

Theorem 2.1. (α, ξ) is an optimal solution of (4) if and only if (α/(ME + 1), α/E, ξ) is an optimal solution of (7) with penalty C = 1/(D′(M + 1/E)).

Our proof of this result (Goldberg & Eckstein, 2010) is based on a one-to-one mapping, which scales objective values by a positive constant, between solutions that are feasible for (4) and solutions that are feasible for (7). An immediate consequence of Theorem 2.1 is that the SWVC relaxation (7) is equivalent to the soft margin formulation (4) for appropriate choices of the penalties C and D′. Since formulation (4) is known to be equivalent to the soft margin maximization formulation (3), it also follows that formulation (3) is equivalent to the SWVC relaxation (7).

3. Relaxing the Hard Problem and Strengthening the Relaxation

The standard continuous relaxation (7) of the MIP (6) can be very weak. Although we omit the details here, one can construct a family of examples in which the optimal integral and continuous relaxation objective values of (6) differ by a multiplicative factor of ME. In Goldberg & Eckstein (2010), we prove that one should have E ∈ Ω(2^M) in order to maintain the general correctness of formulation (6), implying that there are cases in which the ratio between the optimal value of (6) and the value of its continuous relaxation (7) is Ω(M 2^M).

We propose to solve a natural generalization of the SWVC relaxation which applies a potentially different penalty c_u to each feature u ∈ U, replacing the objective function of (7) with (8a) below. Also, for the purpose of comparison with formulation (3), we introduce the normalization constraint (8c) below, and treat the classifier margin ρ as a parameter, rather than fixing it at 1 as in (7):

    min_{α,μ,ξ}  Σ_{i=1}^M ξ_i + Σ_{u=1}^U c_u μ_u                       (8a)
    s.t.  Σ_{u∈U} y_i H_iu α_u + (1 + ρ)ξ_i ≥ ρ,   i = 1, ..., M         (8b)
          Σ_{u∈U} α_u = 1                                                (8c)
          μ_u − α_u ≥ 0,   u ∈ U                                         (8d)
          α_u, μ_u, ξ_i ≥ 0,   u ∈ U,  i = 1, ..., M.                    (8e)

This approach also eliminates the large constant E from the formulation, thus avoiding practical numerical difficulties. We now consider adding valid inequalities to (8a)-(8e) in order to strengthen its relaxation. We say that a base classifier h_u distinguishes between observations i and i′ if it classifies i correctly and differently from i′, that is, h_u(A_i) = y_i ≠ h_u(A_{i′}). Let S_{i,i′} = {u ∈ U | h_u(A_i) = y_i ≠ h_u(A_{i′})} denote the set of base classifiers that correctly classify observation i and distinguish it from i′. Let Δ = {(i, i′) | y_i ≠ y_{i′}} and consider the inequalities

    ξ_i + ξ_{i′} + Σ_{u∈S_{i,i′}} μ_u ≥ 1,   (i, i′) ∈ Δ.                (8f)

Intuitively, these inequalities assert that for each (i, i′) ∈ Δ, we must either misclassify at least one of the observations i and i′, or we need to use at least one of the distinguishing features in S_{i,i′}.

Theorem 3.1. The inequalities (8f) hold for all integer-feasible solutions of (8a)-(8e).

For a proof, see Goldberg (2010). Theorem 3.1 implies that appending constraints of the form (8f) to the formulation (8a)-(8e) or (7) does not eliminate any integer solutions; however, it may well eliminate solutions for which ξ or μ are not integral. In the boosting algorithm developed in the next section, the inequalities (8f) included in the formulation contribute dual variables v_{ii′} to the pricing problem: there, we would like to minimize, over all u ∈ U, the quantity

    c̄(u) = c_u − Σ_{(i,i′)∈A: u∈S_{i,i′}} v_{ii′} − Σ_{i=1}^M y_i H_iu w_i − σ.   (9)
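The separation step for the inequalities (8f) can be sketched as a brute-force scan over opposite-label pairs, which is quadratic in M, matching the bound on |Δ|. The function name and toy values below are illustrative, not part of the paper's implementation.

```python
import numpy as np

def violated_cuts(H, y, xi, mu, tol=1e-9):
    """List pairs (i, i') with y_i != y_i' whose inequality (8f),
    xi_i + xi_i' + sum_{u in S_{i,i'}} mu_u >= 1, is violated."""
    M, U = H.shape
    cuts = []
    for i in range(M):
        for ip in range(M):
            if y[i] == y[ip]:
                continue
            # S_{i,i'}: classifiers correct on i that differ on i'.
            S = [u for u in range(U) if H[i, u] == y[i] != H[ip, u]]
            lhs = xi[i] + xi[ip] + mu[S].sum()
            if lhs < 1 - tol:
                cuts.append((i, ip, tuple(S)))
    return cuts

# A fractional LP solution on a 2-observation toy instance:
H_toy = np.array([[1, 1], [1, -1]])
y_toy = np.array([1, -1])
cuts = violated_cuts(H_toy, y_toy,
                     xi=np.array([0.2, 0.1]), mu=np.array([0.5, 0.3]))
```

In the toy call, both ordered pairs violate (8f) because ξ_0 + ξ_1 + μ_1 = 0.6 < 1, so the corresponding cuts would be appended to A.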
4. The L0-RBoost Boosting Algorithm

We now describe a boosting algorithm, which we call L0-RBoost, that effectively solves (8), including the inequalities (8f). The overall approach is similar to Demiriz et al. (2002), in that we use a column generation method to iteratively generate elements of U as they are needed. At each iteration, the set of features is restricted to some subset Ω ⊆ U that has been generated so far, with all variables corresponding to features u ∈ U \ Ω implicitly set to 0. Due to the potentially large number of constraints (8f), our algorithm, unlike that of Demiriz et al. (2002), dynamically generates not only variables but also constraints. In particular, at each iteration, it solves the following LP problem:

LP(Ω, A): the problem (8), but fixing α_u = μ_u = 0 for all u ∈ U \ Ω, and including constraints of the form (8f) only for (i, i′) in some subset A ⊆ Δ.

When Ω and A are relatively small sets, LP(Ω, A) is equivalent to an LP of much smaller dimension than (8), as variables fixed to 0 may be dropped from the formulation. Let w_i, σ, and v_{ii′} denote the LP dual variables (or Lagrange multipliers) of constraints (8b), (8c), and (8f), respectively. Each iteration of L0-RBoost proceeds as follows: first, we solve LP(Ω, A), and next we attempt to use the resulting dual variable information to identify a feature u ∈ U \ Ω whose addition to the set of features would improve the optimal objective function value (8a). This step, which invokes the base or "weak" learning algorithm, coincides with the notion of a "pricing" step in LP column generation. The LP-Boost algorithm (Demiriz et al., 2002) operates in a similar manner, but solves (the dual of) the LP formulation (3). In LP-Boost, one would ideally like to execute the pricing/base learning step by solving the problem max_{u∈U} Σ_{i=1}^M y_i H_iu w_i, which is known as the maximum agreement problem (Kearns et al., 1994).
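In LP-Boost's pricing step just described, the base learner receives the duals w as observation weights and returns the base classifier that agrees most with the weighted labels. On a toy instance (values invented), the enumeration is a one-liner:

```python
import numpy as np

# Toy duals and predictions (illustrative values).
y = np.array([1, 1, -1, -1])
H = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [-1, -1,  1],
              [-1,  1,  1]])
w = np.array([0.4, 0.1, 0.3, 0.2])

# Weighted agreement sum_i y_i H_iu w_i for every base classifier u.
agreement = (y[:, None] * H).T @ w
u_star = int(np.argmax(agreement))
```

Here column 0 agrees with every label, so it attains the maximum agreement of w's total mass; in practice U is too large to enumerate and this maximization is delegated to the weak learner.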
In L0-RBoost, the pricing problem is more general than maximum agreement, due to the additional constraints (8f) and their corresponding dual variables v_{ii′}: we would like to minimize, over all u ∈ U, the reduced cost c̄(u) given in (9). If c̄(u) ≥ 0 for all u ∈ U, then any optimal solution of the problem LP(Ω, A) is also optimal for the larger problem LP(U, A) including all possible features. For brevity, we omit the formal proof of this assertion, which involves merging two constraints in the dual of LP(U, A), while preserving all optimal solutions. On the other hand, if solving (9) identifies some u* ∈ U \ Ω with c̄(u*) < 0, we expand Ω to include this new feature. Next, the L0-RBoost iteration performs an operation with no direct analog in LP-Boost: it attempts to expand the set of pairs A for which constraints of the form (8f) are included in the formulation. The algorithm augments A with pairs of observations (i, i′) that are distinguished by the new feature u*, and in some cases, when termination is imminent, by all (i, i′) for which (8f) is violated. It then proceeds to the next iteration, re-solving LP(Ω, A) with the expanded sets Ω and A. Algorithm 1 gives a complete description of L0-RBoost.

Algorithm 1 L0-RBoost
 1: Input: M × N matrix A and labels y ∈ {−1, 1}^M
 2: Output: (α, μ, ξ)
 3: Ω ← {1, 2}, where h_1(A_i) = 1 and h_2(A_i) = −1 for all i ∈ {1, ..., M}
 4: A ← ∅
 5: repeat
 6:   Solve LP(Ω, A), obtaining the solution (α, μ, ξ) and dual variables w, σ, and v_{ii′}, (i, i′) ∈ A
 7:   Solve the base learning problem: with c̄(u) defined as in (9), find u* ∈ Arg min_{u∈U} {c̄(u)}
 8:   Ω ← Ω ∪ {u*}
 9:   A ← A ∪ {(i, i′) ∈ Δ | h_{u*}(A_i) = y_i ≠ h_{u*}(A_{i′})}
10:   V ← {(i, i′) ∈ Δ | ξ_i + ξ_{i′} + Σ_{u∈S_{i,i′}} μ_u < 1}
11:   if c̄(u*) ≥ 0 and V ≠ ∅ then
12:     A ← A ∪ V
13:   end if
14: until c̄(u*) ≥ 0 and V = ∅

Note that unlike standard column generation and LP-Boost, each iteration involves the generation of two "coupled" variables α_u and μ_u.
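The reduced-cost computation (9) used in step 7 can be sketched as follows, assuming the duals w, σ, and v_{ii′} have already been read off the restricted LP; the function name and the toy numbers are invented for illustration.

```python
import numpy as np

def reduced_costs(H, y, c, w, sigma, v):
    """c_bar(u) per (9): v maps pairs (i, i') in the active cut set A to
    their dual values; w and sigma are the duals of (8b) and (8c)."""
    M, U = H.shape
    agreement = (y[:, None] * H).T @ w            # sum_i y_i H_iu w_i, per u
    cbar = np.asarray(c, dtype=float) - agreement - sigma
    for (i, ip), v_ii in v.items():
        for u in range(U):
            if H[i, u] == y[i] != H[ip, u]:       # u is in S_{i,i'}
                cbar[u] -= v_ii
    return cbar

# Toy instance: 2 observations, 2 base classifiers, one active cut.
H = np.array([[1, 1], [1, -1]])
y = np.array([1, -1])
cbar = reduced_costs(H, y, c=[0.5, 0.5],
                     w=np.array([0.2, 0.3]), sigma=0.1, v={(0, 1): 0.4})
```

Only the second classifier distinguishes the active pair, so the cut dual v_{01} lowers its reduced cost below zero; the pricing step would therefore adjoin u = 2 (index 1) to Ω.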
We initialize Ω to contain two simple elements, corresponding to the constant base classifiers that classify all observations as respectively either positive or negative. Using LP duality theory and the construction of the set V in step 10 of the algorithm, it can be shown that when L0-RBoost terminates, (α, μ, ξ) is optimal for (8). Note that it would also be possible to add multiple features at each iteration, and there are techniques for terminating the algorithm early while still providing guarantees on the quality of the solution; see Lübbecke & Desrosiers (2005) and Goldberg (2010). We did not find either technique necessary in the computational experiments presented in Section 6 below. The cuts added in step 9 are designed to increase the value of the newly generated variable μ_{u*}, which could otherwise be as small as α_{u*}. In practice, it may be possible to accelerate the algorithm by adding only a carefully chosen subset of the cuts specified in step 9; see the next section for an example. Step 12 augments A by all currently violated cuts, ensuring that all constraints (8f) are satisfied at termination. In principle, we could include such cuts at every iteration, but we only do so if c̄(u*) ≥ 0 and termination seems imminent. The reason is that the number of misclassified observations, which is strongly related to the number of violated cuts, tends to drop as the algorithm progresses; delaying step 12 thus allows fewer cuts to be generated in total. Finally, note that the number |Δ| of cuts (8f) is at most quadratic in the number of observations. Thus, for data sets with very few observations, it may be possible to start the algorithm with all possible cuts, initializing A = Δ.

5. Base Learning with Simple Rules

We now consider applying our boosting algorithm to construct ensembles of simple rules (Cohen & Singer, 1999; Friedman & Popescu, 2008). For this example, we assume that our given data matrix A is binary; note that any data matrix A ∈ R^{M×N} can be "binarized" using a number of attributes that is at most polynomially larger than N (Boros et al., 1997; Goldberg & Shan, 2007). In this binary setting, we choose our base classifiers to be signed Boolean monomials, that is, functions h_{u+}, h_{u−} : {0, 1}^N → {−1, 0, 1} given by

    h_{u+}(x) = Π_{j∈J} x_j Π_{c∈C} (1 − x_c)  and  h_{u−} = −h_{u+},

with J, C ⊆ {1, ..., N} and J ∩ C = ∅. The degree of such a monomial is simply |J| + |C|. When the set of possible features U corresponds to all possible monomial classifiers over the N variables, minimizing the function (9) is a generalization of the NP-hard maximum monomial agreement problem (Kearns et al., 1994). However, in most learning applications, it is reasonable to bound the degree of the monomials considered by a constant K; the monomial that maximally agrees with the data, in the sense of minimizing (9), can then be found in polynomial time. The base learning problem minimizing (9) can be solved by trivial enumeration for simple classes of base classifiers, such as monomials of some given small degree or trees of small depth (e.g., "decision stumps"). For monomials of larger degree, we experiment in Section 6 with an extension of the maximum monomial agreement algorithm of Eckstein & Goldberg (2008); for more details, see Goldberg (2010).

The MDL/compression motivation discussed in Section 1, together with the structural risk minimization (SRM) approach to learning (Vapnik, 1998; Shawe-Taylor et al., 1998), suggests an encoding scheme in which the complexity penalty assigned to a monomial feature varies with its degree. In particular, we set the cost c_u of using feature u to

    c_u = ( log K + log( 2^k binom(N, k) ) + ε ) / log M,                (10)

where k is the degree of the monomial corresponding to u, and ε is a small positive constant. A detailed account of this choice may be found in Goldberg & Eckstein (2010), but in brief the motivation is as follows: we first encode the degree k of the monomial corresponding to u, requiring approximately log K bits since the maximum allowed degree is K. Then, we encode which of the 2^k binom(N, k) possible monomials of degree k is meant, which requires an additional log( 2^k binom(N, k) ) bits. We then divide this quantity by log M, which is an upper bound on the number of bits needed to encode a misclassified observation; this normalization is required since the objective coefficients of the misclassification variables ξ_i in (8a) are all 1.

In our experiments applying Algorithm 1 to Boolean monomial base classifiers, we found it beneficial to add only a subset of the possible cuts in step 9. In particular, we found a small Hamming distance between A_i and A_{i′} to be a good indicator of the potential of cut (i, i′) to tighten the relaxation. After adjoining the base classifier u* to the formulation, we scan all pairs (i, i′) for which y_i = h_{u*}(A_i) ≠ h_{u*}(A_{i′}), and use only the cuts corresponding to pairs whose Hamming distance is within a fixed factor of the minimum encountered. This technique reduces the amount of time spent adding new rows to the LP constraint matrix (which can be time-consuming, since LP solvers typically store the constraint matrix in a column-oriented sparse form), and restrains the size of the LP. Steps 10-13 of the algorithm still ensure that there are no remaining violated constraints upon termination.

6. Experimental Work and Discussion

Using Boolean monomial base classifiers of maximum degree K = 1 or K = 5, we compared the classification performance of L0-RBoost (Algorithm 1) to LP-Boost (Demiriz et al., 2002). We experimented with five UCI datasets (Asuncion & Newman, 2007): BCW, VOTE, CLVHEART, HUHEART, and SONAR.
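For these experiments, the per-feature penalties are set via the encoding cost (10). A direct transcription, assuming base-2 logarithms for bit counts, is below; the helper name is ours, and the sizes N = 17, K = 5, M = 208 merely echo the binarized SONAR setting described in this section.

```python
from math import comb, log2

def monomial_cost(k, N, K, M, eps=1.5):
    """Feature cost (10): bits to encode the degree (about log K), plus bits
    to select one of the 2^k * binom(N, k) degree-k signed monomials, plus a
    small constant eps, all normalized by log M (the bits needed to encode
    one misclassified observation)."""
    return (log2(K) + k + log2(comb(N, k)) + eps) / log2(M)

# The penalty grows with the monomial degree k, discouraging complex rules.
costs = [monomial_cost(k, N=17, K=5, M=208) for k in range(1, 6)]
```

With ε = 1.5, as used in the experiments below, a degree-1 stump on these sizes costs roughly one misclassification, while a degree-5 monomial costs nearly three.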
Observations with missing attribute values were deleted, except for the congressional voting dataset (VOTE), where a missing value was treated as a distinct attribute value. The data were binarized using the technique of Boros et al. (1997); the resulting number of attributes ranged between 8 and 17, and varied by cross-validation (CV) run as well as by dataset.

The optimization formulation used by LP-Boost is (3) with D = 1/(νM). In this formulation, the parameter ν corresponds to an upper bound on the fraction of margin errors |{i | ξ_i > 0}|/M (Graepel et al., 1999); however, no method for choosing ν is suggested other than intensive cross-validation and making use of the classification results of other algorithms. The LP formulation (3) is known to find sparse classifiers, but is also sensitive to the choice of the tunable penalty parameter D. By choosing a small enough penalty D (i.e., a large enough ν) in (3), it is possible to find a sparse "degenerate" classifier with large ρ by assigning a large proportion of observations to be outliers. In our experiments with L0-RBoost, we set the parameters c_u of formulation (8) using (10) with ε = 1.5. We then investigated the dependence of L0-RBoost and LP-Boost on the tunable parameters ρ and ν, respectively, examining both classifier performance and sparsity.

Figure 1 displays the training accuracy, testing accuracy, and ‖α‖_0 attained by L0-RBoost and LP-Boost on the SONAR dataset, in relation to the algorithms' tunable parameters, for monomials of degree K = 1 or up to K = 5. Each plot point corresponds to 10 replications of a 10-fold CV experiment. For both algorithms, the figure shows that the classifiers selected tend to overfit the training data for small values of the parameters. This overfitting is by far the most pronounced for LP-Boost with K = 5. The experiments show that, over the entire range of parameter values, L0-RBoost generalizes well compared with LP-Boost. We also find that the performance of L0-RBoost is robust with respect to a wide range of choices of the parameter ρ. In LP-Boost, the margin ρ rises monotonically as one increases ν; so, in Figure 1, we show ρ rather than ν on the horizontal axis to facilitate comparison with L0-RBoost. We found the performance of LP-Boost to be highly sensitive to the choice of the parameter ν, confirming the observations of Demiriz et al. (2002). When the input data are not linearly separable by the set of base classifiers U, as is typically the case when K = 1, and ν is small, then a non-separating hyperplane with zero margin becomes optimal. Further, the value of ν that may be considered too small varied significantly between datasets.

Figure 2 plots accuracy on the test set versus ‖α‖_0, again for the SONAR dataset, summarizing the classification versus sparsity performance of the experiments depicted in Figure 1. As in Figure 1, each point corresponds to an average over 10 replications of a 10-fold CV experiment. In general, the ‖α‖_0 values produced by L0-RBoost are bounded within a smaller interval, which is to be expected, given that the cost coefficients c_u are fixed identically in all of the experiments. However, it is apparent that the classifiers computed by L0-RBoost tend to strike a better trade-off between accuracy and sparsity than the classifiers computed by LP-Boost.

Table 1 compares the average classification accuracy and sparsity obtained by the classifiers computed by L0-RBoost, LP-Boost, and SLIPPER (Cohen & Singer, 1999). SLIPPER combines AdaBoost with a heuristic base learner; in our case of binary data, the rules produced by this heuristic are simply monomials of arbitrary degree. The table entries in the top five rows are our own computational results for 20 replications of 10-fold CV experiments. In these experiments we compare L0-RBoost, with K = 1 or K = 5, to LP-Boost using the same base classifiers, and to SLIPPER with monomials of arbitrary degree. The L0-RBoost results in the table use the fixed parameter value ρ = 20/M, while the LP-Boost results in this top portion of the table use ν = 0.50 for K = 5, and ν = 0.56 for K = 1. These values of ν were chosen because they produced relatively sparse classifiers which seemed to perform the best on the SONAR and CLVHEART datasets. The bottom portion of the table shows a practical comparison of our results with the previously published results for LP-Boost with C4.5 and stump decision trees (Demiriz et al., 2002), and the previously published performance of the SLIPPER algorithm (Cohen & Singer, 1999), which were also based on a 10-fold CV setup. The LP-Boost results of Demiriz et al. (2002) use values of ν individually tuned for each dataset, so we did not expect to match all of their classification results. We ran the SLIPPER algorithm using the publicly available implementation, with all parameters set to their default values. The apparent discrepancy between our observation of SLIPPER's performance and the previously published results may be due to the binarization of the datasets in our experiments.

The results of Table 1 indicate that L0-RBoost finds classifiers that are approximately as sparse as the classifiers found by LP-Boost, but usually achieve superior classification performance. Finally, although we demonstrate this classification performance based on parameters that are justified analytically, without fine-tuning, we can only expect to improve the performance of L0-RBoost by using cross-validation to select the margin parameter ρ or the penalty parameters c_u.

Acknowledgments

Based upon work funded in part by the U.S. DHS under Grant Award #2008-DN-077ARI001-02. We thank Rob Schapire for helpful discussions and comments. The first author would also like to thank Kristin Bennett for comments at an early stage of this work.

References

Asuncion, A. and Newman, D.J. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

Bennett, K.P. and Bredensteiner, E.J. Duality and geometry in SVM classifiers. In Proc. of the 17th Int. Conf. on Machine Learning, pp. 57-64, 2000.

Blum, A. and Langford, J. PAC-MDL bounds. In Proc. of the 16th Annual Conf. on Computational Learning Theory, pp. 344-357, 2003.

Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A. Logical analysis of numerical data. Math. Programming, 79:163-190, 1997.

Cohen, W.W. and Singer, Y. A simple, fast, and effective rule learner. In Proc. of the 16th National Conf. on Artificial Intelligence, pp. 335-342, 1999.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20:273-297, 1995.

Demiriz, A., Bennett, K.P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46:225-254, 2002.

Eckstein, J. and Goldberg, N. An improved branch-and-bound method for maximum monomial agreement. In Optimization for Machine Learning, NIPS Workshop, 2008. URL http://opt2008.kyb.tuebingen.mpg.de/eckstein.pdf.

Freund, Y. and Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. of Computer and Systems Sciences, 55(1):119-139, 1997.

Friedman, J.H. and Popescu, B.E. Predictive learning via rule ensembles. Annals of Applied Statistics, 2(3):916-954, 2008.

Goldberg, N. Optimization for Sparse and Accurate Classifiers. PhD thesis, Rutgers University, 2010.

Goldberg, N. and Eckstein, J. Sparse weighted voting classifier selection and its LP relaxation. Technical Report RRR 9-2010, RUTCOR, Rutgers University, 2010.

Goldberg, N. and Shan, C. Boosting optimal logical patterns. In Proc. of SIAM Int. Conf. on Data Mining, 2007.

Graepel, T., Herbrich, R., Schölkopf, B., Smola, A., Bartlett, P., Müller, K.-R., Obermayer, K., and Williamson, R. Classification on proximity data with LP-machines. In Int. Conf. on Artificial Neural Networks, pp. 304-309, 1999.

Grünwald, P.D. The Minimum Description Length Principle. MIT Press, 2007.

Höffgen, K.-U., Simon, H.U., and Van Horn, K.S. Robust trainability of single neurons. J. of Computer and Systems Sciences, 50:114-125, 1995.

Kearns, M.J., Schapire, R.E., and Sellie, L.M. Toward efficient agnostic learning. Machine Learning, 17(2-3):115-141, 1994.

Lübbecke, M.E. and Desrosiers, J. Selected topics in column generation. Operations Research, 53(6):1007-1023, 2005.

Rätsch, G., Onoda, T., and Müller, K.-R. Soft margins for AdaBoost. Machine Learning, 42:287-320, 2001.

Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. Structural risk minimization over data-dependent hierarchies. IEEE Trans. on Information Theory, 44:1926-1940, 1998.

Vapnik, V.N. Statistical Learning Theory. John Wiley and Sons, 1998.

Von Luxburg, U., Bousquet, O., and Schölkopf, B. A compression approach to support vector model selection. J. of Machine Learning Research, 5:293-323, 2004.

Weston, J., Elisseeff, A., Schölkopf, B., and Tipping, M. Use of the zero norm with linear models and kernel methods. J. of Machine Learning Research, 3:1439-1461, 2003.

Figure 1. The classification performance of LP-Boost versus L0-RBoost with monomial base classifiers for the SONAR dataset: (a) L0-RBoost accuracy vs. margin; (b) LP-Boost accuracy vs. margin. Each point corresponds to the average of a 10-replication, 10-fold CV experiment.

Figure 2. Test accuracy vs. ‖α‖_0 on the SONAR dataset, (a) K = 1 and (b) K = 5: each point corresponds to an average over 10 replications of a 10-fold CV experiment with a particular input parameter.

Dataset (samples)        BCW (683)  VOTE (435)  CLVHEART (297)  HUHEART (294)  SONAR (208)
Method                   Acc        Acc         Acc             Acc            Acc
L0-RBoost K = 1          0.963      0.950       0.846           0.812          0.712
L0-RBoost K = 5          0.950      0.960       0.833           0.807          0.725
LP-Boost K = 1           0.925      0.957       0.774           0.803          0.735
LP-Boost K = 5           0.937      0.957       0.810           0.797          0.734
SLIPPER                  0.959      0.952       0.802           0.802          0.674
SLIPPER (published)      0.958      0.959       0.752           0.806          0.745
LP-Boost stumps (pub.)   0.966      --          0.795           --             0.870
LP-Boost C4.5 (pub.)     0.959      --          0.791           --             0.817

[‖α‖_0 entries, column layout lost in extraction: 9.1 9.4 3.8 5.5 19.5 6.0 3.0 2 2.1 3.9 10.3 38.9 7.6 25.3 14 70.8 10.4 27.4 3.8 11.2 14.7 7.2 21.3 8.6 30.8 18.8 85.7]

Table 1. Average accuracy and ‖α‖_0 for 20 replications of 10-fold cross-validation. The bottom three rows are as reported for SLIPPER (Cohen & Singer, 1999) and LP-Boost (Demiriz et al., 2002). Entries marked "--" are unavailable from the given publication.