ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

Ping Li
Department of Statistical Science, Cornell University, Ithaca, NY 14853 USA
pingli@cornell.edu

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We propose abc-boost (adaptive base class boost) for multi-class classification and present abc-mart, an implementation of abc-boost based on the multinomial logit model. The key idea is that, at each boosting iteration, we adaptively and greedily choose a base class. Our experiments on public datasets demonstrate the improvement of abc-mart over the original mart algorithm.

1. Introduction

Classification is a basic task in machine learning. A training data set {y_i, X_i}_{i=1}^N consists of N feature vectors (samples) X_i and N class labels y_i. Here y_i ∈ {0, 1, 2, ..., K-1} and K is the number of classes. The task is to predict the class labels. Among many classification algorithms, boosting has become very popular (Schapire, 1990; Freund, 1995; Freund & Schapire, 1997; Bartlett et al., 1998; Schapire & Singer, 1999; Friedman et al., 2000; Friedman, 2001). This study focuses on multi-class classification (i.e., K ≥ 3).

The multinomial logit model has been used for solving multi-class classification problems. Using this model, we first learn the class probabilities,

    p_{i,k} = Pr(y_i = k | x_i) = exp(F_{i,k}(x_i)) / Σ_{s=0}^{K-1} exp(F_{i,s}(x_i)),        (1)

and then predict each class label according to

    ŷ_i = argmax_k p_{i,k}.        (2)

A classification error occurs if ŷ_i ≠ y_i. In (1), F_{i,k} = F_{i,k}(x_i) is a function to be learned from the data. Boosting algorithms (Friedman et al., 2000; Friedman, 2001) have been developed to fit the multinomial logit model. Several search engine ranking algorithms used mart (multiple additive regression trees) (Friedman, 2001) as the underlying learning procedure (Cossock & Zhang, 2006; Zheng et al., 2008; Li et al., 2008).

Note that in (1), the values of p_{i,k} are not affected by adding a constant C to each F_{i,k}, because

    exp(F_{i,k} + C) / Σ_{s=0}^{K-1} exp(F_{i,s} + C) = e^C exp(F_{i,k}) / (e^C Σ_{s=0}^{K-1} exp(F_{i,s})) = exp(F_{i,k}) / Σ_{s=0}^{K-1} exp(F_{i,s}) = p_{i,k}.

Therefore, for identifiability, one should impose a constraint on F_{i,k}. One popular choice is to assume Σ_{k=0}^{K-1} F_{i,k} = const, which is equivalent to Σ_{k=0}^{K-1} F_{i,k} = 0, i.e., the sum-to-zero constraint.
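To make (1), (2), and the invariance argument above concrete, here is a minimal Python/NumPy sketch. It is an illustration only, not part of the original paper; the function and variable names are ours.

    import numpy as np

    def class_probabilities(F):
        """Multinomial logit probabilities, Eq. (1): p[i, k] = exp(F[i, k]) / sum_s exp(F[i, s]).
        F is an (N, K) float array of scores F_{i,k}; subtracting the row-wise max is
        only for numerical stability and does not change the probabilities."""
        F = F - F.max(axis=1, keepdims=True)
        expF = np.exp(F)
        return expF / expF.sum(axis=1, keepdims=True)

    def predict_labels(F):
        """Prediction rule, Eq. (2): yhat_i = argmax_k p_{i,k}."""
        return np.argmax(class_probabilities(F), axis=1)

    rng = np.random.default_rng(0)
    F = rng.normal(size=(5, 3))          # toy scores for N = 5 samples, K = 3 classes
    p = class_probabilities(F)

    # Adding a constant C to every F_{i,k} leaves p_{i,k} unchanged, which is why an
    # identifiability constraint such as the sum-to-zero constraint is imposed.
    assert np.allclose(p, class_probabilities(F + 7.0))

    # Enforcing the sum-to-zero constraint explicitly gives the same probabilities.
    assert np.allclose(p, class_probabilities(F - F.mean(axis=1, keepdims=True)))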
This study proposes abc-boost (adaptive base class boost), based on the following two key ideas:

1. Popular loss functions for multi-class classification usually need to impose a constraint such that only the values for K - 1 classes are necessary (Friedman et al., 2000; Friedman, 2001; Zhang, 2004; Lee et al., 2004; Tewari & Bartlett, 2007; Zou et al., 2008). Therefore, we can choose a base class and derive boosting algorithms for only the remaining K - 1 classes.

2. At each boosting step, we can adaptively choose the base class to achieve the best performance.

We present a concrete implementation named abc-mart, which combines abc-boost with mart. Our extensive experiments will demonstrate the improvements of our new algorithm.

2. Review of Mart and Gradient Boosting

Mart is the marriage of regression trees and functional gradient boosting (Friedman, 2001; Mason et al., 2000). Given a training dataset {y_i, x_i}_{i=1}^N and a loss function L, (Friedman, 2001) adopted a "greedy stagewise" approach to build an additive function F^(M),

    F^(M)(x) = Σ_{m=1}^{M} ρ_m h(x; a_m),        (3)

such that, at each stage m, m = 1 to M,

    {ρ_m, a_m} = argmin_{ρ, a} Σ_{i=1}^{N} L(y_i, F^(m-1) + ρ h(x_i; a)).        (4)

Here h(x; a) is the "weak" learner. Instead of directly solving the difficult problem (4), (Friedman, 2001) approximately conducted steepest descent in function space, by solving a least-squares problem (Line 4 in Alg. 1),

    a_m = argmin_{a, β} Σ_{i=1}^{N} [-g_m(x_i) - β h(x_i; a)]²,

where

    -g_m(x_i) = - [∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x) = F^(m-1)(x)}

is the steepest descent direction in the N-dimensional data space at F^(m-1)(x). For the other coefficient, ρ_m, a line search is performed (Line 5 in Alg. 1):

    ρ_m = argmin_ρ Σ_{i=1}^{N} L(y_i, F^(m-1)(x_i) + ρ h(x_i; a_m)).

Algorithm 1 Gradient boosting (Friedman, 2001).
 1: F(x) = argmin_ρ Σ_{i=1}^{N} L(y_i, ρ)
 2: For m = 1 to M Do
 3:   ỹ_i = - [∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x) = F^(m-1)(x)}
 4:   a_m = argmin_{a, β} Σ_{i=1}^{N} [ỹ_i - β h(x_i; a)]²
 5:   ρ_m = argmin_ρ Σ_{i=1}^{N} L(y_i, F^(m-1)(x_i) + ρ h(x_i; a_m))
 6:   F(x) = F(x) + ρ_m h(x; a_m)
 7: End

Mart adopted the multinomial logit model (1) and the corresponding negative multinomial log-likelihood loss

    L = Σ_{i=1}^{N} L_i,    L_i = - Σ_{k=0}^{K-1} r_{i,k} log p_{i,k},        (5)

where r_{i,k} = 1 if y_i = k and r_{i,k} = 0 otherwise. (Friedman, 2001) used the following derivatives:

    ∂L_i/∂F_{i,k} = -(r_{i,k} - p_{i,k}),        (6)
    ∂²L_i/∂F_{i,k}² = p_{i,k}(1 - p_{i,k}).        (7)

Alg. 2 describes mart for multi-class classification. At each stage, the algorithm solves the mean square problem (Line 4 in Alg. 1) by regression trees, and implements Line 5 in Alg. 1 by a one-step Newton update within each terminal node of the trees. Mart builds K regression trees at each boosting step. It is clear that the constraint Σ_{k=0}^{K-1} F_{i,k} = 0 need not hold.

Algorithm 2 Mart (Friedman, 2001, Alg. 6).
 0: r_{i,k} = 1, if y_i = k, and r_{i,k} = 0 otherwise.
 1: F_{i,k} = 0, k = 0 to K-1, i = 1 to N
 2: For m = 1 to M Do
 3:   For k = 0 to K-1 Do
 4:     p_{i,k} = exp(F_{i,k}) / Σ_{s=0}^{K-1} exp(F_{i,s})
 5:     {R_{j,k,m}}_{j=1}^{J} = J-terminal-node regression tree from {r_{i,k} - p_{i,k}, x_i}_{i=1}^{N}
 6:     β_{j,k,m} = (K-1)/K × Σ_{x_i ∈ R_{j,k,m}} (r_{i,k} - p_{i,k}) / Σ_{x_i ∈ R_{j,k,m}} (1 - p_{i,k}) p_{i,k}
 7:     F_{i,k} = F_{i,k} + ν Σ_{j=1}^{J} β_{j,k,m} 1_{x_i ∈ R_{j,k,m}}
 8:   End
 9: End

Alg. 2 has three main parameters. The number of terminal nodes, J, determines the capacity of the weak learner. (Friedman, 2001) suggested J = 6; (Friedman et al., 2000; Zou et al., 2008) commented that J > 10 is very unlikely. The shrinkage, ν, should be large enough to make sufficient progress at each step and small enough to avoid over-fitting; (Friedman, 2001) suggested ν ≤ 0.1. The number of iterations, M, is largely determined by the affordable computing time.
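As an illustration of Alg. 2 (not the author's implementation), the following Python sketch performs one mart boosting iteration. Scikit-learn's DecisionTreeRegressor stands in for the J-terminal-node regression tree, and the small constant guarding the division is our addition.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mart_step(F, X, r, J=6, nu=0.1):
        """One boosting iteration of mart (Alg. 2).
        F: (N, K) float array of current scores F_{i,k};  X: (N, d) features;
        r: (N, K) one-hot labels r_{i,k}.  Returns the updated scores."""
        N, K = F.shape
        p = np.exp(F - F.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)                              # Line 4: p_{i,k}
        for k in range(K):
            resid = r[:, k] - p[:, k]                                  # negative gradient, Eq. (6)
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, resid)   # Line 5
            leaf = tree.apply(X)                                       # terminal node of each sample
            for node in np.unique(leaf):
                idx = leaf == node
                num = resid[idx].sum()
                den = (p[idx, k] * (1.0 - p[idx, k])).sum()            # Hessian terms, Eq. (7)
                beta = (K - 1.0) / K * num / max(den, 1e-12)           # Line 6: one-step Newton value
                F[idx, k] += nu * beta                                 # Line 7: shrunken update
        return F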
3. Abc-boost and Abc-mart

Abc-mart implements abc-boost. Corresponding to the two key ideas of abc-boost in Sec. 1, we need to: (A) re-derive the derivatives of (5) under the sum-to-zero constraint; (B) design a strategy to adaptively select the base class at each boosting iteration.

3.1. Derivatives of the Multinomial Logit Model with a Base Class

Without loss of generality, we assume class 0 is the base. Lemma 1 provides the derivatives of the class probabilities p_{i,k} under the logit model (1).

Lemma 1

    ∂p_{i,k}/∂F_{i,k} = p_{i,k}(1 + p_{i,0} - p_{i,k}),   k ≠ 0,
    ∂p_{i,k}/∂F_{i,s} = p_{i,k}(p_{i,0} - p_{i,s}),   k ≠ s, k ≠ 0, s ≠ 0,
    ∂p_{i,0}/∂F_{i,k} = p_{i,0}(-1 + p_{i,0} - p_{i,k}),   k ≠ 0.

Proof: Note that F_{i,0} = - Σ_{k=1}^{K-1} F_{i,k}. Hence, for k ≠ 0,

    p_{i,k} = exp(F_{i,k}) / (Σ_{s=1}^{K-1} exp(F_{i,s}) + exp(-Σ_{s=1}^{K-1} F_{i,s})),

and therefore

    ∂p_{i,k}/∂F_{i,k} = [exp(F_{i,k}) Σ_{s=0}^{K-1} exp(F_{i,s}) - exp(F_{i,k}) (exp(F_{i,k}) - exp(F_{i,0}))] / (Σ_{s=0}^{K-1} exp(F_{i,s}))²
                      = p_{i,k}(1 + p_{i,0} - p_{i,k}).

The other derivatives can be obtained similarly.

Lemma 2 provides the derivatives of the loss (5).

Lemma 2 For k ≠ 0,

    ∂L_i/∂F_{i,k} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k}),
    ∂²L_i/∂F_{i,k}² = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}.

Proof:

    L_i = - Σ_{s=1, s≠k}^{K-1} r_{i,s} log p_{i,s} - r_{i,k} log p_{i,k} - r_{i,0} log p_{i,0}.

Its first derivative is

    ∂L_i/∂F_{i,k} = - Σ_{s=1, s≠k}^{K-1} (r_{i,s}/p_{i,s}) ∂p_{i,s}/∂F_{i,k} - (r_{i,k}/p_{i,k}) ∂p_{i,k}/∂F_{i,k} - (r_{i,0}/p_{i,0}) ∂p_{i,0}/∂F_{i,k}
                  = - Σ_{s=1, s≠k}^{K-1} r_{i,s}(p_{i,0} - p_{i,k}) - r_{i,k}(1 + p_{i,0} - p_{i,k}) - r_{i,0}(-1 + p_{i,0} - p_{i,k})
                  = - Σ_{s=0}^{K-1} r_{i,s}(p_{i,0} - p_{i,k}) + r_{i,0} - r_{i,k}
                  = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k}),

since Σ_{s=0}^{K-1} r_{i,s} = 1. And the second derivative is

    ∂²L_i/∂F_{i,k}² = - ∂p_{i,0}/∂F_{i,k} + ∂p_{i,k}/∂F_{i,k}
                    = - p_{i,0}(-1 + p_{i,0} - p_{i,k}) + p_{i,k}(1 + p_{i,0} - p_{i,k})
                    = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}.

3.2. The Exhaustive Strategy for the Base Class

We adopt a greedy strategy. At each training iteration, we try each class as the base and choose the one that achieves the smallest training loss (5) as the final base class for the current iteration. Alg. 3 describes abc-mart using this strategy (an illustrative code sketch is given at the end of Sec. 3).

Algorithm 3 Abc-mart using the exhaustive search strategy. The vector B stores the base class numbers.
 0: r_{i,k} = 1, if y_i = k, r_{i,k} = 0 otherwise.
 1: F_{i,k} = 0, p_{i,k} = 1/K, k = 0 to K-1, i = 1 to N
 2: For m = 1 to M Do
 3:   For b = 0 to K-1 Do
 4:     For k = 0 to K-1, k ≠ b, Do
 5:       {R_{j,k,m}}_{j=1}^{J} = J-terminal-node regression tree from {-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k}), x_i}_{i=1}^{N}
 6:       β_{j,k,m} = Σ_{x_i ∈ R_{j,k,m}} [-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})] / Σ_{x_i ∈ R_{j,k,m}} [p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}]
 7:       G_{i,k,b} = F_{i,k} + ν Σ_{j=1}^{J} β_{j,k,m} 1_{x_i ∈ R_{j,k,m}}
 8:     End
 9:     G_{i,b,b} = - Σ_{k≠b} G_{i,k,b}
10:     q_{i,k} = exp(G_{i,k,b}) / Σ_{s=0}^{K-1} exp(G_{i,s,b})
11:     L^(b) = - Σ_{i=1}^{N} Σ_{k=0}^{K-1} r_{i,k} log q_{i,k}
12:   End
13:   B(m) = argmin_b L^(b)
14:   F_{i,k} = G_{i,k,B(m)}
15:   p_{i,k} = exp(F_{i,k}) / Σ_{s=0}^{K-1} exp(F_{i,s})
16: End

3.3. More Insights into Mart and Abc-mart

First of all, one can verify that abc-mart recovers mart when K = 2. For example, consider K = 2, r_{i,0} = 1, r_{i,1} = 0; then ∂L_i/∂F_{i,1} = 2 p_{i,1} and ∂²L_i/∂F_{i,1}² = 4 p_{i,0} p_{i,1}. Thus, the factor (K-1)/K = 1/2 that appeared in Alg. 2 is recovered.

When K ≥ 3, it is interesting that mart used the averaged first derivatives. The following equality holds,

    Σ_{b≠k} {-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})} = K(r_{i,k} - p_{i,k}),

because

    Σ_{b≠k} {-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})}
    = - Σ_{b≠k} r_{i,b} + Σ_{b≠k} p_{i,b} + (K-1)(r_{i,k} - p_{i,k})
    = -1 + r_{i,k} + 1 - p_{i,k} + (K-1)(r_{i,k} - p_{i,k})
    = K(r_{i,k} - p_{i,k}).

We can also show that, for the second derivatives,

    Σ_{b≠k} {(1 - p_{i,b}) p_{i,b} + (1 - p_{i,k}) p_{i,k} + 2 p_{i,b} p_{i,k}} ≥ (K+2)(1 - p_{i,k}) p_{i,k},

with equality holding when K = 2, because

    Σ_{b≠k} {(1 - p_{i,b}) p_{i,b} + (1 - p_{i,k}) p_{i,k} + 2 p_{i,b} p_{i,k}}
    = (K+1)(1 - p_{i,k}) p_{i,k} + Σ_{b≠k} p_{i,b} - Σ_{b≠k} p²_{i,b}
    ≥ (K+1)(1 - p_{i,k}) p_{i,k} + Σ_{b≠k} p_{i,b} - (Σ_{b≠k} p_{i,b})²
    = (K+1)(1 - p_{i,k}) p_{i,k} + (1 - p_{i,k}) p_{i,k}
    = (K+2)(1 - p_{i,k}) p_{i,k}.

The factor (K-1)/K in Alg. 2 may be reasonably replaced by K/(K+2) (both equal 1/2 when K = 2), or smaller. In a sense, to make the comparisons more fair, we should have replaced the shrinkage ν in Alg. 3 by ν′, where

    ν′ = ν × (K-1)/K × (K+2)/K = ν (K² + K - 2)/K².

In other words, the shrinkage used in mart is effectively larger than the same shrinkage used in abc-mart.
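To make the exhaustive strategy concrete, here is a minimal Python sketch of one abc-mart boosting iteration (Alg. 3), using the derivatives from Lemma 2. It is an illustration only, not the author's implementation; scikit-learn's DecisionTreeRegressor stands in for the J-terminal-node regression tree, and the small constants guarding the division and the logarithm are our additions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def softmax(F):
        e = np.exp(F - F.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def abc_mart_step(F, X, r, J=6, nu=0.1):
        """One boosting iteration of abc-mart with the exhaustive strategy (Alg. 3).
        For each candidate base class b, fit K-1 trees on the (negative) first
        derivatives from Lemma 2, form candidate scores G with the sum-to-zero
        base-class column, and keep the candidate minimizing the training loss (5)."""
        N, K = F.shape
        p = softmax(F)
        best_loss, best_G = np.inf, None
        for b in range(K):                                          # Line 3: try every base class
            G = np.zeros_like(F)
            for k in range(K):
                if k == b:
                    continue
                grad = -(r[:, b] - p[:, b]) + (r[:, k] - p[:, k])   # tree target, Line 5 (negative of Lemma 2's 1st derivative)
                hess = (p[:, b] * (1 - p[:, b]) + p[:, k] * (1 - p[:, k])
                        + 2 * p[:, b] * p[:, k])                    # Lemma 2, second derivative
                tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, grad)
                leaf = tree.apply(X)
                G[:, k] = F[:, k]
                for node in np.unique(leaf):
                    idx = leaf == node
                    beta = grad[idx].sum() / max(hess[idx].sum(), 1e-12)   # Line 6
                    G[idx, k] += nu * beta                                 # Line 7
            G[:, b] = -G.sum(axis=1)                                # Line 9: G_{i,b,b} = -sum_{k != b} G_{i,k,b}
            loss = -np.sum(r * np.log(softmax(G) + 1e-300))         # Lines 10-11: training loss (5)
            if loss < best_loss:                                    # Line 13: keep the best base class
                best_loss, best_G = loss, G.copy()
        return best_G                                               # Lines 14-15: caller sets F and p from this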
4. Evaluations

Our experiments were conducted on several public datasets (Table 1), including one large dataset and several small or very small datasets. Even smaller datasets would be too sensitive to the implementation (or tuning) of the weak learners.

Table 1. Whenever possible, we used the standard (default) training and test sets. For Covertype, we randomly split the original dataset into halves. For Letter, the default test set consisted of the last 4000 samples. For Letter2k (Letter4k), we took the last 2000 (4000) samples of Letter for training and the remaining 18000 (16000) for test.

    dataset     K    # training   # test    # features
    Covertype   7    290506       290506    54
    Letter      26   16000        4000      16
    Letter2k    26   2000         18000     16
    Letter4k    26   4000         16000     16
    Pendigits   10   7494         3498      16
    Zipcode     10   7291         2007      256
    Optdigits   10   3823         1797      64
    Isolet      26   6218         1559      617

Ideally, we hope that abc-mart will improve mart (or at least be as good as mart) for every reasonable combination of tree size J and shrinkage ν. Except for Covertype and Isolet, we experimented with every combination of ν ∈ {0.04, 0.06, 0.08, 0.1} and J ∈ {4, 6, 8, 10, 12, 14, 16, 18, 20}. Except for Covertype, we let the number of boosting steps M = 10000; however, the experiments usually terminated earlier because machine accuracy was reached. For the Covertype dataset, since it is fairly large, we only experimented with J = 6, 10, 20 and ν = 0.1, and we limited M = 5000, which is probably already too large a learning model for real applications, especially applications that are sensitive to the test time.

4.1. Summary of Experiment Results

We define R_err, the "relative improvement of test mis-classification errors," as

    R_err = (error of mart - error of abc-mart) / error of mart.        (8)

Since we experimented with a series of parameters, J, ν, and M, we report, in Table 2, the overall "best" (i.e., smallest) mis-classification errors. Later, we will also report the more detailed mis-classification errors for every combination of J and ν, in Sec. 4.2 to 4.9. We believe this is a fair (side-by-side) comparison.

Table 2. Summary of test mis-classification errors.

    Dataset     mart    abc-mart   R_err (%)   P-value
    Covertype   11350   10420      8.2         0
    Letter      129     99         23.3        0.02
    Letter2k    2439    2180       10.6        0
    Letter4k    1352    1126       16.7        0
    Pendigits   124     100        19.4        0.05
    Zipcode     111     100        9.9         0.22
    Optdigits   55      43         21.8        0.11
    Isolet      80      64         20.0        0.09

In Table 2, we report the numbers of mis-classification errors mainly for the convenience of future comparisons with this work. The reported P-values are based on the error rates, for testing whether abc-mart has statistically lower error rates than mart. We should mention that testing the statistical significance of the difference of two small probabilities (error rates) requires particularly strong evidence.
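The paper does not state which statistical test produced the P-values in Table 2. As one plausible reading (our assumption, not the author's stated procedure), the sketch below computes R_err from Eq. (8) together with a one-sided two-proportion z-test of "abc-mart has a lower error rate than mart"; for the Letter row (129 vs. 99 errors on 4000 test samples) this test gives roughly 0.02.

    import math

    def relative_improvement(err_mart, err_abc):
        """R_err of Eq. (8): (error of mart - error of abc-mart) / error of mart."""
        return (err_mart - err_abc) / err_mart

    def one_sided_p_value(err_mart, err_abc, n_test):
        """One-sided two-proportion z-test (normal approximation); an assumption,
        not necessarily the test used in the paper."""
        p1, p2 = err_mart / n_test, err_abc / n_test
        pooled = (err_mart + err_abc) / (2.0 * n_test)
        se = math.sqrt(2.0 * pooled * (1.0 - pooled) / n_test)
        z = (p1 - p2) / se
        return 0.5 * math.erfc(z / math.sqrt(2.0))   # P(Z > z) under the null

    print(relative_improvement(129, 99))      # ~0.233, i.e., R_err = 23.3% (Letter row of Table 2)
    print(one_sided_p_value(129, 99, 4000))   # about 0.02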
4.2. Experiments on the Covertype Dataset

The results on Covertype are reported differently from the other datasets. Covertype is fairly large; building a very large model (e.g., M = 5000 boosting steps) would be expensive, and testing a very large model at run-time can be costly or infeasible for certain applications. Therefore, it is often important to examine the performance of the algorithm at much earlier boosting iterations.

Table 3 summarizes the smallest test mis-classification errors along with the relative improvements (R_err). For each J and ν, the smallest test errors, separately for abc-mart and mart, are the lowest points on the curves in Figure 2, which are almost always the last points on the curves for this dataset.

Table 3. Covertype. We report the test mis-classification errors of mart and abc-mart, together with the results from Friedman's MART program (in [ ]). The relative improvements (R_err, %) of abc-mart are included in ( ).

    ν     M      J    mart             abc-mart
    0.1   1000   6    40072 [39775]    34758 (13.3)
    0.1   1000   10   29456 [29196]    23577 (20.0)
    0.1   1000   20   19109 [19438]    15362 (19.6)
    0.1   2000   6    31541 [31526]    26126 (17.2)
    0.1   2000   10   21774 [21698]    17479 (19.7)
    0.1   2000   20   14505 [14665]    12045 (17.0)
    0.1   3000   6    26972 [26742]    22111 (18.0)
    0.1   3000   10   18494 [18444]    14984 (19.0)
    0.1   3000   20   12740 [12893]    11087 (13.0)
    0.1   5000   6    22511 [22335]    18445 (18.1)
    0.1   5000   10   15450 [15429]    13018 (15.7)
    0.1   5000   20   11350 [11524]    10420 (8.2)

Table 3 shows that abc-mart may improve mart by as much as R_err ≈ 20%, as opposed to the reported R_err = 8.2% in Table 2. Figure 1 indicates that abc-mart reduces the training loss (5) considerably and consistently faster than mart. Figure 2 demonstrates that abc-mart exhibits considerably and consistently smaller test mis-classification errors than mart.

Figure 1. Covertype. The training loss, i.e., (5), versus boosting iterations. [Panels: J = 10 and J = 20 with ν = 0.1; abc-mart curves labeled "abc"; figure omitted.]

Figure 2. Covertype. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]

4.3. Experiments on the Letter Dataset

We trained until the loss (5) reached machine accuracy, to exhaust the capacity of the learner so that we could provide a reliable comparison, up to M = 10000. Table 4 summarizes the smallest test mis-classification errors along with the relative improvements. For abc-mart, the smallest error usually occurred very close to the last iteration, as reflected in Figure 3, which again demonstrates that abc-mart exhibits considerably and consistently smaller test errors than mart. One observation is that the test errors are fairly stable across J and ν, unless J and ν are small (machine accuracy was not reached in those cases).

Table 4. Letter. We report the test mis-classification errors of mart and abc-mart, together with Friedman's MART program results (in [ ]). The relative improvements (R_err, %) of abc-mart are included in ( ).

            ν = 0.04                  ν = 0.06                  ν = 0.08                  ν = 0.1
            mart        abc-mart      mart        abc-mart      mart        abc-mart      mart        abc-mart
    J = 4   174 [178]   152 (12.6)    177 [176]   147 (16.9)    177 [177]   142 (19.8)    172 [177]   137 (20.3)
    J = 6   163 [153]   127 (22.1)    157 [160]   126 (19.7)    159 [156]   118 (25.8)    159 [162]   119 (25.2)
    J = 8   155 [151]   122 (21.3)    148 [152]   112 (24.3)    155 [148]   108 (30.3)    144 [151]   103 (28.5)
    J = 10  145 [141]   126 (13.1)    145 [148]   115 (20.7)    136 [144]   106 (22.1)    142 [136]   100 (29.6)
    J = 12  143 [142]   117 (18.2)    147 [143]   114 (22.4)    139 [145]   107 (23.0)    141 [145]   104 (26.2)
    J = 14  141 [151]   112 (20.6)    144 [150]   113 (21.5)    145 [144]   106 (26.9)    152 [142]   108 (28.9)
    J = 16  143 [148]   111 (22.4)    145 [146]   112 (22.8)    139 [145]   106 (23.7)    141 [137]   99 (29.8)
    J = 18  132 [132]   113 (14.4)    133 [137]   110 (17.3)    129 [135]   108 (16.3)    138 [134]   104 (24.6)
    J = 20  129 [143]   100 (22.5)    140 [135]   104 (25.7)    134 [139]   100 (25.4)    136 [143]   102 (25.0)

Figure 3. Letter. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]
4.4. Experiments on the Letter2k Dataset

Table 5 and Figure 4 present the test errors, illustrating the considerable and consistent improvement of abc-mart over mart on this dataset.

Table 5. Letter2k. The test mis-classification errors.

            ν = 0.04                     ν = 0.06                     ν = 0.08                     ν = 0.1
            mart          abc-mart       mart          abc-mart       mart          abc-mart       mart          abc-mart
    J = 4   2694 [2750]   2476 (8.1)     2698 [2728]   2458 (8.9)     2684 [2706]   2406 (10.4)    2689 [2733]   2407 (10.5)
    J = 6   2683 [2720]   2355 (12.2)    2664 [2688]   2319 (12.9)    2640 [2716]   2309 (12.5)    2629 [2688]   2314 (12.0)
    J = 8   2569 [2577]   2277 (11.4)    2603 [2579]   2281 (12.4)    2563 [2603]   2253 (12.1)    2571 [2559]   2241 (12.8)
    J = 10  2534 [2545]   2236 (11.8)    2516 [2546]   2204 (12.4)    2504 [2539]   2190 (12.5)    2491 [2514]   2184 (12.3)
    J = 12  2503 [2474]   2199 (12.1)    2516 [2465]   2210 (12.2)    2473 [2492]   2193 (11.3)    2492 [2455]   2200 (11.7)
    J = 14  2488 [2432]   2202 (11.5)    2467 [2482]   2218 (10.1)    2460 [2451]   2198 (10.7)    2460 [2454]   2180 (11.4)
    J = 16  2503 [2499]   2215 (11.5)    2501 [2494]   2216 (11.4)    2496 [2437]   2228 (10.7)    2500 [2424]   2202 (11.9)
    J = 18  2494 [2464]   2216 (11.1)    2497 [2482]   2208 (11.6)    2472 [2489]   2205 (10.8)    2439 [2476]   2213 (9.3)
    J = 20  2499 [2507]   2199 (12.0)    2512 [2523]   2195 (12.6)    2504 [2460]   2183 (12.8)    2482 [2505]   2213 (10.8)

Figure 4. Letter2k. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]

Figure 4 indicates that, in some cases, abc-mart was trained for more iterations than mart. In our experiments, we terminated training mart whenever the training loss (5) reached 10^-14, because we found that, for most datasets and parameter settings, mart had difficulty reaching a smaller training loss. In comparison, abc-mart usually had no problem reducing the training loss (5) down to 10^-16 or smaller; hence we terminated training abc-mart at 10^-16. For the Letter2k dataset, our choice of the termination conditions may cause mart to terminate earlier than abc-mart in some cases. Also, as explained in Sec. 3.3, the fact that mart effectively uses a larger shrinkage than abc-mart may be partially responsible for the phenomenon in Figure 4.
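A minimal sketch of the termination rule just described, assuming step and loss functions shaped like the earlier sketches (the function names are ours):

    def train_to_machine_accuracy(step_fn, loss_fn, F, X, r, is_abc_mart, M=10000):
        """Run up to M boosting iterations, but stop early once the training loss (5)
        reaches the threshold used in these experiments: 1e-14 for mart, 1e-16 for abc-mart."""
        tol = 1e-16 if is_abc_mart else 1e-14
        for m in range(M):
            F = step_fn(F, X, r)
            if loss_fn(F, r) <= tol:
                break
        return F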
4.5. Experiments on the Letter4k Dataset

Table 6 and Figure 5 report the test mis-classification errors on Letter4k.

Table 6. Letter4k. The test mis-classification errors.

            ν = 0.04                     ν = 0.06                     ν = 0.08                     ν = 0.1
            mart          abc-mart       mart          abc-mart       mart          abc-mart       mart          abc-mart
    J = 4   1681 [1664]   1407 (16.3)    1660 [1684]   1372 (17.3)    1671 [1664]   1348 (19.3)    1655 [1672]   1318 (20.4)
    J = 6   1618 [1584]   1292 (20.1)    1584 [1584]   1285 (18.9)    1588 [1596]   1261 (20.6)    1577 [1588]   1234 (21.8)
    J = 8   1531 [1508]   1259 (17.8)    1522 [1492]   1246 (18.1)    1516 [1492]   1191 (21.4)    1521 [1548]   1183 (22.2)
    J = 10  1499 [1500]   1228 (18.1)    1463 [1480]   1201 (17.9)    1479 [1480]   1181 (20.1)    1470 [1464]   1182 (19.6)
    J = 12  1420 [1456]   1213 (14.6)    1434 [1416]   1178 (17.9)    1409 [1428]   1170 (17.0)    1437 [1424]   1162 (19.1)
    J = 14  1410 [1412]   1181 (16.2)    1388 [1392]   1154 (16.9)    1377 [1400]   1148 (16.6)    1396 [1380]   1158 (17.0)
    J = 16  1395 [1428]   1167 (16.3)    1402 [1392]   1153 (17.8)    1396 [1404]   1154 (17.3)    1387 [1376]   1142 (17.7)
    J = 18  1376 [1396]   1164 (15.4)    1375 [1392]   1136 (17.4)    1357 [1400]   1126 (17.0)    1352 [1364]   1149 (15.0)
    J = 20  1386 [1384]   1149 (17.1)    1397 [1416]   1127 (19.3)    1371 [1388]   1126 (17.9)    1370 [1388]   1142 (16.4)

Figure 5. Letter4k. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]

4.6. Experiments on the Pendigits Dataset

Table 7 and Figure 6 report the test mis-classification errors on Pendigits.

Table 7. Pendigits. The test mis-classification errors.

            ν = 0.04                   ν = 0.06                   ν = 0.08                   ν = 0.1
            mart        abc-mart       mart        abc-mart       mart        abc-mart       mart        abc-mart
    J = 4   144 [145]   109 (24.3)     145 [146]   106 (26.9)     145 [144]   106 (26.9)     143 [142]   107 (25.2)
    J = 6   135 [139]   109 (19.3)     135 [140]   105 (22.2)     143 [137]   104 (27.3)     135 [138]   102 (24.4)
    J = 8   133 [133]   105 (20.5)     130 [132]   101 (22.3)     129 [133]   104 (19.4)     128 [134]   104 (18.8)
    J = 10  132 [132]   102 (17.7)     129 [128]   102 (20.9)     127 [128]   102 (19.7)     130 [132]   100 (23.1)
    J = 12  136 [134]   101 (25.7)     134 [134]   103 (23.1)     135 [140]   103 (23.7)     133 [134]   105 (21.1)
    J = 14  129 [126]   105 (18.6)     131 [131]   102 (22.1)     130 [133]   102 (21.5)     133 [131]   102 (23.3)
    J = 16  129 [127]   109 (15.5)     130 [129]   107 (17.7)     133 [132]   106 (20.3)     134 [126]   106 (20.9)
    J = 18  132 [129]   110 (16.7)     130 [129]   106 (18.5)     126 [128]   105 (16.7)     133 [131]   105 (21.1)
    J = 20  130 [129]   109 (16.2)     125 [125]   105 (16.0)     126 [130]   107 (15.1)     130 [128]   105 (19.2)

Figure 6. Pendigits. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]
4.7. Experiments on the Zipcode Dataset

Table 8. Zipcode. We report the test mis-classification errors of mart and abc-mart, together with the results from Friedman's MART program (in [ ]). The relative improvements (R_err, %) of abc-mart are included in ( ).

            ν = 0.04                   ν = 0.06                   ν = 0.08                   ν = 0.1
            mart        abc-mart       mart        abc-mart       mart        abc-mart       mart        abc-mart
    J = 4   130 [132]   120 (7.7)      125 [128]   113 (9.5)      129 [132]   116 (10.1)     127 [128]   109 (14.2)
    J = 6   123 [126]   110 (10.6)     124 [128]   112 (9.7)      123 [127]   109 (11.4)     126 [124]   104 (17.5)
    J = 8   120 [119]   106 (11.7)     122 [123]   102 (16.4)     122 [120]   103 (15.5)     123 [115]   103 (16.3)
    J = 10  118 [117]   103 (12.7)     118 [119]   104 (11.9)     120 [115]   106 (11.7)     118 [117]   105 (11.0)
    J = 12  117 [118]   103 (12.0)     116 [118]   101 (12.9)     117 [116]   101 (13.7)     118 [113]   104 (11.9)
    J = 14  118 [115]   103 (12.7)     120 [116]   106 (11.7)     119 [114]   103 (13.4)     118 [114]   104 (11.9)
    J = 16  119 [121]   106 (10.9)     111 [113]   102 (8.1)      116 [114]   100 (13.8)     115 [114]   104 (9.6)
    J = 18  113 [120]   102 (9.7)      114 [116]   100 (12.3)     114 [116]   101 (11.4)     114 [113]   101 (11.4)
    J = 20  114 [111]   104 (8.8)      112 [110]   103 (8.0)      115 [110]   105 (8.7)      111 [115]   105 (5.4)

Figure 7. Zipcode. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]

4.8. Experiments on the Optdigits Dataset

This dataset is one of the two largest datasets (Optdigits and Pendigits) used in a recent paper on boosting (Zou et al., 2008), which proposed the multi-class gentleboost and ADABOOST.ML algorithms. For Pendigits, (Zou et al., 2008) reported 3.69%, 4.09%, and 5.86% error rates, for gentleboost, ADABOOST.ML, and ADABOOST.MH (Schapire & Singer, 2000), respectively. For Optdigits, they reported 5.01%, 5.40%, and 5.18%, respectively.

Table 9. Optdigits. The test mis-classification errors.

            ν = 0.04                 ν = 0.06                 ν = 0.08                 ν = 0.1
            mart      abc-mart       mart      abc-mart       mart      abc-mart       mart      abc-mart
    J = 4   58 [61]   48 (17.2)      57 [58]   45 (21.1)      57 [57]   46 (19.3)      59 [58]   43 (27.1)
    J = 6   58 [57]   43 (25.9)      57 [54]   47 (17.5)      59 [59]   43 (27.1)      57 [56]   43 (24.6)
    J = 8   61 [62]   46 (24.6)      60 [58]   45 (25.0)      57 [59]   46 (19.3)      60 [56]   47 (21.6)
    J = 10  60 [62]   48 (20.0)      55 [59]   47 (14.5)      57 [57]   47 (17.5)      60 [60]   50 (16.7)
    J = 12  57 [60]   47 (17.5)      58 [59]   48 (17.2)      56 [60]   47 (16.1)      60 [59]   47 (21.7)
    J = 14  57 [57]   50 (12.3)      58 [61]   50 (13.8)      58 [59]   49 (15.5)      55 [57]   47 (14.6)
    J = 16  60 [60]   51 (15.0)      58 [59]   48 (17.2)      59 [58]   46 (22.0)      57 [58]   49 (14.0)
    J = 18  60 [59]   49 (18.3)      59 [60]   48 (18.6)      59 [58]   49 (20.0)      59 [57]   50 (15.3)
    J = 20  58 [60]   51 (12.1)      61 [62]   48 (21.3)      58 [60]   50 (13.8)      59 [60]   47 (20.3)

Figure 8. Optdigits. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]

4.9. Experiments on the Isolet Dataset

This dataset is high-dimensional, with 617 features, for which tree algorithms become less efficient. We therefore only conducted experiments for one shrinkage, i.e., ν = 0.1.

Table 10. Isolet. The test mis-classification errors.

            mart (ν = 0.1)   abc-mart (ν = 0.1)
    J = 4   80 [86]          64 (20.0)
    J = 6   84 [86]          67 (20.2)
    J = 8   84 [88]          72 (14.3)
    J = 10  82 [83]          74 (9.8)
    J = 12  91 [90]          74 (18.7)
    J = 14  95 [94]          74 (22.1)
    J = 16  94 [92]          78 (17.0)
    J = 18  86 [91]          78 (9.3)
    J = 20  87 [94]          78 (10.3)

Figure 9. Isolet. The test mis-classification errors versus boosting iterations (ν = 0.1). [Figure omitted.]
5. Conclusion

We present the concept of abc-boost and its concrete implementation named abc-mart, for multi-class classification (with K ≥ 3 classes). Two key components of abc-boost are: (A) by enforcing the (commonly used) sum-to-zero constraint on the loss function, we can derive boosting algorithms for only K - 1 classes, using a base class; (B) we adaptively (and greedily) choose the base class at each boosting step. Our experiments demonstrated the improvements. Comparisons with other boosting algorithms on some public datasets may be possible through prior publications, e.g., (Allwein et al., 2000; Zou et al., 2008). We also hope that our work can serve as a baseline for future development in multi-class classification.

Acknowledgement

The author thanks the reviewers and area chair for their constructive comments. The author also thanks Trevor Hastie, Jerome Friedman, Tong Zhang, Phil Long, Cun-Hui Zhang, and Giles Hooker. Ping Li is partially supported by NSF (DMS-0808864), ONR (N000140910911, young investigator award), and the "Beyond Search - Semantic Computing and Internet Economics" Microsoft 2007 Award.

References

Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113-141.

Bartlett, P., Freund, Y., Lee, W. S., & Schapire, R. E. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1651-1686.

Cossock, D., & Zhang, T. (2006). Subset ranking using regression. Conference on Learning Theory, 605-619.

Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232.

Friedman, J. H., Hastie, T. J., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28, 337-407.

Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67-81.

Li, P., Burges, C. J., & Wu, Q. (2008). McRank: Learning to rank using classification and gradient boosting. Advances in Neural Information Processing Systems, 897-904.

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (2000). Boosting algorithms as gradient descent. Advances in Neural Information Processing Systems, 512-518.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.

Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297-336.
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39, 135-168.

Tewari, A., & Bartlett, P. L. (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 1007-1025.

Zhang, T. (2004). Statistical analysis of some multicategory large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.

Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., & Sun, G. (2008). A general boosting method and its application to learning ranking functions for web search. Advances in Neural Information Processing Systems, 1697-1704.

Zou, H., Zhu, J., & Hastie, T. (2008). New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2, 1290-1306.