FilterBoost: Regression and Classification on Large Datasets

Joseph K. Bradley
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
jkbradle@cs.cmu.edu

Robert E. Schapire
Department of Computer Science
Princeton University
Princeton, NJ 08540
schapire@cs.princeton.edu

Abstract

We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, represents the first boosting-by-filtering algorithm which is truly adaptive and does not need the less realistic assumptions required by previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm's strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification.

1 Introduction

Boosting provides a ready method for improving existing learning algorithms for classification. Taking a weak learner as input, boosters use the weak learner to generate weak hypotheses which are combined into a classification rule more accurate than the weak hypotheses themselves. Boosters such as AdaBoost [1] have shown considerable success in practice.

Most boosters are designed for the batch setting, where the learner trains on a fixed example set. This setting is reasonable for many applications, yet it requires that all examples be collected before training. Moreover, most batch boosters maintain a distribution over the entire training set, making them computationally expensive for very large datasets. To make boosting feasible on larger datasets, learners can be designed for the filtering setting. The batch setting provides the learner with a fixed training set, but the filtering setting provides an oracle which can produce an unlimited number of labeled examples, received one at a time. This idealized model may describe learning problems with on-line example sources, including very large datasets which must be loaded piecemeal into memory. By using new training examples each round, filtering boosters avoid maintaining a distribution over a training set and so may use large datasets much more efficiently than batch boosters.

Freund proposed an early filtering booster which is non-adaptive in that it requires an a priori bound on weak hypothesis error rates and combines weak hypotheses via an unweighted majority vote [2]. Domingo & Watanabe proposed MadaBoost, a modification of AdaBoost for filtering, and showed experimentally that it is competitive with AdaBoost [3]. However, it theoretically requires weak hypotheses' error rates to be monotonically increasing, an assumption which we found to be often violated in practice. Bshouty & Gavinsky proposed another filtering booster, but, like Freund's, their algorithm requires an a priori bound on weak hypothesis error rates [4].

This paper presents FilterBoost, an adaptive boosting-by-filtering algorithm. We show it is applicable to both conditional probability estimation, where the learner predicts the probability of each label given an example, and classification.
In Section 2, we describe the algorithm, after which we interpret it as a stepwise method for fitting an additive logistic regression model for conditional probabilities. In comparison, there seems to be no clear way to apply MadaBoost to conditional probability estimation. We then bound the number of rounds and examples required to achieve any target error $\epsilon \in (0, 1)$. Our bounds resemble those for MadaBoost but do not require the less realistic assumptions and a priori knowledge needed by previous filtering boosters. We also show that FilterBoost can use the confidence-rated predictions from weak hypotheses described by Schapire & Singer [5]. In Section 3, we give results from extensive experiments. For conditional probability estimation, we show that FilterBoost often outperforms batch boosters, which prove less robust to overfitting. For classification, we show that filtering boosters' efficiency on large datasets allows them to achieve higher accuracies faster than batch boosters in many cases.

FilterBoost is based on a modification of AdaBoost by Collins, Schapire & Singer designed to minimize logistic loss [6]. Their batch algorithm has yet to be shown to achieve arbitrarily low test error, but we use techniques similar to those of MadaBoost to adapt the algorithm to the filtering setting and prove generalization bounds. The result is an adaptive algorithm with realistic assumptions and strong theoretical properties. Its robustness and efficiency on large datasets make it competitive with existing methods for both conditional probability estimation and classification.

2 The FilterBoost Algorithm

Let X be the set of examples and Y a discrete set of labels. For simplicity, assume X is countable, and consider only binary labels Y = {-1, +1}. We assume there exists an unknown target distribution D over labeled examples $(x, y) \in X \times Y$ from which training and test examples are generated. The goal in classification is to choose a hypothesis $h : X \to Y$ which minimizes the classification error $\Pr_D[h(x) \neq y]$, where the subscript indicates that the probability is with respect to $(x, y)$ sampled randomly from D.

In the batch setting, a booster is given a fixed training set S and a weak learner which, given any distribution $D_t$ over the training examples S, is guaranteed to return a weak hypothesis $h_t : X \to \mathbb{R}$ such that the error $\epsilon_t \equiv \Pr_{D_t}[\mathrm{sign}(h_t(x)) \neq y] < 1/2$. For T rounds t, the booster builds a distribution $D_t$ over S, runs the weak learner on S and $D_t$, and receives $h_t$. The booster usually then estimates $\epsilon_t$ using S and weights $h_t$ with $\alpha_t = \alpha_t(\epsilon_t)$. After T rounds, the booster outputs a final hypothesis H which is a linear combination of the weak hypotheses (e.g. $H(x) = \sum_t \alpha_t h_t(x)$). The sign of $H(x)$ indicates the predicted label $\hat{y}$ for x.

Two key elements of boosting are constructing $D_t$ over S and weighting weak hypotheses. $D_t$ is built such that misclassified examples receive higher weights than in $D_{t-1}$, eventually forcing the weak learner to classify previously poorly classified examples correctly. Weak hypotheses $h_t$ are generally weighted such that hypotheses with lower errors receive higher weights.

2.1 Boosting-by-Filtering

We describe a general framework for boosting-by-filtering which includes MadaBoost [3] and our algorithm FilterBoost. The filtering setting assumes the learner has access to an example oracle, allowing it to use entirely new examples sampled i.i.d. from D on each round.
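For concreteness, the sketch below shows how a batch booster such as AdaBoost [1] updates $D_t$ over a fixed training set S, which is exactly the bookkeeping the filtering setting avoids. The exponential update is AdaBoost's particular rule and appears here only as an illustration; the function name is ours.

```python
import math

def adaboost_reweight(dist, predictions, labels, alpha):
    """One AdaBoost-style update of the distribution over a fixed training set:
    D_{t+1}(i) is proportional to D_t(i) * exp(-alpha * y_i * h_t(x_i)),
    so misclassified examples (y_i * h_t(x_i) < 0) gain weight."""
    new_dist = [d * math.exp(-alpha * y * p)
                for d, p, y in zip(dist, predictions, labels)]
    z = sum(new_dist)                 # normalization constant
    return [d / z for d in new_dist]
```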
However, while maintaining the distribution $D_t$ is straightforward in the batch setting, there is no fixed set S on which to define $D_t$ in filtering. Instead, the booster simulates examples drawn from $D_t$ by drawing examples from D via the oracle and reweighting them according to $D_t$. Filtering boosters generally accept each example $(x, y)$ from the oracle for training on round t with probability proportional to the example's weight $D_t(x, y)$. The mechanism which accepts examples from the oracle with some probability is called the filter. Thus, on each round, a boosting-by-filtering algorithm draws a set of examples from $D_t$ via the filter, trains the weak learner on this set, and receives a weak hypothesis $h_t$. Though a batch booster would estimate $\epsilon_t$ using the fixed set S, filtering boosters may use new examples from the filter. Like batch boosters, filtering boosters may weight $h_t$ using $\alpha_t = \alpha_t(\epsilon_t)$, and they output a linear combination of $h_1, \ldots, h_T$ as a final hypothesis. The filtering setting allows the learner to estimate the error of $H_t$ to arbitrary precision by sampling from D via the oracle. Both MadaBoost and FilterBoost use the examples sampled from the oracle by the filter to decide when $H_t$ is accurate enough to stop boosting.

2.2 FilterBoost

FilterBoost, given in Figure 1, is modeled after the aforementioned algorithm by Collins et al. [6] and MadaBoost [3]. Given an example oracle, a weak learner, a target error $\epsilon \in (0, 1)$, and a confidence parameter $\delta \in (0, 1)$ upper-bounding the probability of failure, it iterates until the current combined hypothesis $H_t$ has error $\leq \epsilon$. On round t, FilterBoost draws $m_t$ examples from the filter to train the weak learner and get $h_t$. The number $m_t$ must be large enough to ensure $h_t$ has error $\epsilon_t < 1/2$ with high probability. The edge of $h_t$ is $\gamma_t = 1/2 - \epsilon_t$; this edge is estimated by the function getEdge(), discussed below, and is used to set $h_t$'s weight $\alpha_t$. The current combined hypothesis is defined as $H_t = \mathrm{sign}(\sum_{t'=1}^{t} \alpha_{t'} h_{t'})$.

The Filter() function generates $(x, y)$ from $D_t$ by repeatedly drawing $(x, y)$ from the oracle, calculating the weight $q_t(x, y) \propto D_t(x, y)$, and accepting $(x, y)$ with probability $q_t(x, y)$. The function getEdge() uses a modification of the Nonmonotonic Adaptive Sampling method of Watanabe [7] and Domingo, Gavaldà & Watanabe [8]. Their algorithm draws an adaptively chosen number of examples from the filter and returns an estimate $\hat{\gamma}_t$ of the edge of $h_t$ within relative error $\tau$ of the true edge $\gamma_t$ with high probability. The getEdge() function revises this estimate as $\hat{\gamma}_t \leftarrow \hat{\gamma}_t / (1 + \tau)$.

Algorithm FilterBoost: accepts Oracle(), $\epsilon$, $\delta$, $\tau$
  Define $F_t(x) \equiv \sum_{t'=1}^{t-1} \alpha_{t'} h_{t'}(x)$
  For t = 1, 2, 3, ...
    $\delta_t \leftarrow \delta / (3t(t+1))$
    Call Filter(t, $\delta_t$, $\epsilon$) to get $m_t$ examples to train WL; get $h_t$
    $\hat{\gamma}_t \leftarrow$ getEdge(t, $\tau$, $\delta_t$, $\epsilon$)
    $\alpha_t \leftarrow \frac{1}{2} \ln \frac{1/2 + \hat{\gamma}_t}{1/2 - \hat{\gamma}_t}$
  (Define $H_t(x) = \mathrm{sign}(F_{t+1}(x))$. Algorithm exits from the Filter() function.)

Function Filter(t, $\delta_t$, $\epsilon$): returns $(x, y)$
  Define r = # calls to Filter so far on round t
  $\delta'_t \leftarrow \delta_t / (r(r+1))$
  For (i = 0; i < $\frac{2}{\epsilon} \ln \frac{1}{\delta'_t}$; i = i + 1):
    $(x, y) \leftarrow$ Oracle()
    $q_t(x, y) \leftarrow \frac{1}{1 + e^{y F_t(x)}}$
    Return $(x, y)$ with probability $q_t(x, y)$
  End algorithm; return $H_{t-1}$

Function getEdge(t, $\tau$, $\delta_t$, $\epsilon$): returns $\hat{\gamma}_t$
  Let $m \leftarrow 0$, $n \leftarrow 0$, $u \leftarrow 0$, $\alpha \leftarrow \infty$
  While ($|u| < \alpha (1 + 1/\tau)$):
    $(x, y) \leftarrow$ Filter(t, $\delta_t$, $\epsilon$)
    $n \leftarrow n + 1$
    $m \leftarrow m + I(h_t(x) = y)$
    $u \leftarrow m/n - 1/2$
    $\alpha \leftarrow \sqrt{\frac{1}{2n} \ln \frac{n(n+1)}{\delta_t}}$
  Return $u / (1 + \tau)$

Figure 1: The algorithm FilterBoost.
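To make the control flow concrete, here is a minimal runnable sketch of the main loop and the filter's rejection sampling under our own naming; the oracle, the weak learner, and the edge-estimation routine are supplied by the caller, and the stopping condition of Figure 1 (exiting from within Filter()) is omitted for brevity.

```python
import math
import random

def q(F, x, y):
    """FilterBoost's acceptance probability q_t(x, y) = 1 / (1 + exp(y * F(x)))."""
    return 1.0 / (1.0 + math.exp(min(50.0, y * F(x))))   # clamp to avoid overflow

def filter_example(oracle, F):
    """Simulate a draw from D_t: sample (x, y) from the oracle and accept it
    with probability q_t(x, y); otherwise reject and draw again."""
    while True:
        x, y = oracle()                    # y is assumed to be in {-1, +1}
        if random.random() < q(F, x, y):
            return x, y

def filterboost(oracle, weak_learner, estimate_edge, num_rounds, m_t=300):
    """Sketch of the main loop: train h_t on filtered examples, estimate its edge,
    and weight it by alpha_t = 0.5 * ln((1/2 + edge) / (1/2 - edge))."""
    hypotheses, alphas = [], []
    F = lambda x: sum(a * h(x) for h, a in zip(hypotheses, alphas))
    for t in range(num_rounds):
        sample = [filter_example(oracle, F) for _ in range(m_t)]
        h_t = weak_learner(sample)
        edge = estimate_edge(h_t, oracle, F)      # edge = 1/2 - error under D_t
        alpha_t = 0.5 * math.log((0.5 + edge) / (0.5 - edge))
        hypotheses.append(h_t)
        alphas.append(alpha_t)
    return lambda x: 1 if F(x) >= 0 else -1       # H_T(x) = sign(F_{T+1}(x))
```

A hypothetical call would look like `H = filterboost(my_oracle, my_weak_learner, my_edge_estimator, 100)`, where the three callbacks are whatever oracle, weak learner, and edge estimator the application provides.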
2.3 Analysis: Conditional Probability Estimation

We begin our analysis of FilterBoost by interpreting it as an additive model for logistic regression, for this interpretation will later aid in the analysis for classification. Such models take the form
$\log \frac{\Pr[y = 1|x]}{\Pr[y = -1|x]} = \sum_t f_t(x) = F(x)$, which implies $\Pr[y = 1|x] = \frac{1}{1 + e^{-F(x)}}$,
where, for FilterBoost, $f_t(x) = \alpha_t h_t(x)$. Dropping subscripts, we can write the expected negative log likelihood of example $(x, y)$ after round t as
$\ell(F_t + \alpha_t h_t) = \ell(F + \alpha h) = \mathrm{E}\left[ -\ln \frac{1}{1 + e^{-y(F(x) + \alpha h(x))}} \right] = \mathrm{E}\left[ \ln\left(1 + e^{-y(F(x) + \alpha h(x))}\right) \right].$

Taking a similar approach to the analysis of AdaBoost in [9], we show in the following theorem that FilterBoost performs an approximate stepwise minimization of this negative log likelihood. The proof is in the Appendix.

Theorem 1 Define the expected negative log likelihood $\ell(F + \alpha h)$ as above. Given F, FilterBoost chooses h to minimize a second-order Taylor expansion of $\ell$ around h = 0. Given this h, it then chooses $\alpha$ to minimize an upper bound of $\ell$.

The batch booster of Collins et al. [6] upon which FilterBoost is based is guaranteed to converge to the minimum of this objective when working over a finite sample. Note that FilterBoost uses weak learners which are simple classifiers to perform regression. AdaBoost too may be interpreted as an additive logistic regression model, of the form $\Pr[y = 1|x] = \frac{1}{1 + e^{-2F(x)}}$ with $\mathrm{E}[\exp(-y F(x))]$ as the optimization objective [9]. It is not clear how to apply MadaBoost to conditional probability estimation since it uses a non-differentiable objective.

2.4 Analysis: Classification

In this section, we interpret FilterBoost as a traditional boosting algorithm for classification and prove bounds on its generalization error. We first give a theorem relating $\mathrm{err}_t$, the error rate over the target distribution D of the combined hypothesis $H_{t-1} = \mathrm{sign}(F_t)$ available at the start of round t, to $p_t$, the probability with which the filter accepts a random example generated by the oracle on round t.

Theorem 2 Let $\mathrm{err}_t = \Pr_D[H_{t-1}(x) \neq y]$, and let $p_t = \mathrm{E}_D[q_t(x, y)]$. Then $\mathrm{err}_t \leq 2 p_t$.

Proof: $\mathrm{err}_t = \Pr_D[H_{t-1}(x) \neq y] = \Pr_D[y F_t(x) \leq 0] = \Pr_D[q_t(x, y) \geq 1/2] \leq 2 \cdot \mathrm{E}_D[q_t(x, y)] = 2 p_t$ (using Markov's inequality in the last step).

We next use the expected negative log likelihood from Section 2.3 as an auxiliary function to aid in bounding the required number of boosting rounds. Viewing $\ell$ as a function of the boosting round t, we can write $\ell_t = -\sum_{(x,y)} D(x, y) \ln(1 - q_t(x, y))$. Our goal is then to minimize $\ell_t$, and the following lemma captures the learner's progress in terms of the decrease in $\ell_t$ on each round. This lemma assumes the edge estimates returned by getEdge() are exact, i.e. $\hat{\gamma}_t = \gamma_t$, which leads to a simpler bound on T in Theorem 3. We then consider the error in edge estimates and give a revised bound in Lemma 2 and Theorem 5. The proofs of Lemmas 1 and 2 are in the Appendix.

Lemma 1 Assume for all t that $\gamma_t \neq 0$ and $\gamma_t$ is estimated exactly. Let $\ell_t = -\sum_{(x,y)} D(x, y) \ln(1 - q_t(x, y))$. Then
$\ell_t - \ell_{t+1} \geq p_t \left(1 - 2\sqrt{1/4 - \gamma_t^2}\right).$

Combining Theorem 2, which bounds the error of the current combined hypothesis in terms of $p_t$, with Lemma 1 gives the following upper bound on the required rounds T.

Theorem 3 Let $\gamma = \min_t |\gamma_t|$, and let $\epsilon$ be the target error. Given Lemma 1's assumptions, if FilterBoost runs
$T > \frac{2 \ln(2)}{\epsilon \left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$
rounds, then $\mathrm{err}_t < \epsilon$ for some t, $1 \leq t \leq T$. In particular, this is true for $T > \frac{\ln(2)}{\epsilon \gamma^2}$.

Proof: For all $(x, y)$, since $F_1(x) = 0$, we have $q_1(x, y) = 1/2$ and $\ell_1 = \ln(2)$. Now, suppose $\mathrm{err}_t > \epsilon$ for all $t \in \{1, \ldots, T\}$.
Then, from Theorem 2, $p_t \geq \epsilon/2$, so Lemma 1 gives
$\ell_t - \ell_{t+1} \geq \frac{\epsilon}{2}\left(1 - 2\sqrt{1/4 - \gamma^2}\right).$
Unraveling this recursion as $\sum_{t=1}^{T} (\ell_t - \ell_{t+1}) = \ell_1 - \ell_{T+1} \leq \ell_1$ gives
$T \leq \frac{2 \ln(2)}{\epsilon \left(1 - 2\sqrt{1/4 - \gamma^2}\right)}.$
So $\mathrm{err}_t > \epsilon$ for all $t \in \{1, \ldots, T\}$ is contradicted if T exceeds the theorem's lower bound. The simplified bound follows from the first bound via the inequality $1 - \sqrt{1 - x} \geq x/2$ for $x \in [0, 1]$.

Theorem 3 shows FilterBoost can reduce generalization error to any $\epsilon \in (0, 1)$, but we have thus far overlooked the probabilities of failure introduced by three steps: training the weak learner, deciding when to stop boosting, and estimating edges. We bound the probability of each of these steps failing on round t with a confidence parameter $\delta_t = \frac{\delta}{3t(t+1)}$ so that a simple union bound ensures the probability of some step failing is at most FilterBoost's confidence parameter $\delta$. Finally, we revise Lemma 1 and Theorem 3 to account for error in estimating edges.

The number $m_t$ of examples used to train the weak learner must be large enough to ensure that weak hypothesis $h_t$ has a non-zero edge. We overlook this issue since the choice of $m_t$ depends on the choice of weak learner.

To decide when to stop boosting (i.e. when $\mathrm{err}_t \leq \epsilon$), we can use Theorem 2, which upper-bounds the error of the current combined hypothesis in terms of the probability $p_t$ that Filter() accepts a random example from the oracle. If the filter rejects enough examples in a single call, we know $p_t$ is small, so the combined hypothesis is accurate enough. Theorem 4 formalizes this intuition; the proof is in the Appendix.

Theorem 4 In a single call to Filter(t), if n examples have been rejected, where $n \geq \frac{2}{\epsilon} \ln(1/\delta'_t)$, then $\mathrm{err}_t \leq \epsilon$ with probability at least $1 - \delta'_t$.

Theorem 4 provides a stopping condition which is checked on each call to Filter(). Each check may fail with probability at most $\delta'_t = \frac{\delta_t}{r(r+1)}$ on the r-th call to Filter(), so that a union bound ensures FilterBoost stops prematurely on round t with probability at most $\delta_t$. Theorem 4 uses a similar argument to that used for MadaBoost, giving similar stopping criteria for both algorithms.

We estimate weak hypotheses' edges $\gamma_t$ using the Nonmonotonic Adaptive Sampling (NAS) algorithm [7,8] which MadaBoost uses. To compute an estimate $\hat{\gamma}_t$ of the true edge $\gamma_t$ within relative error $\tau \in (0, 1)$ with probability at least $1 - \delta_t$, the NAS algorithm requires at most $\frac{2(1+\tau)^2}{(\tau \gamma_t)^2} \ln\frac{1}{\delta_t}$ filtered examples. (If the booster has an a priori lower bound on all $\gamma_t$, the number of examples may be chosen using a Hoeffding-style bound, as Bshouty & Gavinsky's filtering booster does [4].) With these guarantees on the error in estimating edges, we can rewrite Lemma 1 as the following lemma.

Lemma 2 Assume for all t that $\gamma_t \neq 0$ and $\gamma_t$ is estimated to within relative error $\tau \in (0, 1)$. Let $\ell_t = -\sum_{(x,y)} D(x, y) \ln(1 - q_t(x, y))$. Then
$\ell_t - \ell_{t+1} \geq p_t \left(1 - 2\sqrt{1/4 - \gamma_t^2 \left(\tfrac{1-\tau}{1+\tau}\right)^2}\right).$

Using Lemma 2, the following theorem modifies Theorem 3 to account for error in edge estimates.

Theorem 5 Let $\gamma = \min_t |\gamma_t|$. Let $\epsilon$ be the target error. Given Lemma 2's assumptions, if FilterBoost runs
$T > \frac{2 \ln(2)}{\epsilon \left(1 - 2\sqrt{1/4 - \gamma^2 \left(\tfrac{1-\tau}{1+\tau}\right)^2}\right)}$
rounds, then $\mathrm{err}_t < \epsilon$ for some t, $1 \leq t \leq T$.

The bounds from Theorems 3 and 5 show FilterBoost requires at most $O(\epsilon^{-1} \gamma^{-2})$ boosting rounds. MadaBoost [3] resembles FilterBoost but uses the truncated exponential loss $q_t(x, y) = \min\{1, \exp(-y F_{t-1}(x))\}$ instead of the logistic loss $q_t(x, y) = (1 + \exp(y F_t(x)))^{-1}$ used by FilterBoost. The algorithms' analyses differ, with MadaBoost requiring the edges $\gamma_t$ to be monotonically decreasing, but both lead to similar bounds on T proportional to $\epsilon^{-1}$.
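The NAS-based edge estimation described above can be sketched as follows, continuing the Python sketch from Section 2.2; the loop mirrors getEdge() in Figure 1, and the parameter defaults and names are ours.

```python
import math

def get_edge(h, draw_filtered, tau=0.1, delta_t=0.01):
    """Sketch of getEdge() from Figure 1 (Nonmonotonic Adaptive Sampling [7, 8]).
    draw_filtered() should return one (x, y) accepted by the filter, e.g. the
    filter_example() sketch above; h is assumed to output a label in {-1, +1}."""
    m, n, u, radius = 0, 0, 0.0, float("inf")
    while abs(u) < radius * (1.0 + 1.0 / tau):
        x, y = draw_filtered()
        n += 1
        m += 1 if h(x) == y else 0
        u = m / n - 0.5                    # empirical edge on filtered examples
        radius = math.sqrt(math.log(n * (n + 1) / delta_t) / (2.0 * n))
    return u / (1.0 + tau)                 # revised (deflated) estimate
```

To plug this into the earlier main-loop sketch, one could wrap it as `estimate_edge = lambda h, oracle, F: get_edge(h, lambda: filter_example(oracle, F))`.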
These bounds on T are significantly worse than those for the non-adaptive filtering boosters proposed by Freund [2] and by Bshouty & Gavinsky [4] and for the batch booster AdaBoost [1], all three of which require $O(\log(\epsilon^{-1}))$ rounds. We did not test the filtering boosters of Freund [2] or Bshouty & Gavinsky [4]. Based on the success of adaptive boosting, where weak hypotheses are weighted according to their accuracies, the adaptive FilterBoost and MadaBoost are likely improvements over Freund's booster, which combines weak hypotheses by a majority vote. Bshouty & Gavinsky's booster requires an a priori lower bound on the edges $\gamma_t$. Despite FilterBoost and MadaBoost's worse bounds on T, these filtering algorithms are much more computationally efficient, and so often faster at learning, on large datasets than batch learners like AdaBoost, as is evidenced by our results in Section 3.

An alternate bound for FilterBoost may be derived using techniques from Shalev-Shwartz & Singer [10]. They use the framework of convex repeated games to define a general method for bounding the performance of online and boosting algorithms. For FilterBoost, their techniques, combined with Theorem 2, give a bound similar to that in Theorem 3 but proportional to $\epsilon^{-2}$ instead of $\epsilon^{-1}$.

Schapire & Singer [5] show AdaBoost benefits from confidence-rated predictions, where weak hypotheses return predictions whose absolute values indicate confidence. These values are chosen to greedily minimize AdaBoost's exponential loss function over the training data, and this aggressive weighting can result in faster learning. FilterBoost may use confidence-rated predictions in an identical manner. In the proof of Lemma 1, the decrease in the negative log likelihood $\ell_t$ of the data (relative to $H_t$ and the target distribution D) is lower-bounded by $p_t - p_t \sum_{(x,y)} D_t(x, y) e^{-\alpha_t y h_t(x)}$. Since $p_t$ is fixed, maximizing this bound is equivalent to minimizing the exponential loss over $D_t$.

3 Experiments

Vanilla FilterBoost accepts examples $(x, y)$ from the oracle with probability $q_t(x, y)$, but it may instead accept all examples and weight each with $q_t(x, y)$. Weighting instead of filtering examples increases accuracy but also increases the size of the training set passed to the weak learner. For efficiency, we choose to filter when training the weak learner but weight when estimating edges $\gamma_t$; a sketch of the weighted edge estimate appears at the end of this subsection.

We also modify FilterBoost's getEdge() function for efficiency. The Nonmonotonic Adaptive Sampling (NAS) algorithm used to estimate edges $\gamma_t$ uses many examples, but using several orders of magnitude fewer sacrifices little accuracy. The same is true for MadaBoost. In all tests, we use $C_n \log(t + 1)$ examples to estimate $\gamma_t$, where $C_n = 300$ and the log factor scales the number as the NAS algorithm would. For simplicity, we train weak learners with $C_n \log(t + 1)$ examples as well. These modifications mean $\tau$ (the error in edge estimates) and $\delta$ (the confidence) have no effect on our tests.

To simulate an oracle, we randomly permute the data and use examples in the new order. In practice, filtering boosters can achieve higher accuracy by cycling through training sets again instead of stopping once examples are depleted, and we use this "recycling" in our tests.

We test FilterBoost with and without confidence-rated predictions (labeled "(C-R)" in our results). We compare FilterBoost against MadaBoost [3], for it is one of the only adaptive filtering boosters which does not require an a priori bound on weak hypotheses' edges. We implement it with the same modifications as FilterBoost.
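The weighted edge estimate referred to above can be sketched as follows: every example drawn from the oracle is kept and weighted by $q_t(x, y)$, and the edge is the weighted accuracy minus 1/2. This is our own illustration of the modification, not code from the original implementation.

```python
import math

def weighted_edge(h, examples, F):
    """Estimate the edge of h under D_t without rejection sampling: weight every
    oracle example (x, y) by q_t(x, y) = 1 / (1 + exp(y * F(x))) and compute the
    weighted accuracy minus 1/2."""
    total, correct = 0.0, 0.0
    for x, y in examples:
        w = 1.0 / (1.0 + math.exp(min(50.0, y * F(x))))   # clamp to avoid overflow
        total += w
        correct += w * (1.0 if h(x) == y else 0.0)
    return correct / total - 0.5
```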
We test FilterBoost against two batch boosters: the well-studied and historically successful AdaBoost [1] and the algorithm from Collins et al. [6] which is essentially a batch version of FilterBoost (labeled "AdaBoost-LOG"). We test both with and without confidence-rated predictions as well as with and without resampling (labeled "(resamp)"). In resampling, the booster trains weak learners on small sets of examples sampled from the distribution $D_t$ over the training set S rather than on the entire set S, and this technique often increases efficiency with little effect on accuracy. Our batch boosters use sets of size $C_m \log(t + 1)$ for training, like the filtering boosters, but use all of S to estimate edges $\gamma_t$ since this can be done efficiently. We test the batch boosters using confidence-rated predictions and resampling in order to compare FilterBoost with batch algorithms optimized for the efficiency which boosting-by-filtering claims as its goal.

We test each booster using decision stumps and decision trees as weak learners to discern the effects of simple and complicated weak hypotheses. The decision stumps minimize training error, and the decision trees greedily maximize information gain and are pruned using 1/3 of the data. Both weak learners minimize exponential loss when outputting confidence-rated predictions.

We use four datasets, described in the Appendix. Briefly, we use two synthetic sets: Majority (majority vote) and Twonorm [11], and two real sets from the UCI Machine Learning Repository [12]: Adult (census data; from Ron Kohavi) and Covertype (forestry data with 7 classes merged to 2; Copyr. Jock A. Blackard & Colorado State U.). We average over 10 runs, using new examples for synthetic data (with 50,000 test examples except where stated) and cross validation for real data.

Figure 2: Running times: Ada/Filter/MadaBoost. Majority; WL = stumps.

Figure 2 compares the boosters' runtimes. As expected, filtering boosters run slower per round than batch boosters on small datasets but much faster on large ones. Interestingly, filtering boosters take longer on very small datasets in some cases (not shown), for the probability that the filter accepts an example quickly shrinks once the booster has seen that example many times.

3.1 Results: Conditional Probability Estimation

In Section 2.3, we discussed the interpretation of FilterBoost and AdaBoost as stepwise algorithms for conditional probability estimation. We test both algorithms and the variants discussed above on all four datasets. As Figure 3 shows, both FilterBoost variants are competitive with batch algorithms when boosting decision stumps. With decision trees, all algorithms except FilterBoost overfit badly, including FilterBoost(C-R). In each plot, we compare FilterBoost with the best of AdaBoost and AdaBoost-LOG: AdaBoost was best with decision stumps and AdaBoost-LOG with decision trees.

Figure 3: Log (base e) loss & root mean squared error (RMSE). Majority; 10,000 train exs. Left two: WL = stumps (FilterBoost vs. AdaBoost); Right two: WL = trees (FilterBoost vs. AdaBoost-LOG).

For comparison, batch logistic regression via gradient descent achieves RMSE 0.3489 and log (base e) loss 0.4259; FilterBoost, interpretable as a stepwise method for logistic regression, seems to be approaching these values asymptotically.
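For reference, the two metrics reported in Figure 3 can be computed from the boosted score F(x) as in the sketch below. The probability estimate uses the logistic link from Section 2.3; treating the RMSE target as 1 for y = +1 and 0 for y = -1 is our assumption, since the exact metric definitions are not spelled out in this section.

```python
import math

def prob_positive(F_x):
    """Conditional probability estimate Pr[y = +1 | x] = 1 / (1 + exp(-F(x)))."""
    return 1.0 / (1.0 + math.exp(-F_x))

def log_loss_and_rmse(scores, labels):
    """Mean natural-log loss ln(1 + exp(-y * F(x))) and RMSE of the probability
    estimates against 0/1 targets, for labels y in {-1, +1}."""
    n = len(scores)
    log_loss = sum(math.log(1.0 + math.exp(-y * F_x))
                   for F_x, y in zip(scores, labels)) / n
    rmse = math.sqrt(sum((prob_positive(F_x) - (1.0 if y > 0 else 0.0)) ** 2
                         for F_x, y in zip(scores, labels)) / n)
    return log_loss, rmse
```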
On Adult and Twonorm, FilterBoost generally outperforms the batch boosters, which tend to overfit when boosting decision trees, though AdaBoost slightly outperforms FilterBoost on smaller datasets when boosting decision stumps. The Covertype dataset is an exception to our results and highlights a danger in filtering and in resampling for batch learning: the complicated structure of some datasets seems to require decision trees to train on the entire dataset. With decision stumps, the filtering boosters are competitive, yet only the non-resampling batch boosters achieve high accuracies with decision trees. The first decision tree trained on the entire training set achieves about 94% accuracy, which is unachievable by any of the filtering or resampling batch boosters when using $C_m = 300$ as the base number of examples for training the weak learner. To compete with non-resampling batch boosters, the other boosters must use $C_m$ on the order of $10^5$, by which point they become very inefficient.

Figure 4: FilterBoost vs. MadaBoost.

Figure 5: FilterBoost vs. AdaBoost & AdaBoost-LOG. Majority.

3.2 Results: Classification

Vanilla FilterBoost and MadaBoost perform similarly in classification (Figure 4). Confidence-rated predictions allow FilterBoost to outperform MadaBoost when using decision stumps but sometimes cause FilterBoost to perform poorly with decision trees. Figure 5 compares FilterBoost with the best batch booster for each weak learner. With decision stumps, all boosters achieve higher accuracies with the larger dataset, on which filtering algorithms are much more efficient. Majority is represented well as a linear combination of decision stumps, so the boosters all learn more slowly when using the overly complicated decision trees. However, this problem generally affects filtering boosters less than most batch variants, especially on larger datasets. Adult and Twonorm gave similar results. As in Section 3.1, filtering and resampling batch boosters perform poorly on Covertype. Thus, while FilterBoost is competitive in classification, its best performance is in regression.

References

[1] Freund, Y., & Schapire, R. E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
[2] Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.
[3] Domingo, C., & Watanabe, O. (2000) MadaBoost: a modification of AdaBoost. 13th Annual Conference on Computational Learning Theory (pp. 180-189).
[4] Bshouty, N. H., & Gavinsky, D. (2002) On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 3, 483-506.
[5] Schapire, R. E., & Singer, Y. (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297-336.
[6] Collins, M., Schapire, R. E., & Singer, Y. (2002) Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, 253-285.
[7] Watanabe, O. (2000) Simple sampling techniques for discovery science. IEICE Trans. Information and Systems, E83-D(1), 19-26.
[8] Domingo, C., Gavaldà, R., & Watanabe, O. (2002) Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6, 131-152.
[9] Friedman, J., Hastie, T., & Tibshirani, R. (2000) Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28, 337-407.
[10] Shalev-Shwartz, S., & Singer, Y. (2006) Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems 20.
[11] Breiman, L. (1998) Arcing classifiers. The Annals of Statistics, 26, 801-849.
[12] Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998) UCI Repository of machine learning databases [http://www.ics.uci.edu/mlearn/MLRepository.html]. Irvine, CA: U. of California, Dept. of Information & Computer Science.