Optimal ROC Curve for a Combination of Classifiers

Marco Barreno (barreno@cs.berkeley.edu), Computer Science Division, UC Berkeley
Álvaro A. Cárdenas (cardenas@cs.berkeley.edu), Computer Science Division, UC Berkeley
J. D. Tygar (tygar@cs.berkeley.edu), Computer Science Division, UC Berkeley

Abstract

We present a new analysis for the combination of binary classifiers. Our analysis uses the Neyman-Pearson lemma as a theoretical basis for analyzing combinations of classifiers. In particular, we give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves.

1 Introduction

We present an optimal way to combine binary classifiers in the Neyman-Pearson sense: for a given upper bound on false alarms (false positives), we find the set of combination rules maximizing the detection rate (true positives). In this way, we form the optimal ROC curve of a combination of classifiers. This paper makes the following original contributions: (1) We present a new method for finding the meta-classifier with the optimal ROC curve. (2) We show how our framework can be used to interpret, generalize, and improve previous work by Provost and Fawcett [1] and Flach and Wu [2]. (3) We present experimental results that show our method is practical and performs well, even when we must estimate the distributions from insufficient data. In addition, we prove the following results: (1) We show that the optimal ROC curve is composed in general of 2^n + 1 different decision rules and of the interpolation between these rules (over the space of 2^(2^n) possible Boolean combinations of n classifiers). (2) We prove that our method is optimal in this space. (3) We prove that the Boolean AND and OR rules are always part of the optimal set for the special case of independent classifiers (though in general we make no independence assumptions). (4) We prove a sufficient condition for Provost and Fawcett's method to be optimal.

2 Background

Consider classification problems where examples from a space of inputs X are associated with binary labels {0, 1} and there is a fixed but unknown probability distribution P(x, c) over examples (x, c) ∈ X × {0, 1}. H0 and H1 denote the events that c = 0 and c = 1, respectively. A binary classifier is a function f : X → {0, 1} that predicts labels on new inputs. When we use the term "classifier" in this paper we mean binary classifier. We address the problem of combining results from n base classifiers f1, f2, ..., fn. Let Yi = fi(X) be a random variable indicating the output of classifier fi, and let Y = (Y1, Y2, ..., Yn) ∈ {0, 1}^n. We can characterize the performance of classifier fi by its detection rate (also true positive rate, or power) PDi = Pr[Yi = 1 | H1] and its false alarm rate (also false positive rate, or size) PFi = Pr[Yi = 1 | H0]. In this paper we are concerned with proper classifiers, that is, classifiers for which PDi > PFi. We sometimes omit the subscript i.

The Receiver Operating Characteristic (ROC) curve plots PF on the x-axis and PD on the y-axis (the ROC space). The point (0, 0) represents a classifier that always returns 0, the point (1, 1) represents a classifier that always returns 1, and the point (0, 1) represents perfect classification. If one classifier's curve has no points below another's, it weakly dominates the latter. If no points are below and at least one point is strictly above, it dominates it.
The line y = x describes a classifier that is no better than chance, and every proper classifier dominates this line. When an ROC curve consists of a single point, we connect it with straight lines to (0, 0) and (1, 1) in order to compare it with others (see Lemma 1). In this paper, we focus on base classifiers that occupy a single point in ROC space. Many classifiers have tunable parameters and can produce a continuous ROC curve; our analysis applies to these cases as well, by choosing representative points and treating each one as a separate classifier.

2.1 The ROC convex hull

Provost and Fawcett [1] give a seminal result on the use of ROC curves for combining classifiers. They suggest taking the convex hull of all points of the ROC curves of the classifiers. This ROC convex hull (ROCCH) combination rule interpolates between the base classifiers f1, f2, ..., fn, selecting, for every false alarm rate, either (1) a single best classifier or (2) a randomization between the decisions of two classifiers [1]. This approach, however, is not optimal: as pointed out in later work by Fawcett, the Boolean AND and OR rules over classifiers can often perform better than the ROCCH [3]. However, the AND and OR rules are only 2 of the 2^(2^n) possible Boolean combinations of the outputs of n base classifiers. Our work addresses the selection of the optimal combinations.

2.2 Neyman-Pearson theory

In this section we introduce Neyman-Pearson theory from the framework of statistical hypothesis testing [4, 5], which forms the basis of our analysis. We wish to test a null hypothesis H0 against an alternative H1. Let the random variable Y have probability distributions P(Y|H0) under H0 and P(Y|H1) under H1, and define the likelihood ratio λ(Y) = P(Y|H1)/P(Y|H0). The Neyman-Pearson lemma states that the likelihood ratio test

            1  if λ(Y) > τ
    D(Y) =  γ  if λ(Y) = τ        (1)
            0  if λ(Y) < τ

for some τ ∈ (0, ∞) and γ ∈ [0, 1], is a most powerful test for its size: no other test has higher PD = Pr[D(Y) = 1 | H1] for the same bound on PF = Pr[D(Y) = 1 | H0]. (When λ(Y) = τ, D = 1 with probability γ and 0 otherwise.) Given a test size α, we maximize PD subject to PF ≤ α by choosing τ and γ as follows. First we find the smallest value τ such that Pr[λ(Y) > τ | H0] ≤ α. To maximize PD, which is monotonically nondecreasing with PF, we choose the highest value γ that satisfies Pr[D(Y) = 1 | H0] = Pr[λ(Y) > τ | H0] + γ Pr[λ(Y) = τ | H0] ≤ α, finding γ = (α − Pr[λ(Y) > τ | H0]) / Pr[λ(Y) = τ | H0].

3 The optimal ROC curve for a combination of classifiers

We characterize the optimal ROC curve for a decision based on a combination of arbitrary classifiers: for any given bound on PF, we maximize PD. We frame this problem as a Neyman-Pearson hypothesis test parameterized by the choice of α. We assume nothing about the classifiers except that each produces an output in {0, 1}. In particular, we do not assume the classifiers are independent or related in any way.

Before introducing our method we analyze the one-classifier case (n = 1).

Lemma 1 Let f1 be a classifier with performance probabilities PD1 and PF1. Its optimal ROC curve is a piecewise linear function parameterized by a free parameter α bounding PF: for α < PF1, PD(α) = α(PD1/PF1), and for α > PF1, PD(α) = [(1 − PD1)/(1 − PF1)](α − PF1) + PD1.

Proof. When α < PF1, we can obtain a likelihood ratio test by setting τ = λ(1) and γ = α/PF1, and for α > PF1, we set τ = λ(0) and γ = (α − PF1)/(1 − PF1). □
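The threshold selection of Section 2.2 is mechanical enough to sketch in code. The following is a minimal illustration, not the authors' implementation: it chooses (τ, γ) for a target size α from the outcome likelihood ratios and their H0 probabilities, and the function names (np_threshold, decide) are ours. Applied to a single classifier with outcomes {0, 1}, it reproduces the randomization used in the proof of Lemma 1.

```python
import numpy as np

def np_threshold(lr, p0, alpha):
    """Pick (tau, gamma) for a randomized likelihood-ratio test of size alpha.

    lr    -- likelihood ratio lambda(y) for each possible outcome y
    p0    -- Pr[Y = y | H0] for the same outcomes
    alpha -- target false-alarm rate (test size)
    """
    lr, p0 = np.asarray(lr, float), np.asarray(p0, float)
    # Smallest tau such that Pr[lambda(Y) > tau | H0] <= alpha.
    for tau in sorted(set(lr)):
        p_above = p0[lr > tau].sum()    # Pr[lambda(Y) > tau | H0]
        p_at = p0[lr == tau].sum()      # Pr[lambda(Y) = tau | H0]
        if p_above <= alpha:
            gamma = 0.0 if p_at == 0.0 else (alpha - p_above) / p_at
            return tau, min(max(gamma, 0.0), 1.0)
    return float("inf"), 0.0            # unreachable for a nonempty outcome list

def decide(lr_y, tau, gamma, rng=np.random.default_rng(0)):
    """Equation 1: output 1 above tau, randomize with probability gamma at tau."""
    if lr_y > tau:
        return 1
    if lr_y == tau:
        return int(rng.random() < gamma)
    return 0
```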
The intuitive interpretation of this result is that to decrease or increase the false alarm rate of the classifier, we randomize between using its predictions and always choosing 0 or 1, respectively. In ROC space, this forms lines interpolating between (PF1, PD1) and (0, 0) or (1, 1), respectively.

To generalize this result to the combination of n classifiers, we require the distributions P(Y|H0) and P(Y|H1). With this information we then compute the likelihood ratios λ(y) for all outcomes y ∈ {0, 1}^n. Let L be the list of likelihood ratios ranked from low to high.

Lemma 2 Given any 0 ≤ α ≤ 1, the ordering L determines parameters τ and γ for a likelihood ratio test of size α.

Lemma 2 essentially sets up a classification rule for each interval between likelihood ratios in L and interpolates between them to create a test with size exactly α. Our meta-classifier does this for any given bound on its false positive rate, then makes predictions according to Equation 1. To find the ROC curve for our meta-classifier, we plot PD against PF for all 0 ≤ α ≤ 1. In particular, for each y ∈ {0, 1}^n we can compute Pr[λ(Y) > λ(y) | H0], which gives us one value for α and a point in ROC space (PF and PD follow directly from L). Each λ(y) will turn out to be the slope of a line segment between adjacent vertices, and varying γ interpolates between the vertices. We call the ROC curve obtained in this way the LR-ROC.

Theorem 1 The LR-ROC weakly dominates the ROC curve of any possible combination given by a Boolean function g : {0, 1}^n → {0, 1} over the outputs of n classifiers.

Proof. Let α be the probability of false alarm PF for g. Let τ and γ be chosen for a test of size α. Then our meta-classifier's decision rule is a likelihood ratio test. By the Neyman-Pearson lemma, no other test has higher power for any given size. Since ROC space plots power on the y-axis and size on the x-axis, this means that the PD for g at PF = α cannot be higher than that of the LR-ROC. Since this is true at any α, the LR-ROC weakly dominates the ROC curve for g. □

3.1 Practical considerations

To compute all likelihood ratios for the classifier outcomes we need to know the probability distributions P(Y|H0) and P(Y|H1). In practice these distributions must be estimated. The simplest method is to run the base classifiers on a training set and count occurrences of each outcome. It is likely that many outcomes will not occur in the training set, or will occur only a small number of times. Our initial approach to dealing with small or zero counts when estimating was to use add-one smoothing. In our experiments, however, simple special-case treatment of zero counts always produced better results than smoothing, both on the training set and on the test set. See Section 5 for details.

Furthermore, the optimal ROC curve may have a different likelihood ratio for each possible outcome from the n classifiers, and therefore a different point in ROC space, so optimal ROC curves in general will have 2^n points. This implies an exponential (in the number of classifiers) lower bound on the running time of any algorithm that computes the optimal ROC curve for a combination of classifiers. For a handful of classifiers, such a bound is not problematic, but it is impractical to compute the optimal ROC curve for dozens or hundreds of classifiers. (However, by computing and sorting the likelihood ratios we avoid a 2^(2^n)-time search over all possible classification functions.)
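As a concrete illustration of the construction in this section and the counting-based estimation just described, the sketch below builds the LR-ROC vertices from labeled base-classifier outcomes. It is our own simplified reading of the method, with our own function name (lr_roc) and interface; ties between likelihood ratios and the zero-count conventions of Section 5 are only lightly handled.

```python
from collections import Counter

def lr_roc(outcomes, labels):
    """Estimate the LR-ROC vertices from training data.

    outcomes -- iterable of n-tuples of base-classifier outputs, e.g. (1, 0, 1)
    labels   -- iterable of true classes in {0, 1}
    Returns a list of (PF, PD) vertices from (0, 0) to (1, 1); the full curve
    interpolates linearly between adjacent vertices (the gamma randomization).
    """
    pairs = list(zip(outcomes, labels))
    pos = Counter(y for y, c in pairs if c == 1)
    neg = Counter(y for y, c in pairs if c == 0)
    n1, n0 = sum(pos.values()), sum(neg.values())

    def ratio(y):
        # Outcomes never seen under H0 get a near-infinite ratio (see Section 5).
        return (pos[y] / n1) / (neg[y] / n0) if neg[y] else float("inf")

    # The list L, ranked here from high to low likelihood ratio.
    ranked = sorted(set(pos) | set(neg), key=ratio, reverse=True)

    # Predicting 1 on the top-k outcomes gives the k-th vertex of the LR-ROC.
    vertices, tp, fp = [(0.0, 0.0)], 0, 0
    for y in ranked:
        tp, fp = tp + pos[y], fp + neg[y]
        vertices.append((fp / n0, tp / n1))
    return vertices
```

A production version would merge outcomes that tie on the likelihood ratio into a single segment and apply the zero-count handling of Section 5.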
4 Analysis

4.1 The independent case

In this section we take an in-depth look at the case of two binary classifiers f1 and f2 that are conditionally independent given the input's class, so that P(Y1, Y2 | Hc) = P(Y1 | Hc) P(Y2 | Hc) for c ∈ {0, 1} (this section is the only part of the paper in which we make any independence assumptions). Since Y1 and Y2 are conditionally independent, we do not need the full joint distribution; we need only the probabilities PD1, PF1, PD2, and PF2 to find the combined PD and PF. For example, λ(01) = ((1 − PD1) PD2) / ((1 − PF1) PF2).

Table 1: Two probability distributions.

(a)  Class 1 (H1):                  Class 0 (H0):
            Y1=0    Y1=1                   Y1=0    Y1=1
     Y2=0   0.2     0.375           Y2=0   0.5     0.1
     Y2=1   0.1     0.325           Y2=1   0.3     0.1

(b)  Class 1 (H1):                  Class 0 (H0):
            Y1=0    Y1=1                   Y1=0    Y1=1
     Y2=0   0.2     0.1             Y2=0   0.1     0.3
     Y2=1   0.2     0.5             Y2=1   0.5     0.1

The assumption that f1 and f2 are conditionally independent and proper defines a partial ordering on the likelihood ratios: λ(00) < λ(10) < λ(11) and λ(00) < λ(01) < λ(11). Without loss of generality, we assume λ(00) < λ(01) < λ(10) < λ(11). This ordering breaks the likelihood ratio's range (0, ∞) into five regions; choosing τ in each region defines a different decision rule. The trivial cases 0 < τ < λ(00) and λ(11) < τ < ∞ correspond to always classifying as 1 and 0, respectively; PD and PF are therefore both equal to 1 and both equal to 0, respectively. For the case λ(00) < τ < λ(01), Pr[λ(Y) > τ] = Pr[Y = 01 ∨ Y = 10 ∨ Y = 11] = Pr[Y1 = 1 ∨ Y2 = 1]. Thresholds in this range define an OR rule for the classifiers, with PD = PD1 + PD2 − PD1 PD2 and PF = PF1 + PF2 − PF1 PF2. For the case λ(01) < τ < λ(10), we have Pr[λ(Y) > τ] = Pr[Y = 10 ∨ Y = 11] = Pr[Y1 = 1]. Therefore the performance probabilities are simply PD = PD1 and PF = PF1. Finally, the case λ(10) < τ < λ(11) implies that Pr[λ(Y) > τ] = Pr[Y = 11], and therefore thresholds in this range define an AND rule, with PD = PD1 PD2 and PF = PF1 PF2. Figure 1a illustrates this analysis.

The assumption of conditional independence is a sufficient condition for ensuring that the AND and OR rules improve on the ROCCH for n classifiers, as the following result shows.

Theorem 2 If the outputs Y1, Y2, ..., Yn of n proper binary classifiers are conditionally independent given the instance class, then the points in ROC space for the rules AND (Y1 ∧ Y2 ∧ · · · ∧ Yn) and OR (Y1 ∨ Y2 ∨ · · · ∨ Yn) are strictly above the convex hull of the ROC curves of the base classifiers f1, ..., fn. Furthermore, these Boolean rules belong to the LR-ROC.

Proof. The likelihood ratio of the case in which AND outputs 1 is λ(11···1) = (PD1 PD2 · · · PDn) / (PF1 PF2 · · · PFn). The likelihood ratio of the only case in which OR does not output 1 is λ(00···0) = [(1 − PD1)(1 − PD2) · · · (1 − PDn)] / [(1 − PF1)(1 − PF2) · · · (1 − PFn)]. Now recall that for proper classifiers fi, PDi > PFi and thus (1 − PDi)/(1 − PFi) < 1 < PDi/PFi. It follows that λ(00···0) is the smallest likelihood ratio and λ(11···1) is the largest, and therefore the OR and AND rules are always part of the optimal set of decisions for conditionally independent classifiers. These rules are strictly above the ROCCH: because λ(11···1) > PD1/PF1, and PD1/PF1 is the slope of the line from (0, 0) to the first point in the ROCCH (f1), the AND point must be above the ROCCH. A similar argument holds for OR, since λ(00···0) < (1 − PDn)/(1 − PFn). □
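For the two-classifier independent case just analyzed, the five regions collapse to three non-trivial operating points that can be written down directly from (PD1, PF1, PD2, PF2). The helper below is our own sketch with our own naming; it assumes proper classifiers and the ordering λ(00) < λ(01) < λ(10) < λ(11) used in the text.

```python
def independent_pair_vertices(pd1, pf1, pd2, pf2):
    """LR-ROC vertices for two conditionally independent proper classifiers."""
    and_point = (pf1 * pf2, pd1 * pd2)                          # Y1 AND Y2
    single = (pf1, pd1)                                         # Y1 alone
    or_point = (pf1 + pf2 - pf1 * pf2, pd1 + pd2 - pd1 * pd2)   # Y1 OR Y2
    # Vertices in order of increasing PF; interpolate between them for the curve.
    return [(0.0, 0.0), and_point, single, or_point, (1.0, 1.0)]

# Example: two proper classifiers at (PF, PD) = (0.1, 0.7) and (0.2, 0.8).
print(independent_pair_vertices(0.7, 0.1, 0.8, 0.2))
```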
4.2 Distributions with unexpected rules

We return now to the general case with no independence assumptions. We present two example distributions for the two-classifier case that demonstrate unexpected results. The first distribution appears in Table 1a. The likelihood ratio values are λ(00) = 0.4, λ(10) = 3.75, λ(01) = 1/3, and λ(11) = 3.25, giving us λ(01) < λ(00) < λ(11) < λ(10). The three non-trivial rules correspond to the Boolean functions Y1 ∨ ¬Y2, Y1, and Y1 ∧ ¬Y2. Note that Y2 appears only negated despite being a proper classifier, and both the AND and OR rules are sub-optimal.

The distribution for the second example appears in Table 1b. The likelihood ratios of the possible outcomes are λ(00) = 2.0, λ(10) = 1/3, λ(01) = 0.4, and λ(11) = 5, so λ(10) < λ(01) < λ(00) < λ(11), and the three points defining the optimal ROC curve are given by the rules ¬Y1 ∨ Y2, ¬(Y1 ⊕ Y2), and Y1 ∧ Y2 (see Figure 1b). In this case, an XOR-style rule emerges from the likelihood ratio analysis.

[Figure 1: (a) ROC curve for two conditionally independent classifiers. (b) ROC curve for the distribution in Table 1b. (c) Original ROC curve and optimal ROC curve for the example in Section 4.4. Axes: PF (x-axis) vs. PD (y-axis).]

These examples show that it is not sufficient to use weighted voting rules w1 Y1 + w2 Y2 + · · · + wn Yn with weights wi ∈ (0, ∞) (as some ensemble methods use). A weighted voting rule always has the AND and OR rules in its ROC curve, so it cannot express the optimal rules in all cases.

4.3 Optimality of the ROCCH

We have seen that in many cases there are combination points strictly above the ROCCH of a set of classifiers. As the following result shows, however, there are conditions under which the ROCCH is equivalent to the LR-ROC.

Theorem 3 Consider n classifiers f1, ..., fn. The convex hull of the points (PFi, PDi) together with (0, 0) and (1, 1) (the ROCCH) is an optimal ROC curve for the combination if (Yi = 1) ⇒ (Yj = 1) for i < j and the following ordering is satisfied: λ(00···0) < λ(00···01) < λ(00···011) < · · · < λ(11···1).

Proof. The condition (Yi = 1) ⇒ (Yj = 1) for i < j implies that we only need to consider n + 2 points in ROC space (the two extra points are (0, 0) and (1, 1)). It also implies the following conditions on the joint distribution: Pr[Y1 = 0 ∧ · · · ∧ Yi = 0 ∧ Yj = 1 ∧ · · · ∧ Yn = 1 | H0] = PFj − PFi, where j = i + 1, and Pr[Y1 = 1 ∧ · · · ∧ Yn = 1 | H0] = PF1. With these conditions and the ordering condition on the likelihood ratios, we have Pr[λ(Y) > λ(11···1) | H0] = 0 and Pr[λ(Y) > λ(0···01···1) | H0] = PFi, where the argument has i leading zeros. Therefore, finding the optimal threshold of the likelihood ratio test, we get τ = λ(0···01···1) with i − 1 leading zeros for PF_{i−1} < α < PF_i, and τ = λ(0···01···1) with i leading zeros for PF_i < α < PF_{i+1}. This change in τ implies that the point (PFi, PDi) is a vertex of the LR-ROC: setting α = PFi (thus τ = λ(0···01···1) with i leading zeros and γ = 0) gives Pr[λ(Y) > τ | H0] = PFi and Pr[λ(Y) > τ | H1] = PDi. □

The condition Yi = 1 ⇒ Yj = 1 for i < j is the same inclusion condition Flach and Wu use for repairing an ROC curve [2]. It intuitively represents the performance in ROC space of a single classifier with different operating points. The next section explores this relationship further.
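The orderings quoted in Section 4.2 (and the one used for the repair example in the next section) are easy to check mechanically. The snippet below is our own illustration, not part of the paper's tooling: given a joint outcome distribution, it lists the accepted-outcome sets of the non-trivial LR-ROC rules, reproducing the Y1 ∧ ¬Y2, Y1, Y1 ∨ ¬Y2 sequence for Table 1a.

```python
def lr_rules(p1, p0):
    """Non-trivial LR-ROC rules for a joint distribution over outcomes.

    p1, p0 -- dicts mapping outcome tuples (y1, y2, ...) to P(y|H1), P(y|H0).
    Returns the accepted-outcome set of each rule, from strictest to loosest.
    """
    lam = {y: p1[y] / p0[y] for y in p0}             # likelihood ratios
    order = sorted(lam, key=lam.get, reverse=True)   # high to low
    # The k-th rule predicts 1 exactly on the k highest-ratio outcomes.
    return [set(order[:k]) for k in range(1, len(order))]

# Table 1a: expect [{(1,0)}, {(1,0),(1,1)}, {(1,0),(1,1),(0,0)}],
# i.e. Y1 and not Y2, then Y1, then Y1 or not Y2.
p1a = {(0, 0): 0.2, (1, 0): 0.375, (0, 1): 0.1, (1, 1): 0.325}
p0a = {(0, 0): 0.5, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.1}
print(lr_rules(p1a, p0a))
```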
4.4 Repairing an ROC curve

Flach and Wu give a voting technique that repairs concavities in an ROC curve by generating operating points above the ROCCH [2]. Their intuition is that points underneath the convex hull can be mirrored to appear above the convex hull, in much the same way as an improper classifier can be negated to obtain a proper classifier. Although their algorithm produces better ROC curves, their solution will often yield curves with new concavities (see, for example, Flach and Wu's Figure 4 [2]). Their algorithm has a similar purpose to ours, but theirs is a local greedy optimization technique, while our method performs a global search in order to find the best ROC curve.

Here is an example comparing their method to ours. Consider the following probability distribution on a random variable Y ∈ {0, 1}^2: P((00, 10, 01, 11) | H1) = (0.1, 0.3, 0.0, 0.6) and P((00, 10, 01, 11) | H0) = (0.5, 0.001, 0.4, 0.099). Flach and Wu's method assumes the original ROC curve to be "repaired" has three models (or operating points): f1 predicts 1 when Y ∈ {11}, f2 predicts 1 when Y ∈ {11, 01}, and f3 predicts 1 when Y ∈ {11, 01, 10} (Figure 1c). If we apply Flach and Wu's repair algorithm, the point f2 is corrected to the point f2′; however, the operating points of f1 and f3 remain the same. Our method improves on this result by ordering the likelihood ratios, λ(01) < λ(00) < λ(11) < λ(10), and using that ordering to make three different rules: f1′ predicts 1 when Y ∈ {10}, f2′ predicts 1 when Y ∈ {10, 11}, and f3′ predicts 1 when Y ∈ {10, 11, 00}. We show ROC curves for both methods in Figure 1c.

[Figure 2: Empirical ROC curves for experimental results on four UCI datasets: (a) adult, (b) hypothyroid, (c) sick-euthyroid, (d) sick. Each panel plots PD against PF for the meta-classifier and the base classifiers on the training and test sets ("Meta (train)", "Base (train)", "Meta (test)", "Base (test)"), with PART marked separately.]

5 Experiments

We ran experiments to test the performance of our combining method with real classifiers on the adult, hypothyroid, sick-euthyroid, and sick datasets from the UCI machine learning repository [6]. We chose five base classifiers from the YALE machine learning platform [7]: PART (a decision list algorithm), SMO (John Platt's Sequential Minimal Optimization), SimpleLogistic, VotedPerceptron, and Y-NaiveBayes. We used the default settings for all classifiers. The adult dataset has around 30,000 training points and 15,000 test points, the sick dataset has around 2000 training points and 700 test points, and the other two have around 1000 of each.

For each experiment, we estimate the joint distribution by training the base classifiers on a training set and counting the outcomes. (For datasets without a provided test set, we randomly chose half the data to hold out for testing.) We compute the likelihood ratios for all outcomes and order them. When there are outcomes with no positive or no negative training examples, we treat ·/0 as near-infinite and 0/· as near-zero. For an outcome with no training examples at all, we define 0/0 = 1. We derive a sequence of decision rules from the likelihood ratios computed on the training set.
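The zero-count conventions just described (·/0 near-infinite, 0/· near-zero, 0/0 = 1) are the only delicate part of the estimation. The following is a small sketch of how one might encode them; the function name and the choice of eps are ours, not the paper's.

```python
def estimated_ratio(pos_count, neg_count, n1, n0, eps=1e-9):
    """Likelihood-ratio estimate for one outcome, with the Section 5 conventions:
    x/0 is treated as near-infinite, 0/x as near-zero, and 0/0 as 1."""
    if pos_count == 0 and neg_count == 0:
        return 1.0                     # outcome never seen in training
    if neg_count == 0:
        return 1.0 / eps               # seen only under H1: near-infinite
    if pos_count == 0:
        return eps                     # seen only under H0: near-zero
    return (pos_count / n1) / (neg_count / n0)
```

Ranking outcomes by this estimate and sweeping a cutoff down the ranking yields the rule sequence used in the experiments.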
We can compute an optimal ROC curve for the combination by counting the number of true positives and false positives each rule achieves. On the test set we use the rules learned on the training set.

5.1 Results

The ROC graphs for our four experiments appear in Figure 2. The ROC curves in these experiments all rise very quickly and then flatten out, so we show only the range of PF for which the values are interesting. We can draw some general conclusions from these graphs. First, PART clearly outperforms the other base classifiers in three out of four experiments, though it seems to overfit on the hypothyroid dataset. The LR-ROC dominates the ROC curves of the base classifiers on both training and test sets. The ROC curves for the base classifiers are all strictly below the LR-ROC in results on the test sets. The results on the training sets seem to imply that the LR-ROC is primarily classifying like PART with a small boost from the other classifiers; however, the test set results (in particular, Figure 2b) demonstrate that the LR-ROC generalizes better than the base classifiers.

We also ran separate experiments to test how many classifiers our method could support in practice. Without any effort at optimization, we found that the estimation of the joint distribution and creation of the ROC curve finished within one minute for 20 classifiers (not including time to train the individual classifiers). Optimization may improve this number slightly, but the inherent exponential structure of the optimal ROC curve means that it is not possible to do significantly better (at the same rate, 30 classifiers would take over 12 hours and 40 classifiers almost a year and a half).

6 Related work

Our work is loosely related to ensemble methods such as bagging [8] and boosting [9] because it finds meta-classification rules over a set of base classifiers. However, bagging and boosting each take one base classifier and train it many times, resampling or reweighting the training data to generate classifier diversity [10] or increase the classification margin [11]. The decision rules applied to the generated classifiers are majority voting or weighted voting. In contrast, our method takes any combination of binary classifiers and finds optimal combination rules from the more general space of all binary functions.

Ranking algorithms, such as RankBoost [12], approach the problem of ranking points by score or preference. Although we present our methods in a different light, our decision rule can be interpreted as a ranking algorithm. RankBoost, however, is a boosting algorithm and therefore fundamentally different from our approach. Ranking can be used for classification by choosing a cutoff or threshold, and in fact ranking algorithms tend to optimize the common area under the ROC curve (AUC) metric. Although our method may have the side effect of maximizing the AUC, its formulation is different: instead of optimizing a single global metric, it solves a constrained optimization problem, maximizing PD for each bound on PF.

Another method for combining classifiers is stacking [13]. Stacking trains a meta-learner to combine the predictions of several base classifiers. The similarities between our method and stacking are much greater, and in fact our method might be considered a stacking method with a particular meta-classifier.
It can be difficult to show the improvement of stacking in general over selecting the best base-level classifier [14]; however, stacking has a useful interpretation as generalized cross-validation that makes it practical. Our analysis shows that our combination method is the optimal stacking meta-learner in the Neyman-Pearson sense, but incorporating the model validation aspect of stacking would make an interesting extension to our work.

7 Conclusion

In this paper we introduce a new way to analyze a combination of classifiers and their ROC curves. We give a method for combining classifiers and prove that it is optimal in the Neyman-Pearson sense.

This work raises several interesting questions. Although the algorithm presented in this paper has dramatically better performance than the doubly exponential brute-force approach, the exponential factor in its running time limits the number of classifiers that can be considered in practice. Can a good approximation algorithm approach optimality while having lower time complexity? Can we find conditions under which not all 2^n output combinations are possible (or likely)? Though in general we make no assumptions about independence, Theorem 2 proves that certain simple rules are optimal when we do know that the classifiers are independent. Theorem 3 proves that the ROCCH can be optimal when only n output combinations are possible. Perhaps other restrictions on the distribution of outcomes can lead to useful special cases. The robustness of our method to estimation errors is uncertain. In our experiments we found that smoothing did not improve generalization, but undoubtedly our method would benefit from better estimation of the outcome distribution and increased robustness.

References

[1] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning Journal, 42(3):203–231, March 2001.
[2] Peter A. Flach and Shaomin Wu. Repairing concavities in ROC curves. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), pages 702–707, August 2005.
[3] Tom Fawcett. ROC graphs: Notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Laboratories, Palo Alto, CA, January 2003. Updated March 2004.
[4] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character, 231:289–337, 1933.
[5] H. Vincent Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, second edition, 1988.
[6] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html.
[7] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.
[8] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, pages 148–156, Bari, Italy, 1996. Morgan Kaufmann.
[10] Thomas G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1–15, 2000.
[11] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee.
Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, October 1998.
[12] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research (JMLR), 4:933–969, 2003.
[13] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[14] Saso Dzeroski and Bernard Zenko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54:255–273, 2004.