Cost-sensitive Multi-class Classification from Probability Estimates

Deirdre B. O'Brien, Google Inc., Mountain View, CA (deirdre@google.com)
Maya R. Gupta, University of Washington, Seattle, WA (gupta@ee.washington.edu)
Robert M. Gray, Stanford University, Stanford, CA (rmgray@stanford.edu)

Abstract

For two-class classification, it is common to classify by setting a threshold on class probability estimates, where the threshold is determined by ROC curve analysis. An analog for multi-class classification is learning a new class partitioning of the multi-class probability simplex to minimize empirical misclassification costs. We analyze the interplay between systematic errors in the class probability estimates and cost matrices for multi-class classification. We explore the effect on the class partitioning of five different transformations of the cost matrix. Experiments on benchmark datasets with naive Bayes and quadratic discriminant analysis show the effectiveness of learning a new partition matrix compared to previously proposed methods.

1. Introduction

Many classifiers first estimate class probabilities $\hat{p}_k(x)$ for each class $k \in \{1, \ldots, K\}$, then classify a test sample $x$ as the class $\hat{y}(x)$ that minimizes the expected misclassification cost:

$$\hat{y}(x) = \arg\min_{i=1,\ldots,K} \sum_{j=1}^{K} c_{i|j}\,\hat{p}_j(x) = g(\hat{p}(x); c), \qquad (1)$$

where $c_{i|j}$ is the element in the $i$th row and $j$th column of the cost matrix $c$, and is the cost of classifying as class $i$ when the true class is $j$. We define the function $g$ as short-hand for this minimization.

Such probability-based classifiers can be interpreted as mapping each test sample to a point on the $\hat{p}$-simplex, where each corner of the simplex has $\hat{p}_j = 1$ for some $j$ and $\hat{p}_i = 0$ for all $i \neq j$, and the cost matrix $c$ induces a partitioning of the $\hat{p}$-simplex into regions assigned to each of the $K$ classes. However, the probability estimation can suffer from systematic errors, e.g. oversmoothing the estimate towards the class prior probabilities.

The main contribution of this paper is an analytic and experimental investigation of how changing the partitioning of the $\hat{p}$-simplex can reduce the effect of such systematic errors on classification loss, analogous to ROC analysis for two-class classification. First, we discuss systematic probability estimation errors and show how these errors can cause classification errors. Then in Section 3 we review methods to reduce the effect of such errors. In Section 4 we establish properties that describe how changing $c$ affects the class-partitioning of the $\hat{p}$-simplex. In Section 5 we propose learning a partitioning of the $\hat{p}$-simplex that seeks to minimize the empirical misclassification costs for the given $c$, and we provide experimental evidence of the effectiveness of our approach in Section 6.
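To make the decision rule (1) concrete, here is a minimal NumPy sketch; the cost values and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def classify(p_hat, c):
    """g(p_hat; c): return the class minimizing expected misclassification cost.

    p_hat : length-K vector of class probability estimates for one sample.
    c     : K x K cost matrix, c[i, j] = cost of deciding class i when the
            true class is j (the paper's c_{i|j}); classes are 0-indexed here.
    """
    expected_costs = c @ p_hat          # entry i is sum_j c_{i|j} * p_hat[j]
    return int(np.argmin(expected_costs))

# Hypothetical three-class example with asymmetric costs.
c = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
print(classify(np.array([0.5, 0.3, 0.2]), c))
```

With a 0/1 cost matrix this reduces to picking the most probable class; asymmetric costs shift the decision regions, which is what the rest of the paper manipulates.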
2. Systematic Error in Multi-class Probability Estimation

Friedman uses the term oversmoothing for cases where the probability estimates are systematically smoothed towards the class prior probabilities, and undersmoothing for cases where the class probability estimates produced are too confident, as is the case for 1-NN (Friedman, 1997). Other systematic errors in the probability estimates can occur; Niculescu-Mizil and Caruana have documented the systematic errors introduced by various methods of probability estimation for two-class classification (Niculescu-Mizil & Caruana, 2005). Here we provide illustrative examples of over- and undersmoothing in multi-class tasks.

2.1. A Naive Bayes' Example

Consider a three-class problem with discrete features. We estimate class probabilities for test samples using naive Bayes (Hastie et al., 2001). The three classes are equally likely, and the feature vector consists of two identical copies of the same feature. Thus the naive Bayes' assumption of independent features is clearly violated. Figure 1(a) shows pairings of a true class probability (marked with a circle) and the associated naive Bayes' estimated probability (marked with a triangle). The incorrect feature-independence assumption undersmooths the probability estimates, pushing them towards the edges of the simplex. When an estimated probability and the corresponding true probability fall in the same class partition, the undersmoothing does not cause any classification error. When the line attaching a triangle to a circle crosses a class partition line, a classification error occurs. One sees that undersmoothing does not cause errors given the 0/1 cost matrix (dashed lines), but causes many errors given an asymmetric cost matrix (solid lines).

2.2. A k-NN Example

Let $N$ be the number of training samples. For the $k$-NN classifier, as the number of nearest neighbors $k \to N$, the probability estimates are smoothed towards the class prior probabilities. Figure 1(b) illustrates an extreme example: $k = 2000$, $N = 3000$, and the samples are drawn iid and with equal probability from one of three class-conditional normal distributions. The oversmoothing does not cause errors given the 0/1 cost matrix (dashed lines), but causes many errors given an asymmetric cost matrix (solid lines).

Figure 1. (a) Naive Bayes example; (b) k-NN example. Circles mark the true probabilities, triangles mark the estimated probabilities, and each line connects a true probability to the corresponding estimate. The dashed lines mark the class partitioning of the $\hat{p}$-simplex induced by the 0-1 cost matrix, and the solid lines mark the class partitioning induced by an asymmetric cost matrix.

3. Related Work

Approaches to deal with the systematic errors in probability estimation can be analyzed in terms of the classification rule given in (1). Such approaches generally either change the partitioning of the $\hat{p}$-simplex or change the probability estimates.

3.1. Related Work in Two-class Classification

For the two-class case, the $\hat{p}$-simplex is a line segment from $\hat{p}_1(x) = 0$ to $\hat{p}_1(x) = 1$, and a scalar threshold $t$ partitions the two class regions. The optimal threshold $t^*$ derived with respect to (1) is

$$t^* = \frac{c_{1|2} - c_{2|2}}{c_{1|2} + c_{2|1} - c_{1|1} - c_{2|2}}. \qquad (2)$$

Classification errors can be reduced by changing the class-partitioning, that is, by specifying a threshold $t$ that reduces the effect of systematic errors in the class probability estimates. The most common approach uses receiver operating characteristic (ROC) curves (Egan, 1975; Hanley & McNeil, 1982). An ROC curve plots estimates of the probabilities $P_{\hat{Y}|Y}(2|2)$ versus $P_{\hat{Y}|Y}(2|1)$ for thresholds $t$, $0 \le t \le 1$, where the estimates are derived from training or validation data. For a given cost matrix, the desired point on the ROC curve is chosen and the associated threshold $t$ is used for the classifier (Noe, 1983; Provost & Fawcett, 2001).
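A small sketch of the two routes just described, with the paper's classes 1 and 2 stored 0-indexed: the theoretical threshold $t^*$ of (2), and an empirically chosen threshold that minimizes validation cost, which is the essence of picking an operating point from an ROC curve. The variable names and the simple grid over observed estimates are our own assumptions.

```python
import numpy as np

def theoretical_threshold(c):
    """t* from (2); c[i, j] = cost of deciding class i+1 when the truth is class j+1."""
    return (c[0, 1] - c[1, 1]) / (c[0, 1] + c[1, 0] - c[0, 0] - c[1, 1])

def empirical_threshold(p1_hat, y, c):
    """Choose the threshold on p_hat_1 that minimizes average cost on validation data.

    p1_hat : estimated probability of class 1 for each validation sample.
    y      : true labels, 0 for class 1 and 1 for class 2.
    """
    def avg_cost(t):
        y_hat = np.where(p1_hat > t, 0, 1)   # classify as class 1 when p_hat_1 exceeds t
        return c[y_hat, y].mean()
    return min(np.unique(p1_hat), key=avg_cost)
```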
Other methods fix the threshold at the theoretical optimum $t^*$ given by (2), and seek to improve classification by improving the probability estimates. Friedman considered adding a scalar $a$ to the probability estimates for class 1 (Friedman, 1997); it is easy to show that this method is equivalent to using a threshold $\tilde{t} = t^* - a$.

Zadrozny and Elkan use monotonic functions of the probability estimate $\hat{p}_1$ to give a calibrated estimate, and show great improvements in cost-sensitive classification when the calibrated probability estimates are used in place of the original estimates (Zadrozny & Elkan, 2001; Zadrozny & Elkan, 2002). One of their approaches builds on Platt's earlier work to transform support vector machine (SVM) scores into probability estimates using a sigmoid function (Platt, 2000); the same approach can be applied to probability estimates rather than SVM scores. Zadrozny and Elkan propose two other approaches to perform the calibration: binning and pair-adjacent violators. Binning takes the probability estimates obtained using cross-validation, orders these values, and then groups them into $B$ bins so that there is an equal number of samples in each bin (Zadrozny & Elkan, 2001). The upper and lower boundaries of each bin are determined, and for any test sample with a probability estimate falling in bin $b$, the updated probability estimate for class 1 is given by the fraction of validation samples in bin $b$ that belong to class 1. Pair-adjacent violators (PAV) monotonically transforms the probability estimates using isotonic regression (Ayer et al., 1955). It has been shown that applying the threshold $t^*$ from (2) to the calibrated probability estimates obtained using PAV is equivalent to using a threshold chosen by ROC analysis on the original probabilities (O'Brien, 2006).

3.2. Related Work in Multi-class Classification

Zadrozny and Elkan extended their two-class solutions to multi-class problems by breaking the classification task into a number of binary classification tasks and using error-correcting output codes (ECOC) to obtain multi-class probability estimates (Zadrozny & Elkan, 2002).

Other methods seek to extend the ROC thresholding approach to $K$-class classification. Instead of choosing a scalar threshold $t$, a partition of the $(K-1)$-dimensional $\hat{p}$-simplex must be specified. Mossman proposed a method for three-class tasks using a very restrictive partitioning of the simplex (Mossman, 1999):

$$\hat{y}(x) = \begin{cases} 1 & \text{if } \hat{p}_3(x) \le \theta_1 \text{ and } \hat{p}_2(x) - \hat{p}_1(x) \le \theta_2 \\ 2 & \text{if } \hat{p}_3(x) \le \theta_1 \text{ and } \hat{p}_2(x) - \hat{p}_1(x) > \theta_2 \\ 3 & \text{if } \hat{p}_3(x) > \theta_1. \end{cases} \qquad (3)$$

Lachiche and Flach proposed an alternative to the minimum expected misclassification cost assignment of (1) (Lachiche & Flach, 2003):

$$\hat{y}(x) = \arg\max_{i} w_i\,\hat{p}_i, \qquad (4)$$

where the $w_i$ are chosen by minimizing costs on the training set:

$$w^* = \arg\min_{w} \sum_{n=1}^{N} c_{(\arg\max_i w_i \hat{p}_i(x_n))\,|\,y_n}, \qquad (5)$$

and $x_n, y_n$ are the $n$th training sample and its associated class label. We refer to (4) and (5) as the LF method.

Mossman's method and the LF method can both be viewed as learning a new partitioning of the $\hat{p}$-simplex. In Section 5, we show how these partitions can be achieved by using different cost matrices in equation (1).
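The LF rule (4)-(5) can be prototyped with a simple coordinate search over the per-class weights. This is only a sketch of the idea under our own choices of grid and search order, not Lachiche and Flach's actual optimization procedure.

```python
import numpy as np

def lf_cost(w, P_hat, y, c):
    """Average cost of the rule y_hat = argmax_i w_i * p_hat_i  (eq. 4)."""
    y_hat = np.argmax(P_hat * w, axis=1)
    return c[y_hat, y].mean()

def fit_lf_weights(P_hat, y, c, grid=np.logspace(-2, 2, 41), passes=3):
    """Greedy coordinate search for the weights of (5).

    P_hat : N x K matrix of class probability estimates.
    y     : length-N vector of true labels in {0, ..., K-1}.
    c     : K x K cost matrix, c[i, j] = cost of deciding i when the truth is j.
    """
    K = P_hat.shape[1]
    w = np.ones(K)
    for _ in range(passes):
        for i in range(K):
            candidates = []
            for g in grid:
                w_try = w.copy()
                w_try[i] = g
                candidates.append((lf_cost(w_try, P_hat, y, c), g))
            w[i] = min(candidates)[1]      # keep the weight with lowest empirical cost
    return w
```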
MetaCost is a wrapper method that can be used with any classification algorithm and reduces the variance of the probability estimates by bootstrapping (Domingos, 1999); however, it is not designed to overcome systematic probability estimation errors.

4. The Effect of the Cost Matrix on the Class-Decision Boundaries

In this section we establish how different changes in the cost matrix affect the class partitioning of the $\hat{p}$-simplex enacted by (1). In Section 5 we use these properties to propose a new method to reduce the effect of systematic errors in probability estimation.

For a particular cost matrix $c$, and any two classes $i, k$ that are adjacent in the partition of the $\hat{p}$-simplex, the partition-boundary between them is described by the hyperplane

$$\sum_{j=1}^{K} c_{i|j}\,\hat{p}_j = \sum_{j=1}^{K} c_{k|j}\,\hat{p}_j. \qquad (6)$$

We restrict attention to cost matrices where the cost of correct assignment is always less than the cost of incorrect assignment, that is, $c_{j|j} < c_{i|j}$ for all $i \neq j$.

Property 1: For any $\hat{p}$, the assigned class is the same for cost matrices $c$ and $\gamma c$ for any positive scalar $\gamma$.

Proof: The minimization in (1) is unaffected by replacing $c_{i|j}$ by $\gamma c_{i|j}$ for all $i, j$.

Property 2: If the cost matrix $c$ is full rank, then there is a point $e$ where all two-class boundaries as described by (6) intersect, and $e$ may occur outside the probability simplex. We term $e$ the "equal risk point".

Proof: If $c$ is full rank then $e = c^{-1}\mathbf{1} / (\mathbf{1}^T c^{-1}\mathbf{1})$ solves (6) for all classes. However, it can happen that $\mathbf{1}^T c^{-1}\mathbf{1} = 0$, in which case the equal risk point can be said to be at infinity. If $c$ is not full rank there may still be a unique equal risk point. In general, $e$ must solve

$$\begin{bmatrix} c & -\mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix}\begin{bmatrix} e \\ \rho \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix} \quad \text{for some } \rho \in \mathbb{R}.$$

If the above matrix is full rank then there is a unique solution for $e$; otherwise the system is underdetermined and there is a hyperplane of solutions, or the system can be overdetermined and there is no solution.
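A quick numerical check of the equal risk point of Property 2, using a hypothetical full-rank cost matrix (not from the paper):

```python
import numpy as np

c = np.array([[0.0, 1.0, 4.0],      # hypothetical full-rank cost matrix
              [2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
ones = np.ones(3)

c_inv_1 = np.linalg.solve(c, ones)   # c^{-1} 1
e = c_inv_1 / (ones @ c_inv_1)       # equal risk point; may lie outside the simplex

print(e, e.sum())   # components sum to one
print(c @ e)        # the K expected costs are all equal at e
```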
Property 3: Adding a constant to all costs for a particular true class (equivalently, adding a constant to each term in any column of the cost matrix) does not affect the assignment.

Proof: The minimization in (1) is unaffected by changing $c_{i|j}$ to $c_{i|j} + \kappa_j$ for all $i$.

Property 4: The class-partitioning produced by any cost matrix $c$ can equivalently be produced by some cost matrix $\tilde{c}$ where $\tilde{c}_{i|i} = 0$ and $\tilde{c}_{i|j} > 0$ for $i \neq j$.

Proof: Follows directly from Property 3.

Property 5: (See Fig. 2b) Adding a constant $\delta$ to the cost of assignment to class $i$ irrespective of the true class (that is, adding $\delta$ to each term in row $i$ of a cost matrix) produces a new class-partitioning with partition boundaries parallel to those of the original.

Proof: This change only affects the two-class boundary equations specified by (6) for class $i$ and each class $k$:

$$\sum_{j=1}^{K} (c_{i|j} + \delta)\,\hat{p}_j = \sum_{j=1}^{K} c_{k|j}\,\hat{p}_j, \quad \text{where} \quad \sum_{j=1}^{K} (c_{i|j} + \delta)\,\hat{p}_j = \sum_{j=1}^{K} c_{i|j}\,\hat{p}_j + \delta$$

since the $\hat{p}_j$ sum to one. Thus the new boundary between class $i$ and $k$ is parallel to the original boundary between class $i$ and $k$. To maintain $c_{j|j} < c_{i|j}$ for all $i \neq j$ requires $\max_{k \neq i}(c_{k|k} - c_{i|k}) < \delta < \min_{k \neq i}(c_{k|i} - c_{i|i})$.

Property 6: (See Fig. 2c) Scaling all costs where the true class is $\ell$ by a positive constant $\alpha$ (that is, multiplying column $\ell$ of $c$ by $\alpha$) moves an equal risk point along the line joining it to the corner of the simplex where $\hat{p}_\ell = 1$. The intersections of the class boundaries with the $\hat{p}_\ell = 0$ plane are unchanged.

Proof: To prove that the new equal risk point $\tilde{e}$ is on the line joining the original equal risk point $e$ and the corner of the simplex where $\hat{p}_\ell = 1$, we show that there exists a constant $\tau$ such that

$$\tilde{e}_j = \tau e_j + (1 - \tau)\,I(j = \ell), \qquad (7)$$

for all $j$, where $I$ is the indicator function. From (6), multiplying column $\ell$ by $\alpha$, and requiring equal risk for classes $i$ and $k$:

$$\sum_{j=1,\, j \neq \ell}^{K} (c_{i|j} - c_{k|j})\,\tilde{e}_j + (c_{i|\ell} - c_{k|\ell})(\alpha \tilde{e}_\ell) = 0. \qquad (8)$$

First note that if $e_\ell = 0$, then $e$ remains an equal risk point for the transformed cost matrix. Otherwise, for any $e$, we write $e_j = s_j e_\ell$. Comparing (6) and (8), there exists $\tilde{e}$ such that $\tilde{e}_j = s_j(\alpha \tilde{e}_\ell)$ for $j \neq \ell$. Thus,

$$\frac{\tilde{e}_j}{e_j} = \frac{\alpha \tilde{e}_\ell}{e_\ell}. \qquad (9)$$

Let $\tau = \alpha \tilde{e}_\ell / e_\ell$; then (9) establishes (7) for $j \neq \ell$. Also,

$$\sum_{j \neq \ell} \tilde{e}_j = \tau \sum_{j \neq \ell} e_j \qquad (10)$$

$$\Rightarrow \quad 1 - \tilde{e}_\ell = \tau\,(1 - e_\ell), \qquad (11)$$

where (11) follows from (10) since the components of $\tilde{e}$ and $e$ both sum to 1. This establishes (7) for $j = \ell$. Lastly, by setting $\hat{p}_\ell = 0$ in equation (6), it is evident that changes in $c_{i|\ell}$ and $c_{j|\ell}$ will not affect the intersections of the class boundaries with the $\hat{p}_\ell = 0$ plane.

Property 7: (See Fig. 2d) Scaling all costs where the assigned class is $i$ by a positive constant $\alpha$ (that is, multiplying all elements in row $i$ by $\alpha$) moves an equal risk point $e$ along the hyperplane where all classes but class $i$ have equal expected misclassification costs.

Proof: Let $\tilde{c}_{j|k} = c_{j|k}$ for all $j \neq i$, and let $\tilde{c}_{i|k} = \alpha c_{i|k}$. Then the equal risk point $\tilde{e}$ produced by $\tilde{c}$ solves the same set of class boundary equations (6) specifying $e$, except that

$$\sum_{j=1}^{K} (c_{i|j} - c_{K|j})\,e_j = 0$$

becomes

$$\sum_{j=1}^{K} (\alpha c_{i|j} - c_{K|j})\,\tilde{e}_j = 0.$$

Because the constraints specifying that the other classes have equal misclassification costs still apply, the new equal risk point $\tilde{e}$ must occur along the hyperplane specified by that subset of the constraints.

Figure 2. (a) Initial Cost Matrices; (b) Parallel Cost Matrices ($\delta = 0.3$); (c) Column Multiply ($\alpha = 1.5$); (d) Row Multiply ($\alpha = 1.75$). This figure illustrates Properties 5, 6, and 7 described in Section 4 for a three-class classification. The partition produced by the cost matrix is marked by solid lines: a 0/1 cost matrix for the left figures, and an asymmetric cost matrix for the right figures. The corners of the simplex are marked such that $\hat{p}_j = 1$ at corner $j$, and if a test sample has estimated class probability $\hat{p}$ that falls in the region including corner $j$, then the estimated class label is $j$. For each of the figures, manipulations are applied to the class 1 elements of the cost matrix. The dashed lines show the initial cost matrix partitions, and the gray lines help to illustrate the properties.

5. Learning the Partition Matrix

We propose changing the class partitioning so that it corrects for the systematic error in the probability estimates. Changing the partition of the multi-class $\hat{p}$-simplex is analogous to the two-class practice of changing the threshold on $\hat{p}_1(x)$.

We split the cost matrix's role into two separate entities: a partition matrix (which partitions the $\hat{p}$-simplex with linear boundaries), and the misclassification cost matrix that specifies how misclassification errors are to be scored. From now on we will use the term partition matrix for the former role, and the term cost matrix only for the latter role.

Given a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_n$ has class probability estimate vector $\hat{p}(x_n)$, we propose to use a partition matrix $a^*$ that solves

$$a^* = \arg\min_{a} \sum_{n} c_{g(\hat{p}(x_n);\, a)\,|\,y_n}, \qquad (12)$$

where the function $g$ in (12) is defined in (1).

To avoid the issue of overfitting in learning a partition matrix, we restrict the partition to have linear boundaries that are parallel to the original decision boundaries produced by the cost matrix (see Fig. 2b). We have also considered learning partition matrices with different constraints, including partition matrices with all $K^2 - K$ free parameters, partition matrices that are column-multiply modifications of the original cost matrix (see Fig. 2c), and row-multiply modifications (see Fig. 2d). We found these different constraints resulted in similar performance, with the parallel partitioning working consistently well (O'Brien, 2006). By Property 4 stated in Section 4, without loss of generality we consider only partition matrices $a$ where $a_{i|i} = 0$ and $a_{i|j} > 0$ for $i \neq j$.
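The objective in (12) is just an empirical cost count over the decisions that a candidate partition matrix induces; a minimal sketch (our variable names, 0-indexed classes):

```python
import numpy as np

def empirical_cost(a, P_hat, y, c):
    """Empirical objective of (12): total cost, under cost matrix c, of the
    decisions induced by partition matrix a.

    a, c  : K x K matrices; c[i, j] = cost of deciding class i when the truth is j.
    P_hat : N x K matrix of class probability estimates.
    y     : length-N integer vector of true labels.
    """
    y_hat = np.argmin(P_hat @ a.T, axis=1)   # g(p_hat(x_n); a) for every sample
    return c[y_hat, y].sum()
```

Setting a equal to c recovers the ordinary rule (1); the parameterizations below search over a while always scoring with the fixed cost matrix c.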
From Property 5, adding $\delta_i$ to row $i$ of the cost matrix will yield a partition matrix with partition boundaries parallel to the partition boundaries induced by $c$. To maintain the requirement that $a_{i|i} = 0$, we subtract the same constant from column $i$, which does not affect the partition boundaries (Property 3). Thus, given the cost matrix $c$, the partition matrix is $a$ where $a_{i|j} = c_{i|j} + \delta_i - \delta_j$. This approach requires learning the parameters $\delta_i$ for $i = 1, \ldots, K$.

Related methods can also be viewed as applications of (12) but with different restrictions on $a$. Mossman's method for three classes (Mossman, 1999) implicitly requires $a$ to be of the form

$$a = \begin{bmatrix} 0 & 1 & L - \dfrac{\theta_2}{1 - \theta_2} \\[6pt] \dfrac{1 + \theta_2}{1 - \theta_2} & 0 & L \\[6pt] \dfrac{L\theta_1}{1 - \theta_1} & \dfrac{L\theta_1}{1 - \theta_1} & 0 \end{bmatrix}, \qquad (13)$$

where $\theta_1$ and $\theta_2$ are the thresholds used in (3) and $L \gg 1$. The LF method (Lachiche & Flach, 2003) is equivalent to choosing a partition matrix $a$ such that $a_{i|j} = w_j z_{i|j}$, where $z_{i|j}$ is the 0-1 cost matrix and $w_j$ is the weight used in equations (4) and (5). Thus the LF method is equal to a column-multiply of a 0-1 cost matrix (see Property 6 stated in Section 4, and Fig. 2c), whereas our proposed method enacts a parallel shift based on the actual cost matrix $c$.

5.1. Optimizing the Partition Matrix

We learn the new partition matrix by minimizing an empirical loss calculated on a validation set of labeled samples using a greedy approach. The new partition matrix $a$ is initialized to the cost matrix $c$. Then each free parameter is updated in turn. Let $a$ denote the current partition matrix; the new partition matrix $\tilde{a}$ will have $\tilde{a}_{i|j} = a_{i|j} + \delta_i$ and $\tilde{a}_{j|i} = a_{j|i} - \delta_i$ for all $j \neq i$. Suppose $\delta_i = -\infty$ and interpret $\tilde{a}$ as a cost matrix; then there would be an infinitely negative cost to assigning a sample as class $i$, and thus every training sample would be assigned to class $i$. Suppose one increased $\delta_i$ from negative infinity. For different values of $\delta_i$ it would become more cost-effective to classify each of the training samples as a different class; call this classification choice $g^i(\hat{p}(x_n))$, where $g^i(\hat{p}(x_n)) = \arg\min_{k \neq i} \sum_{j=1}^{K} a_{k|j}\,\hat{p}_j(x_n)$. Let $\beta_n^i$ denote the changepoint value: for $\delta_i < \beta_n^i$, training sample $n$ would be assigned to class $i$, and for $\delta_i > \beta_n^i$, training sample $n$ would be assigned to class $g^i(\hat{p}(x_n))$. We find these $N$ changepoints $\beta_n^i$, for $n = 1, \ldots, N$, by solving the $N$ equations

$$\sum_{j=1}^{K} a_{g^i(\hat{p}(x_n))\,|\,j}\,\hat{p}_j(x_n) = \sum_{j=1}^{K} \left(a_{i|j} + \beta_n^i\right)\hat{p}_j(x_n).$$

Re-order the training data by their changepoints, so that $\{x_k, y_k\}$ denotes the training point with the $k$th largest changepoint. Then $\delta_i$ is set to the $N^*$th largest changepoint, where

$$N^* = \arg\min_{N_0 = 1, 2, \ldots, N} \; \sum_{n=N_0+1}^{N} c_{g^i(\hat{p}(x_n))\,|\,y_n} + \sum_{n=1}^{N_0} c_{i|y_n}.$$

The parameters are updated in order of class size, from most populous to least. Preliminary experiments provided evidence that multiple passes through the parameters did not improve the final classification performance, and that performance was fairly robust to the parameter ordering.
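A sketch of one coordinate update from Section 5.1, in our own notation and with 0-indexed classes; it computes the changepoints, scans the sorted cumulative costs, and applies the chosen shift (the bounds on $\delta_i$ from Property 5 are not enforced here).

```python
import numpy as np

def coordinate_update(a, P_hat, y, c, i):
    """One greedy update for class i: choose delta_i by the changepoint scan.

    a     : current K x K partition matrix (zero diagonal).
    P_hat : N x K matrix of class probability estimates.
    y     : length-N integer vector of true labels.
    c     : K x K misclassification cost matrix.
    """
    N, K = P_hat.shape
    risks = P_hat @ a.T                      # risks[n, k] = sum_j a_{k|j} p_hat_j(x_n)
    others = [k for k in range(K) if k != i]
    g_i = np.array(others)[np.argmin(risks[:, others], axis=1)]  # best non-i class per sample
    beta = risks[np.arange(N), g_i] - risks[:, i]                # changepoints beta_n^i

    order = np.argsort(-beta)                # sort samples by decreasing changepoint
    cost_i = c[i, y[order]]                  # cost if a sample is assigned to class i
    cost_g = c[g_i[order], y[order]]         # cost if assigned to its best non-i class
    # totals[m]: cost if the m+1 samples with the largest changepoints go to class i
    totals = np.cumsum(cost_i) + (cost_g.sum() - np.cumsum(cost_g))
    delta_i = beta[order][int(np.argmin(totals))]

    a_new = a.copy()
    a_new[i, :] += delta_i                   # add delta_i to row i ...
    a_new[:, i] -= delta_i                   # ... subtract it from column i (keeps a_{i|i} = 0)
    return a_new
```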
6. Experiments

Experiments with UCI benchmark datasets compare the proposed parallel-partition matrix method to MetaCost (MC) (Domingos, 1999) and to the LF method (Lachiche & Flach, 2003). For two-class problems the proposed partition matrix methods are equivalent to ROC analysis, and therefore only multi-class problems are considered here.

There are two basic variants of MetaCost: in the first variant, the probability estimates for a sample are based only on training data not including that sample, while in the other variant all-inclusive estimates are made. The results reported here are the better of the two variants for each dataset.

Randomized ten-fold cross-validation was done 100 times for each method and each dataset. In the cross-validation, 1/10 of the data was set aside as test data. For MetaCost, 100 resamples were generated using the remaining nine folds. For the proposed methods and the LF method, the remaining nine folds were subject to a nine-fold cross-validation, so that 8/10 of the data (eight folds) were used to estimate the probabilities for each of the nine folds. Then the partition matrix $a$ and the LF parameters were estimated using the nine folds' probability estimates. Finally, the learned cost-sensitive classifier was applied to the withheld 1/10 of the test data.

Experiments were done with two different probability estimation methods. For datasets with discrete features, multinomial naive Bayes was used with Laplacian correction for estimating probabilities. Any continuous features used with naive Bayes were quantized to 21 values. For datasets with continuous features, regularized quadratic discriminant analysis (QDA) (Friedman, 1989) was used. Each class's estimated covariance matrix was regularized as

$$\hat{\Sigma} = (1 - \alpha - \beta)\,\hat{\Sigma}_{ML} + \alpha\,\bar{\Sigma} + \beta\,\frac{\mathrm{trace}(\hat{\Sigma}_{ML})}{d}\,I,$$

where $\hat{\Sigma}_{ML}$ is the maximum likelihood estimate of the full covariance matrix, $\bar{\Sigma}$ is the pooled maximum likelihood estimate, $d$ is the dimensionality of the feature vector, and $\alpha$ and $\beta$ were increased from zero until the condition number of $\hat{\Sigma}$ was less than $10^6$.
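A sketch of that covariance regularization; this is our implementation of the stated formula, and the fixed step size and joint increase of $\alpha$ and $\beta$ are assumptions, not necessarily the authors' exact schedule.

```python
import numpy as np

def regularize_covariance(sigma_ml, sigma_pooled, max_condition=1e6, step=0.05):
    """Sigma_hat = (1 - alpha - beta) Sigma_ML + alpha Sigma_pooled
                   + beta (trace(Sigma_ML) / d) I,
    increasing alpha and beta from zero until cond(Sigma_hat) < max_condition."""
    d = sigma_ml.shape[0]
    identity_term = (np.trace(sigma_ml) / d) * np.eye(d)
    alpha = beta = 0.0
    sigma = sigma_ml
    while np.linalg.cond(sigma) >= max_condition and alpha + beta < 1.0:
        alpha += step
        beta += step
        sigma = (1.0 - alpha - beta) * sigma_ml + alpha * sigma_pooled + beta * identity_term
    return sigma
```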