Risk minimization, probability elicitation, and cost-sensitive SVMs

Hamed Masnadi-Shirazi (hmasnadi@ucsd.edu)
Nuno Vasconcelos (nuno@ucsd.edu)
Statistical Visual Computing Laboratory, University of California, San Diego, La Jolla, CA 92093

Abstract

A new procedure for learning cost-sensitive SVM classifiers is proposed. The SVM hinge loss is extended to the cost-sensitive setting, and the cost-sensitive SVM is derived as the minimizer of the associated risk. The extension of the hinge loss draws on recent connections between risk minimization and probability elicitation. These connections are generalized to cost-sensitive classification, in a manner that guarantees consistency with the cost-sensitive Bayes risk and the associated Bayes decision rule. This ensures that optimal decision rules, under the new hinge loss, implement the Bayes-optimal cost-sensitive classification boundary. Minimization of the new hinge loss is shown to be a generalization of the classic SVM optimization problem, and can be solved by identical procedures. The resulting algorithm avoids the shortcomings of previous approaches to cost-sensitive SVM design, and has superior experimental performance.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

The most popular strategy for the design of classification algorithms is to minimize the probability of error, assuming that all misclassifications have the same cost. The resulting decision rules are usually denoted as cost-insensitive. However, in many important applications of machine learning, such as medical diagnosis, fraud detection, or business decision making, certain types of error are much more costly than others. Other applications involve significantly unbalanced datasets, where examples from different classes appear with substantially different probability. It is well known, from Bayesian decision theory, that under either of these two situations (uneven costs or uneven probabilities), the optimal decision rule deviates from the optimal cost-insensitive rule in the same manner. In both cases, reliance on cost-insensitive algorithms for classifier design can be highly sub-optimal. While this makes it obviously important to develop cost-sensitive extensions of state-of-the-art machine learning techniques, the current understanding of such extensions is limited.

In this work we consider the support vector machine (SVM) architecture (Cortes & Vapnik, 1995). Although SVMs are based on a very solid learning-theoretic foundation, and have been successfully applied to many classification problems, it is not well understood how to design cost-sensitive extensions of the SVM learning algorithm. The standard, or cost-insensitive, SVM is based on the minimization of a symmetric loss function (the hinge loss) that does not have an obvious cost-sensitive generalization. In the literature, this problem has been addressed by approaches that can be grouped into three general categories.

The first is to address the problem as one of data processing, by adopting resampling techniques that under-sample the majority class and/or over-sample the minority class (Kubat & Matwin, 1997; Chawla et al., 2002; Akbani et al., 2004). Resampling is not easy when the classification imbalance is due to either different misclassification costs (it is not clear what the class probabilities should be) or an extreme imbalance in class probabilities (sample starvation for classes of very low probability). It also does not guarantee that the learned SVM will change, since resampling could have no effect on the support vectors.

The second class of approaches (Amari & Wu, 1999; Wu & Chang, 2003; 2005) involves kernel modifications. These methods are based on conformal transformations of the input or feature space, obtained by modifying the kernel used by the SVM. They are somewhat unsatisfactory, due to the implicit assumption that a linear SVM cannot be made cost-sensitive. It is unclear why this should be the case.

The third, and most widely researched, approach is to modify the SVM algorithm in order to achieve cost sensitivity. This is done in one of two ways. The first is a naive method, known as boundary movement (BM-SVM), which shifts the decision boundary by simply adjusting the threshold of the standard SVM (Karakoulas & Shawe-Taylor, 1999). Under Bayesian decision theory, this would be the optimal strategy if the class posterior probabilities were available. However, it is well known that SVMs do not predict these probabilities accurately. While a literature has developed in the area of probability calibration (Platt, 2000; Elkan, 2001), calibration techniques do not aid the cost-sensitive performance of threshold manipulation. This follows from the fact that all calibration techniques rely on an invertible (monotonic and one-to-one) transformation of the SVM output. Because the manipulation of a threshold at either the input or output of such a transformation produces the same receiver-operating-characteristic (ROC) curve, calibration does not change cost-sensitive classification performance. The boundary movement method is also obviously flawed when the data is non-separable, in which case cost-sensitive optimality is expected to require a modification of both the normal $w$ of the separating plane and the classifier threshold $b$.

The second proposal to modify SVM learning is known as the biased penalties (BP-SVM) method (Bach et al., 2006; Lin et al., 2002; Davenport et al., 2006; Wu & Srihari, 2003; Chang & Lin, 2001). This consists of introducing different penalty factors $C_1$ and $C_{-1}$ for the positive and negative SVM slack variables during training. It is implemented by transforming the primal SVM problem into

    \arg\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\Big[C_1 \sum_{\{i|y_i=1\}} \xi_i + C_{-1} \sum_{\{i|y_i=-1\}} \xi_i\Big]    (1)
    \text{s.t. } y_i(w^T x_i + b) \geq 1 - \xi_i.

The biased penalties method also suffers from an obvious flaw, which is converse to that of the boundary movement method: it has limited ability to enforce cost-sensitivity when the training data is separable. For large slack penalty $C$, the slack variables $\xi_i$ are zero-valued and the optimization above degenerates into that of the standard SVM, where the decision boundary is placed midway between the two classes (rather than assigning a larger margin to one of them).

In this work we propose an alternative strategy for the design of cost-sensitive SVMs. This strategy is fundamentally different from previous attempts, in the sense that it does not directly manipulate the standard SVM learning algorithm. Instead, we extend the SVM hinge loss, and derive the optimal cost-sensitive learning algorithm as the minimizer of the associated risk. The derivation of the new cost-sensitive hinge loss draws on recent connections between risk minimization and probability elicitation (Masnadi-Shirazi & Vasconcelos, 2009). Such connections are generalized to the case of cost-sensitive classification. It is shown that it is always possible to specify the predictor and conditional risk functions desired for the SVM classifier, and derive the loss for which these are optimal. A sufficient condition for the cost-sensitive Bayes-optimality of the predictor is then provided, as well as necessary conditions for conditional risks that approximate the cost-sensitive Bayes risk. Together, these conditions enable the design of a new hinge loss which is minimized by an SVM that 1) implements the cost-sensitive Bayes decision rule, and 2) approximates the cost-sensitive Bayes risk. It is also shown that the minimization of this loss is a generalization of the classic SVM optimization problem, and can be solved by identical procedures. The resulting algorithm avoids the shortcomings of previous methods, producing cost-sensitive decision rules for both cases of separable and inseparable training data. Experimental results show that these advantages result in better cost-sensitive classification performance than previous solutions.

The paper is organized as follows. Section 2 briefly reviews the probability elicitation view of loss function design (Masnadi-Shirazi & Vasconcelos, 2009). Section 3 then generalizes the connections between probability elicitation and risk minimization to the cost-sensitive setting. In Section 4, these connections are used to derive the new SVM loss and algorithm. Finally, Section 5 presents an experimental evaluation that demonstrates improved performance of the proposed cost-sensitive SVM over previous methods.
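As a point of reference, the biased-penalties baseline of (1) is available in off-the-shelf SVM packages. The minimal sketch below shows one way to instantiate it with scikit-learn, whose class_weight argument scales the slack penalty C for each class; the toy data, kernel, and penalty values are illustrative assumptions, and this is the BP-SVM baseline, not the cost-sensitive SVM derived in this paper.

import numpy as np
from sklearn.svm import SVC

# Toy two-class problem with labels in {-1, +1}.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2.0, rng.randn(50, 2) - 2.0])
y = np.hstack([np.ones(50), -np.ones(50)])

# Biased-penalties SVM (BP-SVM): per-class penalty factors C_1 and C_{-1}
# scale the slack penalty C, as in (1).  scikit-learn exposes this through
# class_weight, which multiplies C for each class.
C, C1, Cm1 = 1.0, 4.0, 2.0
bp_svm = SVC(kernel="linear", C=C, class_weight={1: C1, -1: Cm1})
bp_svm.fit(X, y)
print("training accuracy:", bp_svm.score(X, y))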
2. Probability elicitation and the risk

A classifier $h$ maps a feature vector $x \in \mathcal{X}$ to a class label $y \in \{-1, 1\}$. This mapping can be written as $h(x) = \mathrm{sign}[f(x)]$ for some function $f : \mathcal{X} \to \mathbb{R}$, which is denoted as the classifier predictor. Feature vectors and class labels are drawn from probability distributions $P_X(x)$ and $P_Y(y)$, respectively. Given a non-negative loss function $L(x, y)$, the classifier is optimal if it minimizes the risk $R(f) = E_{X,Y}[L(h(x), y)]$. This is equivalent to minimizing the conditional risk

    E_{Y|X}[L(h(x), y)|X = x] = P_{Y|X}(1|x)\,L(h, 1) + (1 - P_{Y|X}(1|x))\,L(h, -1),    (2)

for all $x \in \mathcal{X}$. Classifiers are frequently designed to be optimal with respect to the zero-one loss

    L_{0/1}(f, y) = \frac{1 - \mathrm{sign}(yf)}{2} = \begin{cases} 0, & \text{if } y = \mathrm{sign}(f); \\ 1, & \text{if } y \neq \mathrm{sign}(f), \end{cases}    (3)

where we omit the dependence on $x$ for notational simplicity. The associated conditional risk is

    C_{0/1}(\eta, f) = \eta\,\frac{1 - \mathrm{sign}(f)}{2} + (1 - \eta)\,\frac{1 + \mathrm{sign}(f)}{2} = \begin{cases} 1 - \eta, & \text{if } f \geq 0; \\ \eta, & \text{if } f < 0, \end{cases}    (4)

where $\eta(x) = P_{Y|X}(1|x)$. This risk is minimized by any predictor $f^*$ such that

    \begin{cases} f^*(x) > 0 & \text{if } \eta(x) > \eta^* \\ f^*(x) = 0 & \text{if } \eta(x) = \eta^* \\ f^*(x) < 0 & \text{if } \eta(x) < \eta^* \end{cases}    (5)

with $\eta^* = \frac{1}{2}$. Examples of optimal predictors include $f^* = 2\eta - 1$ and $f^* = \log\frac{\eta}{1-\eta}$. The associated optimal classifier $h^* = \mathrm{sign}[f^*]$ is the well known Bayes decision rule, and the associated minimum conditional (zero-one) risk is

    C^*_{0/1}(\eta) = \eta\,\frac{1 - \mathrm{sign}(2\eta - 1)}{2} + (1 - \eta)\,\frac{1 + \mathrm{sign}(2\eta - 1)}{2}.    (6)

A number of other losses have been proposed in the literature. Popular examples include the exponential loss of boosting, the binomial loss of logistic regression, and the hinge loss of SVMs. These losses are of the form $L_\phi(f, y) = \phi(yf)$, for different functions $\phi(\cdot)$. The associated conditional risk

    C_\phi(\eta, f) = \eta\,\phi(f) + (1 - \eta)\,\phi(-f)    (7)

is minimized by the predictor

    f^*_\phi(\eta) = \arg\min_f C_\phi(\eta, f),    (8)

leading to the minimum conditional risk function $C^*_\phi(\eta) = C_\phi(\eta, f^*_\phi)$.

Conditional risk minimization is closely related to classical probability elicitation in statistics (Savage, 1971). Here, the goal is to find the probability estimator $\hat\eta$ that maximizes the expected reward

    I(\eta, \hat\eta) = \eta\,I_1(\hat\eta) + (1 - \eta)\,I_{-1}(\hat\eta),    (9)

where $I_1(\hat\eta)$ is the reward for prediction $\hat\eta$ when the event $y = 1$ holds and $I_{-1}(\hat\eta)$ the corresponding reward when $y = -1$. The functions $I_1(\cdot)$, $I_{-1}(\cdot)$ should be such that the expected reward is maximal when $\hat\eta = \eta$, i.e.

    I(\eta, \hat\eta) \leq I(\eta, \eta) = J(\eta),    (10)

with equality if and only if $\hat\eta = \eta$. The following theorem establishes the conditions under which this holds.

Theorem 1 (Savage, 1971). Let $I(\eta, \hat\eta)$ and $J(\eta)$ be as defined in (9) and (10). Then 1) $J(\eta)$ is convex and 2) (10) holds if and only if

    I_1(\eta) = J(\eta) + (1 - \eta)\,J'(\eta)    (11)
    I_{-1}(\eta) = J(\eta) - \eta\,J'(\eta).    (12)

It follows from the theorem that, starting from any convex $J(\eta)$, it is possible to derive $I_1(\cdot)$, $I_{-1}(\cdot)$ so that (10) holds. The next theorem connects this result to the problem of classifier design.

Theorem 2 (Masnadi-Shirazi & Vasconcelos, 2009). Let $J(\eta)$ be as defined in (10) and $f$ a continuous function. If the following properties hold

1. $J(\eta) = J(1 - \eta)$,
2. $f$ is invertible with symmetry

    f^{-1}(-v) = 1 - f^{-1}(v),    (13)

then the functions $I_1(\cdot)$ and $I_{-1}(\cdot)$ derived with (11) and (12) satisfy the following equalities

    I_1(\eta) = -\phi(f(\eta))    (14)
    I_{-1}(\eta) = -\phi(-f(\eta)),    (15)

with

    \phi(v) = -J[f^{-1}(v)] - (1 - f^{-1}(v))\,J'[f^{-1}(v)].    (16)

This theorem connects (9) and (7), establishing a new path for the design of learning algorithms. Rather than specifying a loss $\phi$ and minimizing $C_\phi(\eta, f)$, so as to obtain whatever optimal predictor $f^*_\phi$ and minimum expected risk $C^*_\phi(\eta)$ result, it is possible to specify $f^*_\phi$ and $C^*_\phi(\eta)$ and derive, from (16) with $J(\eta) = -C^*_\phi(\eta)$, the underlying loss $\phi$. The only conditions are that $C^*_\phi(\eta) = C^*_\phi(1 - \eta)$ and that (13) holds for $f^*_\phi$. Note that 1) the symmetry of (13) guarantees that $f^*_\phi$ meets the necessary conditions of (5) for predictor optimality (see Theorem 4), and 2) the condition $C^*_\phi(\eta) = C^*_\phi(1 - \eta)$ encodes the fact that there is no preference for different types of errors: the risk, or expected loss, is the same for any two $x_1$ and $x_2$ at the same distance from the boundary, where distance is measured in units of posterior probability ($|\eta(x) - 1/2|$).
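For instance, the construction of Theorem 2 can be checked directly on the binomial (logistic) loss mentioned above; the short worked computation below is added only as an illustration of (11)-(16).

\[
J(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta), \qquad f(\eta) = \log\tfrac{\eta}{1-\eta}, \qquad f^{-1}(-v) = \tfrac{1}{1+e^{v}} = 1 - f^{-1}(v),
\]
\[
J'(\eta) = \log\tfrac{\eta}{1-\eta} = f(\eta), \qquad
I_1(\eta) = J(\eta) + (1-\eta)J'(\eta) = \log\eta, \qquad
I_{-1}(\eta) = J(\eta) - \eta J'(\eta) = \log(1-\eta),
\]
\[
\phi(v) = \log(1 + e^{-v}) \;\Rightarrow\; -\phi(f(\eta)) = \log\eta = I_1(\eta), \qquad -\phi(-f(\eta)) = \log(1-\eta) = I_{-1}(\eta),
\]

in agreement with (14)-(16), with minimum conditional risk $C^*_\phi(\eta) = -J(\eta)$, the binary entropy.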
3. Cost sensitive losses and classifier design

In this section we extend the connections between risk minimization and probability elicitation to the cost-sensitive setting. We start by reviewing cost-sensitive losses.

3.1. Cost-sensitive losses

The cost-sensitive extension of the zero-one loss is

    L_{C_1,C_{-1}}(f, y) = \frac{1 - \mathrm{sign}(yf)}{2}\Big[C_1\,\frac{1 - \mathrm{sign}(f)}{2} + C_{-1}\,\frac{1 + \mathrm{sign}(f)}{2}\Big]
    = \begin{cases} 0, & \text{if } y = \mathrm{sign}(f); \\ C_1, & \text{if } y = 1 \text{ and } \mathrm{sign}(f) = -1; \\ C_{-1}, & \text{if } y = -1 \text{ and } \mathrm{sign}(f) = 1, \end{cases}    (17)

where $C_1$ is the cost of a false negative, or miss, and $C_{-1}$ that of a false positive. The associated conditional risk is

    C_{C_1,C_{-1}}(\eta, f) = \eta\,C_1\,\frac{1 - \mathrm{sign}(f)}{2} + (1 - \eta)\,C_{-1}\,\frac{1 + \mathrm{sign}(f)}{2} = \begin{cases} C_{-1}(1 - \eta), & \text{if } f \geq 0; \\ C_1\,\eta, & \text{if } f < 0, \end{cases}    (18)

and is minimized by any predictor that satisfies (5) with $\eta^* = \frac{C_{-1}}{C_1 + C_{-1}}$. Examples of optimal predictors include $f^*(\eta) = (C_1 + C_{-1})\eta - C_{-1}$ and $f^*(\eta) = \log\frac{\eta C_1}{(1-\eta)C_{-1}}$. The associated optimal classifier $h^* = \mathrm{sign}[f^*]$ is the cost-sensitive Bayes decision rule, and the associated minimum conditional (cost-sensitive) risk is

    C^*_{C_1,C_{-1}}(\eta) = \eta\,C_1\,\frac{1 - \mathrm{sign}[f^*(\eta)]}{2} + (1 - \eta)\,C_{-1}\,\frac{1 + \mathrm{sign}[f^*(\eta)]}{2}    (19)

with $f^*(\eta) = (C_1 + C_{-1})\eta - C_{-1}$.

To extend the other losses used in machine learning to the cost-sensitive setting, we consider the following set of loss functions

    L_{\phi,C_1,C_{-1}}(f, y) = \phi_{C_1,C_{-1}}(yf) = \begin{cases} \phi_1(f), & \text{if } y = 1 \\ \phi_{-1}(-f), & \text{if } y = -1. \end{cases}    (20)

The associated conditional risk

    C_{\phi,C_1,C_{-1}}(\eta, f) = \eta\,\phi_1(f) + (1 - \eta)\,\phi_{-1}(-f)    (21)

is minimized by the predictor

    f^*_{\phi,C_1,C_{-1}}(\eta) = \arg\min_f C_{\phi,C_1,C_{-1}}(\eta, f),    (22)

leading to the minimum conditional risk

    C^*_{\phi,C_1,C_{-1}}(\eta) = \eta\,\phi_1(f^*_{\phi,C_1,C_{-1}}(\eta)) + (1 - \eta)\,\phi_{-1}(-f^*_{\phi,C_1,C_{-1}}(\eta)).    (23)
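Returning to the cost-sensitive zero-one case, a small numerical illustration may help fix ideas; the specific costs are chosen only for the example. Suppose a miss is more costly than a false positive, say $C_1 = 4$ and $C_{-1} = 2$. Then

\[
\eta^* = \frac{C_{-1}}{C_1 + C_{-1}} = \frac{2}{6} = \frac{1}{3}, \qquad
f^*(\eta) = (C_1 + C_{-1})\eta - C_{-1} = 6\eta - 2,
\]

so the cost-sensitive Bayes rule declares a positive whenever $\eta(x) > 1/3$ rather than $1/2$, and $f^*$ indeed satisfies (5): it is negative for $\eta < 1/3$, zero at $\eta = 1/3$, and positive for $\eta > 1/3$. The corresponding minimum risk (19) equals $C_1\eta = 4\eta$ below the threshold and $C_{-1}(1-\eta) = 2(1-\eta)$ above it, reaching its maximum value $4/3$ at $\eta = 1/3$.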
3.2. Cost-sensitive learning algorithms

It is currently not known which loss functions $\phi_i(\cdot)$ in (20) best extend the ones used in the design of cost-insensitive algorithms, so as to produce cost-sensitive extensions of boosting or SVM classifiers. We address this problem by extending the approach of (Masnadi-Shirazi & Vasconcelos, 2009).

Theorem 3. Let $g(\eta)$ be any invertible function, $J(\eta)$ any convex function, and $\phi_i(\cdot)$ determined by the following steps:

1. use (11) and (12) to obtain $I_1(\eta)$ and $I_{-1}(\eta)$, and let $C_{\phi,C_1,C_{-1}}(\eta, f)$ be defined by (21);

2. set $\phi_1(g(\eta)) = -I_1(\eta)$ and $\phi_{-1}(-g(\eta)) = -I_{-1}(\eta)$.

Then $g(\eta) = f^*_{\phi,C_1,C_{-1}}(\eta)$ if and only if $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$.

The theorem shows that any loss with components $\phi_i(\cdot)$ designed according to steps 1. and 2. satisfies (21)-(23), when $g(\eta) = f^*_{\phi,C_1,C_{-1}}(\eta)$ and $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$. This implies that it is possible to specify any pair $f^*_{\phi,C_1,C_{-1}}(\eta)$, $C^*_{\phi,C_1,C_{-1}}(\eta)$ and derive the underlying loss. The next question is how to choose the best pair of $f^*_{\phi,C_1,C_{-1}}(\eta)$ and $C^*_{\phi,C_1,C_{-1}}(\eta)$. The following theorem provides a sufficient condition for the Bayes-optimality of $f^*_{\phi,C_1,C_{-1}}(\eta)$.

Theorem 4. Any invertible predictor $f(\eta)$ with symmetry

    f^{-1}(-v) = \frac{2C_{-1}}{C_1 + C_{-1}} - f^{-1}(v)    (24)

satisfies the necessary and sufficient conditions for cost-sensitive optimality of (5) with $\eta^* = \frac{C_{-1}}{C_1 + C_{-1}}$.

Hence, the specification of $f^*_{\phi,C_1,C_{-1}}(\eta)$ as any predictor that satisfies (24) guarantees that the conditional risk is minimized by the cost-sensitive Bayes decision rule. The specification of $C^*_{\phi,C_1,C_{-1}}(\eta)$ determines the risk of the optimal classifier. The goal is to approximate as best as possible the cost-sensitive Bayes risk, given in (19). The next theorem highlights some fundamental properties of this risk.

Theorem 5. The risk of (19) has the following properties:

1. a maximum at $\eta^* = \frac{C_{-1}}{C_1 + C_{-1}}$;

2. a symmetry defined, for $\gamma \in [0, \frac{1}{C_1 + C_{-1}}]$, by

    C^*(\eta^* - \gamma C_{-1}) = C^*(\eta^* + \gamma C_1).    (25)

As noted by the following lemma, property 2. is in fact a generalization of property 1.

Lemma 6. Any concave function with the symmetry of (25) also has property 1. of Theorem 5.

Property 1. assigns the largest risk to the locations on the classification boundary. This can be seen as a minimal requirement for consistency of any $C^*_{\phi,C_1,C_{-1}}(\eta)$ with Bayesian decision theory. Enforcing property 2. further guarantees that the optimal risk has the symmetry of the cost-sensitive Bayes risk. Theorem 5 hence suggests the following risk taxonomy.

Definition 1. A minimum risk $C^*_{\phi,C_1,C_{-1}}(\eta)$ is of

1. Type-I if it satisfies property 1. but not 2. of Theorem 5;

2. Type-II if it satisfies both properties 1. and 2.

Risks of Type-II are closer approximations to the cost-sensitive Bayes risk than those of Type-I. The combination of Theorems 3-5 leads to a generic procedure for the design of cost-sensitive classification algorithms, consisting of the following steps:

1. select a predictor $f^*_{\phi,C_1,C_{-1}}(\eta)$ that satisfies (24);

2. select a concave minimum conditional risk $C^*_{\phi,C_1,C_{-1}}(\eta)$ of Type-I or Type-II, which reduces to $C^*_\phi(\eta)$ when $C_1 = C_{-1} = 1$;

3. use (11) and (12) with $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$ to obtain $I_1(\eta)$ and $I_{-1}(\eta)$;

4. find $\phi_i(\cdot)$ so that $I_1(\eta) = -\phi_1(f^*_{\phi,C_1,C_{-1}}(\eta))$ and $I_{-1}(\eta) = -\phi_{-1}(-f^*_{\phi,C_1,C_{-1}}(\eta))$;

5. derive an algorithm to minimize the conditional risk of (21).
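A minimal numerical sketch of steps 3.-4. is given below for the cost-insensitive special case $C_1 = C_{-1} = 1$, using the exponential-loss pair $C^*_\phi(\eta) = 2\sqrt{\eta(1-\eta)}$, $f^*_\phi(\eta) = \frac{1}{2}\log\frac{\eta}{1-\eta}$. It evaluates $I_1$ from a numerically differentiated $J$ and checks that $-I_1(\eta)$ recovers $\phi(v) = e^{-v}$ on a grid; the choice of example, the grid, and the finite-difference step are illustrative assumptions, not part of the derivation above.

import numpy as np

# Step 3: J = -C*_phi; here the exponential-loss minimum risk (C_1 = C_{-1} = 1).
def J(eta):
    return -2.0 * np.sqrt(eta * (1.0 - eta))

# Step 1 predictor in the cost-insensitive case: f(eta) = 0.5 * log(eta / (1 - eta)).
def f(eta):
    return 0.5 * np.log(eta / (1.0 - eta))

eta = np.linspace(0.05, 0.95, 19)
h = 1e-5
J_prime = (J(eta + h) - J(eta - h)) / (2.0 * h)   # numerical derivative of J
I1 = J(eta) + (1.0 - eta) * J_prime               # eq. (11)

# Step 4: phi_1 evaluated at v = f(eta) is -I_1(eta); for this pair it should
# recover the exponential loss phi(v) = exp(-v).
phi1_recovered = -I1
phi1_expected = np.exp(-f(eta))
print(np.max(np.abs(phi1_recovered - phi1_expected)))  # small (finite-difference error only)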
We next illustrate the practical application of this framework by showing that the cost-sensitive exponential loss of (Masnadi-Shirazi & Vasconcelos, 2007) can be derived from a minimum conditional risk of Type-I.

3.3. Cost-sensitive exponential loss

We start by recalling that AdaBoost is based on the loss $\phi(yf) = \exp(-yf)$, for which it can be shown that

    C^*_\phi(\eta) = \eta\sqrt{\tfrac{1-\eta}{\eta}} + (1-\eta)\sqrt{\tfrac{\eta}{1-\eta}} = 2\sqrt{\eta(1-\eta)} \quad\text{and}\quad f^*_\phi = \frac{1}{2}\log\frac{\eta}{1-\eta}.    (26)

A natural cost-sensitive extension is $f^*_{\phi,C_1,C_{-1}}(\eta) = \frac{1}{C_1+C_{-1}}\log\frac{\eta C_1}{(1-\eta)C_{-1}}$, which is easily shown to satisfy (24). Noting that $C^*_\phi(\eta) = \eta\exp(-f^*_\phi) + (1-\eta)\exp(f^*_\phi)$ suggests the cost-sensitive extension

    C^*_{\phi,C_1,C_{-1}}(\eta) = \eta\Big(\frac{\eta C_1}{(1-\eta)C_{-1}}\Big)^{\frac{-C_1}{C_1+C_{-1}}} + (1-\eta)\Big(\frac{\eta C_1}{(1-\eta)C_{-1}}\Big)^{\frac{C_{-1}}{C_1+C_{-1}}}.    (27)

This does not have the symmetry of (25) but satisfies property 1. of Theorem 5. Hence, it is a Type-I risk. It is also equivalent to (26) when $C_1 = C_{-1} = 1$. Finally, steps 1. and 2. of Theorem 3 produce the loss

    \phi_{C_1,C_{-1}}(yf) = \begin{cases} \exp(-C_1 f), & \text{if } y = 1 \\ \exp(C_{-1} f), & \text{if } y = -1 \end{cases}    (28)

proposed in (Masnadi-Shirazi & Vasconcelos, 2007). The resulting cost-sensitive boosting algorithm currently holds the best performance in the literature.

4. Cost sensitive SVM

We next consider the case of the cost-sensitive SVM. We start by extending the hinge loss, using the framework of the previous section, and then derive the cost-sensitive SVM optimization problem.

4.1. Cost-sensitive hinge loss

We start by recalling that the SVM minimizes the risk of the hinge loss $\phi(yf) = \lfloor 1 - yf \rfloor_+$, where $\lfloor x \rfloor_+ = \max(x, 0)$. This risk is minimized by (Zhang, 2004)

    f^*_\phi(\eta) = \mathrm{sign}(2\eta - 1),    (29)

leading to the minimum conditional risk

    C^*_\phi(\eta) = 1 - |2\eta - 1| = \eta\lfloor 1 - \mathrm{sign}(2\eta - 1)\rfloor_+ + (1 - \eta)\lfloor 1 + \mathrm{sign}(2\eta - 1)\rfloor_+.

Again, we replace the optimal cost-insensitive predictor by its cost-sensitive counterpart

    f^*_{\phi,C_1,C_{-1}}(\eta) = \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}),    (30)

which is easily shown to satisfy (5). This suggests the cost-sensitive minimum conditional risk

    C^*_{\phi,C_1,C_{-1}}(\eta) = \eta\lfloor e - d\cdot\mathrm{sign}((C_1+C_{-1})\eta - C_{-1})\rfloor_+ + (1-\eta)\lfloor b + a\cdot\mathrm{sign}((C_1+C_{-1})\eta - C_{-1})\rfloor_+,    (31)

which can be shown to satisfy (25) if and only if

    de \geq ab \quad\text{and}\quad \frac{C_{-1}}{C_1} = \frac{a+b}{d+e}.    (32)

After steps 1. and 2. of Theorem 3, this leads to the loss

    \phi_{C_1,C_{-1}}(yf) = \begin{cases} \lfloor e - df \rfloor_+, & \text{if } y = 1 \\ \lfloor b + af \rfloor_+, & \text{if } y = -1. \end{cases}    (33)

This loss has four degrees of freedom, which control the margin and slope of the hinge components associated with the two classes: positive examples are classified with margin $\frac{e}{d}$ and hinge loss slope $d$, while for negative examples the margin is $\frac{b}{a}$ and the slope is $a$.
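The sketch below is a direct transcription of (33) as a Python function, with the standard hinge loss recovered for $d = e = a = b = 1$; the parameter names mirror the text and the small numerical check at the end is only illustrative.

import numpy as np

def cs_hinge(f, y, d, e, a, b):
    """Cost-sensitive hinge loss (33) for margin values f and label y in {-1, +1}."""
    f = np.asarray(f, dtype=float)
    pos = np.maximum(e - d * f, 0.0)   # loss for y = +1: margin e/d, slope d
    neg = np.maximum(b + a * f, 0.0)   # loss for y = -1: margin b/a, slope a
    return np.where(y == 1, pos, neg)

# With d = e = a = b = 1 this reduces to the usual hinge loss max(1 - y f, 0).
f = np.linspace(-2, 2, 5)
print(cs_hinge(f, 1, 1, 1, 1, 1))    # [3. 2. 1. 0. 0.] = max(1 - f, 0)
print(cs_hinge(f, -1, 1, 1, 1, 1))   # [0. 0. 1. 2. 3.] = max(1 + f, 0)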
4.2. Cost-sensitive SVM learning

We consider the case where errors in the positive class are weighted more heavily, leading to the inequalities $\frac{b}{a} \le \frac{e}{d}$ and $d \ge a$. Choosing $e = d = C_1$ normalizes the margin of positive examples to unity ($\frac{e}{d} = 1$). Selecting $b = 1$ then fixes the scale of the negative component of the hinge loss, leading to $a = 2C_{-1} - 1$. The resulting cost-sensitive SVM minimum conditional risk is

    C^*_{\phi,C_1,C_{-1}}(\eta) = \eta\lfloor C_1 - C_1\cdot\mathrm{sign}((C_1+C_{-1})\eta - C_{-1})\rfloor_+ + (1-\eta)\lfloor 1 + (2C_{-1}-1)\cdot\mathrm{sign}((C_1+C_{-1})\eta - C_{-1})\rfloor_+    (34)

with $C_{-1} \ge 1$ and $C_1 \ge 2C_{-1} - 1$, so as to satisfy (32). Figure 1 presents plots of (34) and (33), for both $C_1 = 4$, $C_{-1} = 2$ and the cost-insensitive case of $C_1 = C_{-1} = 1$ (standard SVM). Note that, for the cost-sensitive SVM, the positive class has a unit margin, while the negative class has a smaller margin of $\frac{1}{3}$. Also, the slope of the positive component of the loss is 4, while the negative component has a smaller slope of 3. In this way, the loss assigns a higher cost to errors in the positive class when the data is not separable, while enforcing a larger margin for positive examples when the data is separable.

Replacing the standard hinge loss with (33) in the standard SVM risk (Moguerza & Munoz, 2006)

    \arg\min_{w,b} \sum_{\{i|y_i=1\}} \lfloor C_1 - C_1(w^T x_i + b)\rfloor_+ + \sum_{\{i|y_i=-1\}} \lfloor 1 + (2C_{-1}-1)(w^T x_i + b)\rfloor_+ + \mu\|w\|^2    (35)

leads to the primal problem

    \arg\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\Big[\gamma \sum_{\{i|y_i=1\}} \xi_i + \beta \sum_{\{i|y_i=-1\}} \xi_i\Big]    (36)
    \text{s.t. } w^T x_i + b \ge 1 - \xi_i, \quad y_i = 1;
    \quad w^T x_i + b \le -\kappa + \xi_i, \quad y_i = -1;

with

    \gamma = C_1, \qquad \beta = 2C_{-1} - 1, \qquad \kappa = \frac{1}{2C_{-1} - 1}.    (37)

This is a quadratic programming problem similar to that of the standard cost-insensitive SVM with soft margin weight parameter $C$. In this case, cost-sensitivity is controlled by the parameters $\gamma$, $\beta$, and $\kappa$. The parameter $\kappa$ is responsible for cost-sensitivity in the separable case. Under the constraints $C_{-1} \ge 1$, $C_1 \ge 2C_{-1} - 1$ of a Type-II risk, it imposes a smaller margin on negative examples. On the other hand, $\gamma$ and $\beta$ control the relative weights of margin violations, assigning more weight to positive violations. This allows control of cost-sensitivity when the data is not separable.

Obviously, this primal problem could have been defined through heuristic arguments. However, it would be difficult to justify precise choices for the parameters of (37). Furthermore, the derivation above guarantees that the optimal classifier implements the Bayes decision rule of (5) with $\eta^* = \frac{C_{-1}}{C_1+C_{-1}}$, and that its risk is a Type-II approximation to the cost-sensitive Bayes risk. No such guarantees would be possible for a heuristic solution.

To obtain some intuition about the cost-sensitive extension, we consider the synthetic problem of Figure 1, where the two classes are linearly separable. The figure shows three separating lines. The green line is an arbitrary separating line that does not maximize the margin. The red line is the standard SVM solution, which has maximum margin and is equally distant from the nearest examples of the two classes. The blue line is the solution of (36) for $C_1 = 4$ and $C_{-1} = 2$ (the $C$ parameter is irrelevant when the data is separable). It is also a maximum margin solution, but trades off the distance to positive and negative examples so as to enforce a larger positive margin, as specified. Overall, an increase in $C_{-1}$ guarantees a larger positive margin. For a given $C_{-1}$, increasing $C_1$ (so that $C_1 \ge 2C_{-1} - 1$) increases the cost of errors on positive examples, enabling control of the miss rate when the classes are not separable.

Finally, the dual and kernelized formulation of the cost-sensitive SVM can be obtained with the standard procedures, leading to

    \arg\max_\alpha \; \sum_i \alpha_i\Big(\frac{y_i + 1}{2} - \frac{y_i - 1}{2(2C_{-1}-1)}\Big) - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j K(x_i, x_j)    (38)
    \text{s.t. } \sum_i \alpha_i y_i = 0;
    \quad 0 \le \alpha_i \le C\,C_1, \quad y_i = 1;
    \quad 0 \le \alpha_i \le C(2C_{-1} - 1), \quad y_i = -1.

This reduces to the standard SVM dual when $C_1 = C_{-1} = 1$. Note that the derivation of the cost-sensitive SVM from a suitable loss function leads to an algorithm that performs well regardless of the separability of the data and of the slack penalty, unlike the previous BM-SVM and BP-SVM algorithms. The improved performance of the CS-SVM on real world data sets is demonstrated in the next section.
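Before turning to the experiments, the sketch below makes the connection between the loss (33) and the optimization (35) concrete: it minimizes the regularized empirical risk (35) for a linear predictor by plain subgradient descent on synthetic data. The step size, iteration count, regularization weight mu, and toy data are illustrative choices, and the paper itself solves the equivalent QP formulations (36)/(38) rather than this rough descent scheme.

import numpy as np

def train_cs_svm_primal(X, y, C1=4.0, Cm1=2.0, mu=0.1, lr=0.01, iters=2000):
    """Subgradient descent on the cost-sensitive SVM risk (35), linear model."""
    n, dim = X.shape
    a = 2.0 * Cm1 - 1.0                       # negative-class slope; margin 1/a
    w, b = np.zeros(dim), 0.0
    for t in range(iters):
        m = X @ w + b                         # margins w^T x_i + b
        gw, gb = 2.0 * mu * w, 0.0            # gradient of mu * ||w||^2
        pos = (y == 1) & (C1 - C1 * m > 0)    # active positive hinge terms
        neg = (y == -1) & (1.0 + a * m > 0)   # active negative hinge terms
        gw += -C1 * X[pos].sum(axis=0) + a * X[neg].sum(axis=0)
        gb += -C1 * pos.sum() + a * neg.sum()
        step = lr / np.sqrt(t + 1.0)          # decaying subgradient step
        w -= step * gw
        b -= step * gb
    return w, b

# Toy separable data: the learned boundary leaves a larger margin on positives.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2.5, rng.randn(50, 2) - 2.5])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = train_cs_svm_primal(X, y)
print("training error:", np.mean(np.sign(X @ w + b) != y))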
[Figure 1. Left: concave $C^*_{\phi,C_1,C_{-1}}(\eta)$ and corresponding cost-sensitive SVM loss, top: $C_1 = 4$, $C_{-1} = 2$; bottom: $C_1 = C_{-1} = 1$. Right: linearly separable cost-sensitive SVM, showing the positive and negative classes, the standard SVM, the cost-sensitive SVM, and a non-maximum-margin classifier.]

5. Experimental results

The performance of the CS-SVM was evaluated with two sets of experiments. The first was based on ten binary UCI data sets (Newman et al., 1998): Pima-diabetes, breast cancer diagnostic, breast cancer prognostic, original Wisconsin breast cancer, liver disorder, sonar, echo-cardiogram, Cleveland heart disease, tic-tac-toe, and Haberman's survival. The goal was to learn the SVM of lowest total error rate, given a target detection rate. In all cases, leave-one-out cross validation was used to find the best cost estimate. We considered detection rates between 80% and 95%, with increments of 2.5%, and set $C$, $C_1$, $C_{-1}$ and $b$ (SVM threshold) for each method so as to achieve the smallest false positive rate on the validation set. The total error was computed for each detection rate, and the mean of these errors is reported in Table 2. Results are reported for the proposed CS-SVM, the BM-SVM (Karakoulas & Shawe-Taylor, 1999) and the BP-SVM (Bach et al., 2006; Lin et al., 2002; Davenport et al., 2006; Wu & Srihari, 2003; Chang & Lin, 2001).

Table 2. Mean error for each UCI data set and cost-sensitive SVM method.

Dataset   CS-SVM   BP-SVM   BM-SVM
Survive   195.8    199.6    201.8
Liver     163.8    167.2    169.2
Echo       40       43       45
Pima      313.2    416      416
Wisc       33.2     32.8     32.8
Tic       536      536      538
Heart      68.4     69.4     73.2
Diag       33.8     33.8     33.8
Prog      107.2    115.2    126
Sonar      65.6     75.2     76.4

While the table confirms the previous observation that the BP-SVM outperforms the BM-SVM (Bach et al., 2006; Lin et al., 2002; Davenport et al., 2006; Wu & Srihari, 2003; Chang & Lin, 2001), neither of them matches the CS-SVM. This is most interesting given the fact that the CS-SVM has the same computational complexity and number of tuning parameters as the BP-SVM. Overall, the CS-SVM has the smallest error on 7 of the 10 datasets, sometimes by a very substantial margin. CS-SVM and BP-SVM have equal error on 2 datasets, and BP- and BM-SVM have a slight advantage on the Wisconsin set.

The second set of experiments was based on the German Credit data set (Geibel et al., 2004; Newman et al., 1998). This dataset has 700 examples of good credit customers and 300 examples of bad credit customers. Each example is described by 24 attributes, and the goal is to identify bad customers, to be denied credit. This data set is particularly interesting for cost-sensitive learning because it provides a cost matrix for the different types of errors. Classifying a good credit customer as bad (a false positive) incurs a loss of 1. Classifying a bad credit customer as good (a miss) incurs a loss of 5. Hence, on this dataset, the leave-one-out cross validation of the CS-SVM and BP-SVM parameters was subject to the constraint $\frac{C_1}{C_{-1}} = 5$. A cost-insensitive SVM was also trained. Table 1 presents the loss achieved by each method.

Table 1. Total loss in $ for each method on the German Credit dataset.

Method    CS-SVM   BP-SVM   SVM
Loss ($)   550      878     878

Note that the BP-SVM does not produce any improvement with respect to the cost-insensitive SVM. On the other hand, the loss achieved with the CS-SVM is $328 smaller, i.e. a substantial cost reduction of 37.36%.
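As a reference for how the entries of Table 1 are scored, the snippet below computes the total dollar loss of a set of predictions under the German Credit cost matrix (1 for a false positive, 5 for a miss); the label convention, with the "bad customer" class as +1, and the example vectors are assumptions made only for illustration.

import numpy as np

def german_credit_loss(y_true, y_pred, miss_cost=5.0, fp_cost=1.0):
    """Total cost under the German Credit cost matrix; +1 = bad customer (target)."""
    misses = np.sum((y_true == 1) & (y_pred == -1))      # bad customer accepted
    false_pos = np.sum((y_true == -1) & (y_pred == 1))   # good customer denied
    return miss_cost * misses + fp_cost * false_pos

# Example: 3 missed bad customers and 4 wrongly denied good customers -> cost 19.
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, -1, -1, -1, -1, -1, 1, 1, 1, 1])
print(german_credit_loss(y_true, y_pred))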
6. Conclusion

In this work, we have extended the recently introduced probability elicitation view of loss function design to the cost-sensitive classification problem. This extension was applied to the SVM problem, so as to produce a cost-sensitive hinge loss function. A cost-sensitive SVM learning algorithm was then derived, as the minimizer of the associated risk. Unlike previous SVM algorithms, the one now proposed enforces cost sensitivity for both separable and non-separable training data, enforcing a larger margin for the preferred class, independent of the choice of slack penalty. It also offers guarantees of optimality, namely classifiers that implement the cost-sensitive Bayes decision rule and approximate the cost-sensitive Bayes risk. Empirical evidence confirms its superior performance, when compared to previous methods.

References

Akbani, Rehan, Kwek, Stephen, and Japkowicz, Nathalie. Applying support vector machines to imbalanced datasets. In European Conference on Machine Learning (ECML), pp. 39-50, 2004.

Amari, S. and Wu, S. Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12(6):783-789, 1999.

Bach, Francis R., Heckerman, David, and Horvitz, Eric. Considering cost asymmetry in learning classifiers. The Journal of Machine Learning Research, 7:1713-1741, 2006.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: a library for support vector machines, 2001.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20:273-297, 1995.

Davenport, M. A., Baraniuk, R. G., and Scott, C. D. Controlling false alarms with support vector machines. In ICASSP, 2006.

Elkan, C. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, 2001.

Geibel, Peter, Brefeld, Ulf, and Wysotzki, Fritz. Perceptron and SVM learning with generalized cost models. Intelligent Data Analysis, 8:439-455, 2004.

Karakoulas, G. and Shawe-Taylor, J. Optimizing classifiers for imbalanced training sets. In NIPS, 1999.

Kubat, Miroslav and Matwin, Stan. Addressing the curse of imbalanced training sets: One-sided selection. In ICML, pp. 179-186, 1997.

Lin, Yi, Lee, Yoonkyung, and Wahba, Grace. Support vector machines for classification in nonstandard situations. Machine Learning, 46:191-202, 2002.

Masnadi-Shirazi, Hamed and Vasconcelos, Nuno. Asymmetric boosting. In ICML, 2007.

Masnadi-Shirazi, Hamed and Vasconcelos, Nuno. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS, pp. 1049-1056, 2009.

Moguerza, Javier M. and Munoz, Alberto. Support vector machines with applications. Statistical Science, 21:322, 2006.

Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. UCI repository of machine learning databases, 1998.

Platt, J. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, 2000.

Savage, Leonard J. The elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66:783-801, 1971.

Wu, G. and Chang, E. Adaptive feature-space conformal transformation for imbalanced data learning. In ICML, pp. 816-823, 2003.

Wu, G. and Chang, E. KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Transactions on Knowledge and Data Engineering, 17:786-795, 2005.

Wu, X. and Srihari, R. New ν-support vector machines and their sequential minimal optimization. In ICML, 2003.

Zhang, Tong. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 2004.