-Support Vector Machine as Conditional Value-at-Risk Minimization Akiko Takeda Keio University, 3-14-1 Hiyoshi, Kouhoku, Yokohama, Kanagawa 223-8522, Japan takeda@ae.keio.ac.jp Masashi Sugiyama sugi@cs.titech.ac.jp Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan Abstract The -support vector classification ( -SVC) algorithm was shown to work well and provide intuitive interpretations, e.g., the parameter roughly specifies the fraction of support vectors. Although corresponds to a fraction, it cannot take the entire range between 0 and 1 in its original form. This problem was settled by a non-convex extension of -SVC and the extended method was experimentally shown to generalize better than original -SVC. However, its good generalization performance and convergence properties of the optimization algorithm have not been studied yet. In this paper, we provide new theoretical insights into these issues and propose a novel -SVC algorithm that has guaranteed generalization performance and convergence properties. data separation error (Cortes & Vapnik, 1995). This soft-margin formulation is commonly referred to as C SVC since the trade-off is controlled by the parameter C . C -SVC was shown to work very well in a wide range of real-world applications (Scholkopf & Smola, ¨ 2002). An alternative formulation of the soft-margin idea is SVC (Scholkopf et al., 2000)--instead of the parameter ¨ C , -SVC involves another trade-off parameter that roughly specifies the fraction of support vectors (or sparseness of the solution). Thus, the -SVC formulation provides us richer interpretation than the original C -SVC formulation, which would be potentially useful in real applications. Since the parameter corresponds to a fraction, it should be able to be chosen between 0 and 1. However, it was shown that admissible values of are actually limited (Crisp & Burges, 2000; Chang & Lin, 2001). To cope with this problem, Perez-Cruz et al. (2003) introduced the notion of negative margins and proposed extended -SVC (E -SVC) which allows to take the entire range between 0 and 1. They also experimentally showed that the generalization performance of E -SVC is often better than that of original -SVC. Thus the extension contributes not only to elucidating the theoretical property of -SVC, but also to improving its generalization performance. However, there remain two open issues in E -SVC. The first issue is that the reason why a high generalization performance can be obtained by E -SVC was not completely explained yet. The second issue is that the optimization problem involved in E -SVC is nonconvex and theoretical convergence properties of the E -SVC optimization algorithm have not been studied yet. The purpose of this paper is to provide new theoretical insights into these two issues. After reviewing existing SVC methods in Section 2, we 1. Intro duction Support vector classification (SVC) is one of the most successful classification algorithms in modern machine learning (Scholkopf & Smola, 2002). SVC finds a hy¨ perplane that separates training samples in different classes with maximum margin (Boser et al., 1992). The maximum margin hyperplane was shown to minimize an upper bound of the generalization error according to the Vapnik-Chervonenkis theory (Vapnik, 1995). Thus the generalization performance of SVC is theoretically guaranteed. SVC was extended to be able to deal with nonseparable data by trading the margin size with the Appearing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s). -Support Vector Machine as Conditional Value-at-Risk Minimization elucidate the generalization performance of E -SVC in Section 3. We first show that the E -SVC formulation could be interpreted as minimization of the conditional value-at-risk (CVaR), which is often used in finance (Rockafellar & Uryasev, 2002; Gotoh & Takeda, 2005). Then we give new generalization error bounds based on the CVaR risk measure. This theoretical result justifies the use of E -SVC. In Section 4, we address non-convexity of the E -SVC optimization problem. We first give a new optimization algorithm that is guaranteed to converge to one of the local optima within a finite number of iterations. Based on this improved algorithm, we further show that the global solution can be actually obtained within finite iterations even though the optimization problem is non-convex. Finally, in Section 5, we give concluding remarks and future prospects. Proofs of all theorems and lemmas are sketched in Appendix unless mentioned. 2.2. Supp ort Vector Classification The Vapnik-Chervonenkis theory (Vapnik, 1995) showed that a large margin classifier has a small generalization error. Motivated by this theoretical result, Boser et al. (1992) developed an algorithm for finding the hyperplane (w, b) with maximum margin: 1 min w w,b 2 2 s.t. yi ( w, xi + b) 1, i M . (2) This is called (hard-margin) support vector classification (SVC) and valid when the training samples are linearly separable. In the following, we omit "i M " in the constraint for brevity. 2.3. C -Supp ort Vector Classification Cortes and Vapnik (1995) extended the SVC algorithm to non-separable cases and proposed trading the margin size with the data separation error (i.e., "softmargin"): 1 min w w,b, 2 2 2. Supp ort Vector Classification In this section, we formulate the classification problem and briefly review support vector algorithms. 2.1. Classification Problem Let us address the classification problem of learning a decision function h from X ( IRn ) to {±1} based on training samples (xi , yi ) (i M := {1, ..., m}). We assume that the training samples are i.i.d. following the unknown probability distribution P (x, y ) on X × {±1}. The goal of the classification task is to obtain a classifier h that minimizes the generalization error (or the risk): 1 R[h] := 2 |h(x) - y |dP (x, y ), which corresponds to the misclassification rate for unseen test samples. +C s.t. yi ( w, xi + b) 1 - i , im i i 0, =1 where C (> 0) controls the trade-off. This formulation is usually referred to as C -SVC, and was shown to work very well in various real-world applications (Scholkopf ¨ & Smola, 2002). 2.4. -Supp ort Vector Classification -SVC is another formulation of soft-margin SVC (Scholkopf et al., 2000): ¨ 1 min w w,b,, 2 2 - + s.t. yi ( w, xi + b) - i , m 1i i m =1 i 0, 0, where ( IR) is the trade-off parameter. Scholkopf et al. (2000) showed that if the -SVC solu¨ tion yields > 0, C -SVC with C = 1/(m) produces the same solution. Thus -SVC and C -SVC are equivalent. However, -SVC has additional intuitive interpretations, e.g., is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors (i.e., sparseness of the solution). Thus, the -SVC formulation would be potentially more useful than the C -SVC formulation in real applications. 2.5. E -SVC Although has an interpretation as a fraction, it cannot always take its full range between 0 and 1 (Crisp & Burges, 2000; Chang & Lin, 2001). For the sake of simplicity, we generally focus on linear classifiers, i.e., h(x) = sign( w, x + b), (1) where w ( IRn ) is a non-zero normal vector, b ( IR) is a bias parameter, and sign( ) = 1 if 0 and -1 otherwise. Most of the discussions in this paper can be directly applicable to non-linear kernel classifiers (Scholkopf & ¨ Smola, 2002). Thus we may not lose generality by restricting ourselves to linear classifiers. -Support Vector Machine as Conditional Value-at-Risk Minimization 2.5.1. Admissible Range of C For an optimal solution {i }m 1 of dual C -SVC, let i= m 1i C , C m =1 i By this modification, a non-trivial solution can be obtained even for [0, min ]. This modified formulation is called extended -SVC (E -SVC). The E -SVC optimization problem is non-convex due to the equality constraint w 2 = 1. Perez-Cruz et al. (2003) proposed the following iterative algorithm for computing a solution. First, for some initial w, solve the problem (3) with w 2 = 1 replaced by w, w = 1. Then, using the optimal solution w, update w by for = 9/10, and iterate this procedure until convergence. Perez-Cruz et al. (2003) experimentally showed that the generalization performance of E -SVC with [0, min ] is often better than that with (min , max ], implying that E -SVC is a promising classification algorithm. However, it is not clear how the notion of negative margins influences on the generalization performance and how fast the above iterative algorithm converges. The goal of this paper is to give new theoretical insights into these issues. w - w + (1 - )w (4) (C ) := min := lim (C ) and max := lim (C ). C C 0 Then, Chang and Lin (2001) showed that for (min , max ], the optimal solution set of -SVC is the same as that of C -SVC with some C (not necessarily unique). In addition, the optimal ob jective value of -SVC is strictly negative. However, for (max , 1], -SVC is unbounded, i.e., there exists no solution; for [0, min ], -SVC is feasible with zero optimal objective value, i.e., we end up with just having a trivial solution (w = 0 and b = 0). 2.5.2. Increasing Upper Admissible Range It was shown by Crisp and Burges (2000) that max = 2 min(m+ , m- )/m, where m+ and m- are the number of positive and negative training samples. Thus, when the training samples are balanced (i.e., m+ = m- ), max = 1 and therefore can reach its upper limit 1. When the training samples are imbalanced (i.e., m+ = m- ), Perez-Cruz et al. (2003) proposed modifying the optimization problem of -SVC as 1 min w w,b,, 2 2 3. Justification of the E -SVC Criterion In this section, we give a new interpretation of E -SVC and theoretically explain why it works well. 3.1. New Interpretation of E -SVC as CVaR minimization Let f (w, b; x, y ) be the margin error for a sample (x, y ): y ( w, x + b) . f (w, b; x, y ) := - w Let us consider the distribution of margin errors over all training samples: (|w, b) := P {(xi , yi ) | f (w, b; xi , yi ) }. For [0, 1), let (w, b) be the 100 -percentile of the margin error distribution: (w, b) := min{ | (|w, b) }. Thus only the fraction (1 - ) of the margin error f (w, b; xi , yi ) exceeds the threshold (w, b) (see Figure 1). (w, b) is commonly referred to as the valueat-risk (VaR) in finance and is often used by security houses or investment banks to measure the market risk of their asset portfolios (Rockafellar & Uryasev, 2002; Gotoh & Takeda, 2005). s.t. yi ( w, xi + b) - i , 1i - + m+ :y i 1i i + m- :y =1 i 0, i i =-1 0, i.e., the effect of positive and negative samples are balanced. Under this modified formulation, max = 1 holds even when training samples are imbalanced. For the sake of simplicity, we assume m+ = m- in the rest of this paper; when m+ = m- , all the results can be simply extended in a similar way as above. 2.5.3. Decreasing Lower Admissible Range When [0, min ], -SVC produces a trivial solution (w = 0 and b = 0) as shown in Chang and Lin (2001). To prevent this, Perez-Cruz et al. (2003) proposed allowing the margin to be negative and enforcing the norm of w to be unity: m 1i min - + i m =1 w,b, , s.t. yi ( w, xi + b) - i , i 0, w 2 = 1. (3) -Support Vector Machine as Conditional Value-at-Risk Minimization frequency probability : Non-convex Convex Figure 1. An example of the distribution of margin errors f (w, b; xi , yi ) over all training samples. (w, b) is the 100 -percentile called the value-at-risk (VaR), and the mean (w, b) of the -tail distribution is called the conditional VaR (CVaR). Figure 2. A profile of the CVaR 1- (w , b ) as a function of . As shown in Section 4, the E -SVC optimization problem can be cast as a convex problem if ( , max ], while it is essentially non-convex if (0, ). 3.2. Justification of E -SVC We have shown the equivalence between E -SVC and minCVaR. Here we derive new bounds of the generalization error based on the notion of CVaR and try to justify the use of E -SVC. When training samples are linearly separable, the margin error f (w, b; xi , yi ) is negative for all samples. Then, at the optimal solution (w , b ), the CVaR 1- (w , b ) is always negative. However, in nonseparable cases, 1- (w , b ) could be positive particularly when is close to 0. Regarding the CVaR, we have the following lemma. Lemma 2 1- (w , b ) is continuous with respect to and is strictly decreasing when is increased. Let be such that 1- (w , b ) = 0 if such exists; we set = max if 1- (w , b ) > 0 for all and we set = 0 if 1- (w , b ) < 0 for all . Then we have the following relation (see Figure 2): 1- (w , b ) < 0 for ( , max ], 1- (w , b ) > 0 for (0, ). Below, we analyze the generalization error of E -SVC depending on the value of . 3.2.1. Justification When ( , max ] Theorem 3 Let ( , max ]. Suppose that support X is in a bal l of radius R around the origin. Then, for al l (w, b) such that w = 1 and 1- (w, b) < 0, there exists a positive constant c such that the fol lowing bound hold with probability at least 1 - : R[h] + G(1- (w, b)), where G( ) = 2 (7) Let us consider the -tail distribution of f (w, b; xi , yi ): 0 for < (w, b), (|w, b) := (|w,b)- for (w, b). 1- Let (w, b) be the mean of the -tail distribution of f (w, b; xi , yi ) (see Figure 1 again): (w, b) := E [f (w, b; xi , yi )], where E denotes the expectation over the distribution . (w, b) is called the conditional VaR (CVaR). By definition, the CVaR is always larger than or equal to the VaR: (w, b) (w, b). (5) Let us consider the problem of minimizing the CVaR (w, b) (which we refer to as minCVaR): min (w, b). w,b Then we have the following theorem. Theorem 1 The solution of the minCVaR problem (6) is equivalent to the solution of the E -SVC problem (3) with = 1 - . Theorem 1 shows that E -SVC actually minimizes the CVaR 1- (w, b). Thus, E -SVC could be interpreted as minimizing the mean margin error over a set of "bad" training samples. In contrast, the hardmargin SVC problem (2) can be equivalently expressed in terms of the margin error as min max f (w, b; xi , yi ). w,b iM Thus hard-margin SVC minimizes the margin error of the single "worst" training sample. This analysis shows that E -SVC can be regarded as an extension of hard-margin SVC to be less sensitive to an outlier (i.e., the single "worst" training sample). (6) m 42 2 . c (R + 1)2 2 log2 (2m) - 1 + log 2 -Support Vector Machine as Conditional Value-at-Risk Minimization The generalization error bound in (7) is furthermore upper-bounded as + G(1- (w, b)) + G(1- (w, b)). G( ) is monotone decreasing as | | increases. Thus, the above theorem shows that when 1- (w, b) < 0, the upper bound + G(1- (w, b)) is lowered if the CVaR 1- (w, b) is reduced. Since E -SVC minimizes 1- (w, b) (see Theorem 1), the upper bound of the generalization error is also minimized. 3.2.2. Justification When (0, ] Our discussion below depends on the sign of 1- (w, b). When 1- (w, b) < 0, we have the following theorem. Theorem 4 Let (0, ]. Then, for al l (w, b) such that w = 1 and 1- (w, b) < 0, there exists a positive constant c such that the fol lowing bound holds with probability at least 1 - : R[h] + G(1- (w, b)). A proof of the above theorem is omitted since the proof follows a similar line to the proof of Theorem 3. This theorem shows that when 1- (w, b) < 0, the upper bound + G(1- (w, b)) is lowered if 1- (w, b) is reduced. On the other hand, Eq.(5) shows that the VaR 1- (w, b) is upper-bounded by the CVaR 1- (w, b). Therefore, minimizing 1- (w, b) by E SVC may have an effect of lowering the upper bound of the generalization error. When 1- (w, b) > 0, we have the following theorem. Theorem 5 Let (0, ]. Then, for al l (w, b) such that w = 1 and 1- (w, b) > 0, there exists a positive constant c such that the fol lowing bound hold with probability at least 1 - : R[h] - G(1- (w, b)). Moreover, the lower bound of R[h] is bounded from above as - G(1- (w, b)) - G(1- (w, b)). A proof of the above theorem is also omitted since the proof resembles to Theorem 3. Theorem 5 implies that the lower bound - G(1- (w, b)) of the generalization error is upper-bounded by - G(1- (w, b)). On the other hand, Eq.(5) and 1- (w, b) > 0 yields 1- (w, b) > 0. Thus minimizing 1- (w, b) by E SVC may contribute to lowering the lower bound - G(1- (w, b)) of the generalization error. 4. New Optimization Algorithm As reviewed in Section 2.5, E -SVC involves a nonconvex optimization problem. In this section, we give a new efficient optimization procedure for E -SVC. Our proposed procedure involves two optimization algorithms depending on the value of . We first describe the two algorithms and then show how these two algorithms are chosen for practical use. 4.1. Optimization When ( , max ] Lemma 6 When ( , max ], the E -SVC problem (3) is equivalent to m 1i i min - + m =1 w,b,, s.t. yi ( w, xi + b) - i , i 0, w 2 1. (8) This lemma shows that the equality constraint w 2 = 1 in the original problem (3) can be replaced by w 2 1 without changing the solution. Due to the convexity of w 2 1, the above optimization problem is convex and therefore we can easily obtain the global solution by a standard optimization software. 4.2. Optimization When (0, ] If (0, ], the E -SVC optimization problem is essentially non-convex and therefore we need a more elaborate algorithm. 4.2.1. Local Optimum Search Here, we propose the following iterative algorithm for finding a local optimum. Algorithm 7 (The E -SVC lo cal search algorithm for (0, ]) optimum Step 1: Initialize w. Step 2: Solve the following linear program: m 1i min - + i m =1 w,b,, (9) s.t. yi ( w, xi + b) - i , i 0, w, w = 1, and let the optimal solution be (w, b, , ). Step 3: If w = w, terminate and output w. Otherwise, update w by w - w/ w . Step 4: Repeat Steps 2­3. The linear program (9) is the same as the one proposed by Perez-Cruz et al. (2003), i.e., the equality constrained w 2 = 1 of the original problem (3) is -Support Vector Machine as Conditional Value-at-Risk Minimization replaced by w, w = 1. The updating rule of w in Step 3 is different from the one proposed by PerezCruz et al. (2003) (cf. Eq.(4)). We define a "corner" (or "0-dimensional face") of E SVC (3) as the intersection of an edge of the polyhedral cone formed by linear constraints of (3) and w 2 = 1. Under the new update rule, the algorithm visits a corner of E -SVC (3) in each iteration. Since E -SVC has finite corners, we can show that Algorithm 7 with the new update rule terminates in a finite number of iterations, i.e., less than or equal to the number of corners of E -SVC. Theorem 8 Algorithm 7 terminates within a finite number of iterations of Steps 2­3. Furthermore, a solution of the modified E -SVC algorithm is a local minimizer if it is unique and non-degenerate. 4.2.2. Global Optimum Search Next, we show that the global solution can be actually obtained within finite iterations, despite the nonconvexity of the optimization problem. A naive approach to searching for the global solution is to run the local optimum search algorithm many times with different initial values and choose the best local solution. However, there is no guarantee that this naive approach can find the global solution. Below, we give a more systematic way to find the global solution based on the following lemma. Lemma 9 When (0, ], the E -SVC problem (3) is equivalent to m 1i min - + i m =1 w,b,, (b) Facial cut Concavity cut Figure 3. A 0-dimensional face (a) and three proper faces e (bold solid lines) of D are identified in D. If the corner (a) is found in Step 2, a concavity cut is constructed. If the corner (b) is found, a facial cut is constructed. If these two e cuts are added to D, the remaining area includes no face of D. (a) Let D be the feasible set of E -SVC (3). Below, we summarize the E -SVC training algorithm based on the cutting plane method, which is an efficient method of tracing faces. Algorithm 10 (The E -SVC global optimum search algorithm for (0, ]) Step 1: D - D. Step 2: Find a local solution by Algorithm 7. Step 3: Identify a face of D in D that corresponds the local solution. Step 4a: If the face is a corner, construct a "concavity cut". Step 4b: If the face is a proper face, construct a "facial cut". Step 5: Add the cut to the problem (9) and D. Step 6: Repeat Steps 2­5 until D includes no face of D. Step 7: Output the best local optimal solution as the global solution. If the local solution obtained in Step 2 is a corner of D (i.e., the local solution is not on any cutting plane as (a) in Figure 3), a concavity cut (Horst & Tuy, 1995) is constructed. The concavity cut has a role of removing the local solution, i.e., a 0-dimensional face of D and its neighborhood. Otherwise, a facial cut (Ma jthay & Whinston, 1974) is constructed to eliminate the proper face (see (b) in Figure 3). Since the total number of distinct faces of D is finite in the current setting and a facial cut or a concavity cut eliminates at least one face at a time, Algorithm 10 is guaranteed to terminate within finite iterations (precisely, less than or equal to the number of all dimensional faces of E -SVC). Furthermore, since the addition of a concavity cut or a facial cut does not remove local solutions which are better than the best local solution found so far, Algorithm 10 is guaranteed to s.t. yi ( w, xi + b) - i , i 0, w 2 1. (10) Lemma 9 could be proved in a similar way as Lemma 6, so we omit the proof. This lemma shows that the equality constraint w 2 = 1 in the original E -SVC problem (3) can be replaced by w 2 1 without changing the solution if (0, ]. The problem (10) is called a linear reverse convex program (LRCP), which is a class of non-convex problems consisting of linear constraints and one concave inequality ( w 2 1 in the current case). The feasible set of the problem (10) consists of a finite number of faces. For LRCPs, Horst and Tuy (1995) showed that the local optimal solutions correspond to 0-dimensional faces (or corners). This implies that all the local optimal solutions of the E -SVC problem (10) can be traced by checking all the faces. -Support Vector Machine as Conditional Value-at-Risk Minimization trace all sufficient local solutions. Thus we can always find a global solution within finite iterations by Algorithm 10. A more detailed discussion on the concavity cut and the facial cut is shown in Horst and Tuy (1995) and Ma jthay and Whinston (1974), respectively. 4.3. Choice of Two Algorithms We have two convergent algorithms when ( , max ] and (0, ]. Thus, choosing a suitable algorithm depending on the value of would be an ideal procedure. However, the value of the threshold is difficult to explicitly compute since it is defined via the optimal value 1- (w , b ) (see Figure 2). Therefore, it is not straightforward to choose a suitable algorithm for a given . When we use E -SVC in practice, we usually compute the solutions for several different values of and choose the most promising one based on, e.g., crossvalidation. In such scenarios, we can properly switch two algorithms without explicitly knowing the value of --our key idea is that the solution of the problem (8) is non-trivial (i.e., w = 0) if and only if ( , max ]. Thus if the solutions are computed from large to small , the switching point can be identified by checking the triviality of the solution. The proposed algorithm is summarized as follows. Algorithm 11 (The E -SVC (max ) 1 > 2 > · · · > k > 0) algorithm for the other hand, the problem (8) and -SVC have the zero optimal value in (0, ] and [0, min ], respectively. Thus, although the definitions of and min are different, they would be essentially the same. We will study the relation between and min in more detail in the future work. References Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. COLT (pp. 144­152). ACM Press. Chang, C.-C., & Lin, C.-J. (2001). Training -support vector classifiers: Theory and algorithms. Neural Computation, 13, 2119­2147. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273­297. Crisp, D. J., & Burges, C. J. C. (2000). A geometric interpretation of -SVM classifiers. NIPS 12 (pp. 244­250). MIT Press. Gotoh, J., & Takeda, A. (2005). A linear classification model based on conditional geometric score. Pacific Journal of Optimization, 1, 277­296. Horst, R., & Tuy, H. (1995). Global optimization: Deterministic approaches. Berlin: Springer-Verlag. Ma jthay, A., & Whinston, A. (1974). Quasi-concave minimization sub ject to linear constraints. Discrete Mathematics, 9, 35­59. Perez-Cruz, F., Weston, J., Hermann, D. J. L., & Scholkopf, B. (2003). Extension of the -SVM range for ¨ classification. Advances in Learning Theory: Methods, Models and Applications 190 (pp. 179­196). Amsterdam: IOS Press. Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443­1472. Scholkopf, B., Smola, A., Williamson, R., & Bartlett, P. ¨ (2000). New support vector algorithms. Neural Computation, 12, 1207­1245. Scholkopf, B., & Smola, A. J. (2002). Learning with ker¨ nels. Cambridge, MA: MIT Press. Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag. Step 1: i - 1. Step 2: Compute (w , b ) for i by solving (8). Step 3a: If w = 0, accept (w , b ) as the solution for i , increment i, and go to Step 2. Step 3b: If w = 0, reject (w , b ). Step 4: Compute (w , b ) for i by Algorithm 10. Step 5: Accept (w , b ) as the solution for i , increment i, and go to Step 4 unless i > k . 5. Conclusions We characterized the generalization error of E -SVC in terms of the conditional value-at-risk (CVaR, see Figure 1) and showed that a good generalization performance is expected by E -SVC. We then derived a globally convergent optimization algorithm even though the optimization problem involved in E -SVC is nonconvex. We introduced the threshold based on the sign of the CVaR (see Figure 2). We can check that the problem (8) is equivalent to -SVC in the sense that they share the same negative optimal value in ( , max ] and (min , max ], respectively (Gotoh & Takeda, 2005). On A. Sketch of Pro of of Theorem 1 Let (w , b , ) be the optimal solution of min F (w, b, ), w,b, where, for [X ]+ := max{X, 0}, P [f (w, b; xi , yi ) - ]+ . F (w, b, ) := + iM (1 - )m (11) (12) -Support Vector Machine as Conditional Value-at-Risk Minimization Then Rockafellar and Uryasev (2002) showed that F (w , b , ) = (w , b ) = min (w, b), w,b i.e, the problems (6) and (11) are equivalent. Introducing slack variables i , imposing w = 1 (which does not change the solution essentially; only the scale is changed), and letting = 1 - and = - in Eq.(11), we establish the theorem. 2 Now let us set (13) Then we can show that 1 |{i m = -1- (w, b). | yi ( w, xi + b) < -1- (w, b)}| . We omit its proof due to lack of space. Then we obtain the upper bound + G(1- (w, b)); the upper bound + G(1- (w, b)) is clear from Eq.(5). B. Sketch of Pro of of Lemma 2 Since Eq.(11) only involves continuous functions, continuity of F (w , b , ) with respect to is clear. From Eq.(13), (w , b ) is also continuous. Let (w i , bi , i ) be the optimal solutions of Eq.(11) for 0 < 1 < 2 < 1. Then we have 1 (w 1 , b 1 ) = F1 (w 1 , b 1 , 1 ) F1 (w 2 , b 2 , 2 ) < F2 (w 2 , b 2 , 2 ) = 2 (w 2 , b 2 ), D. Sketch of Pro of of Lemma 6 Since the difference between the problems (3) and (8) is only the norm constraint of w, it is enough to show that for ( , max ], w 2 = 1 holds at the optimal solution (w , b , , ) of the problem (8). For such , 1- (w , b ) < 0 holds, i.e., the optimal value of E -SVC is negative. If we suppose w 2 < 1, another feasible solution (w , b , , )/ w achieves a smaller optimal value than (w , b , , ). This contradicts to the optimality of (8), and hence w 2 = 1 is proved. where the first inequality is due to optimality of (w 1 , b 1 , 1 ) and the second strict inequality is clear from Eq.(12). Thus (w , b ) is strictly increasing with respect to , implying that 1- (w , b ) is strictly decreasing with respect to . E. Sketch of Pro of of Theorem 8 bbb Let (wk , bk , bk , k ) be an optimal solution of the linear program (9) in the k-th iteration. Then, a feasible solution of E -SVC (3) is given by bbb Since (wk , bk , bk , k ) is at a corner of the feasible set of the ebe linear program (9), (wk , ek , ek , k ) is also a corner of the feasible set of E -SVC (3). Let q (·) be the ob jective function of E -SVC (3), which is also the ob jective function of the linear program (9). Then we have q (ek-1 , k-1 ) > q (bk , k ) q (ek , k ) = q (bk , k )/ wk , e b e b b ebe bbb (wk , ek , ek , k ) = (wk , bk , bk , k )/ wk . b C. Sketch of Pro of of Theorem 3 e For a homogeneous classifier h(x) = sign( w, x ), the folee lowing lemma holds: Lemma 12 (Scholkopf et al., 2000) Suppose that support ¨ e e X of x is in a bal l of radius R around the origin. Then, for e al l w such that w = 1, there exists a positive constant c e such that the fol lowing bound holds with probability at least 1 - : R[h] |{i | yi w, xi < }| ee e m v ! u u 2 4c2 R2 e 2 t log2 (2m) - 1 + log + . m 2 e , where the first inequality comes from the optimality of (bk , k ) of the linear program (9). The second inequality b b comes from wk > 1, which is ensured by wk-1 , wk = 1. b e Thus the algorithm finds a distinct corner of E -SVC (3) in each iteration. Since the number of corners of E -SVC (3) is finite, the algorithm terminates within finite iterations. Let d = (w b ) be a perturbation from the solution d = (w , b , , ) of Algorithm 7. Note that d is an optimal solution of the linear program (9) e with w = w . Using the Karush-Kuhn-Tucker (KKT) optimality conditions, we can express the increase q of the ob jective value as 1X i q := - + m iM 1 y1 x1 . . . ym xm O ,, « 0 C B y . . . ym - w w , = d 0 1 @ 1 ... 1 0 A µ I I can be regarded as homogeneous. The assumption that all the data points x live in a centered ball of radius R implies e that all the data points x live in a centered ball of radius p e R = R2 + 1. e Let w = (w b) 1+b2 e and x = (x , 1) . Then our classifier (1) When all the data points x live in a centered ball of radius R, we can assume without loss of generality that |b| R. Then we have 1 1 + b2 1 + R2 = . 2 2 e 2 The assumption w = 1 implies w = 1. Then we can e apply Lemma 12 to the current setting. The condition ee yi w, xi < results in e p yi ( w, xi + b) < 1 + b2 := . e where IRm , µ IRm , and 0 are KKT mul+ + tipliers. If d is a feasible perturbation (i.e., d + d is feasible), we can show that q > 0 (we omit its proof due to lack of space), which implies that d is locally optimal.