Stopping Conditions for Exact Computation of Leave-One-Out Error in Support Vector Machines

Vojtech Franc (1)          VOJTECH.FRANC@FIRST.FRAUNHOFER.DE
Pavel Laskov (1,2)         PAVEL.LASKOV@FIRST.FRAUNHOFER.DE
Klaus-Robert Müller (1,3)  KLAUS@FIRST.FRAUNHOFER.DE

1 Fraunhofer Institute FIRST, Kekulestr. 7, 12489 Berlin, Germany
2 University of Tuebingen, Wilhelm Schickard Institute for Computer Science, Sand 13, 72070 Tuebingen, Germany
3 Technical University of Berlin, Dept. of Computer Science, Franklinstr. 28/29, 10587 Berlin, Germany

Abstract

We propose a new stopping condition for a Support Vector Machine (SVM) solver which precisely reflects the objective of the Leave-One-Out (LOO) error computation. The stopping condition guarantees that the output of an intermediate SVM solution is identical to the output of the optimal SVM solution with one data point excluded from the training set. A simple augmentation of a general SVM training algorithm allows one to use a stopping criterion equivalent to the proposed sufficient condition. A comprehensive experimental evaluation shows a consistent speedup of the exact LOO computation by our method, up to a factor of 13 for the linear kernel. The new algorithm can be seen as an example of constructive guidance of an optimization algorithm towards achieving the best attainable expected risk at optimal computational cost.

1. Introduction

The interrelation between the computational complexity and the generalization ability of learning algorithms has seldom been considered in machine learning. Since the solutions to a majority of learning problems are obtained by iterative optimization algorithms, solution accuracy plays an important role in the estimation of expected risk (Bartlett & Mendelson, 2006). In practice, the available computational resources necessitate a tradeoff between the approximation error determined by the choice of a class of functions, the estimation error determined by a finite set of examples, and the optimization error determined by the accuracy of a solver attainable within a given time budget (Bottou & Bousquet, 2008). The asymptotic analysis in (Bottou & Bousquet, 2008) provides upper bounds on the time required to reach a certain expected risk by a given algorithm. From the practical point of view, it is desirable to have constructive influence over a learning algorithm, by choosing its parameters, such as the learning rate or the stopping conditions, so as to reach the best attainable expected risk.

The present contribution provides an example of such a constructive mechanism by developing optimal stopping conditions for SVM training using a particular estimator of the expected risk, the leave-one-out (LOO) error. Although exact computation of the LOO error is hardly used for large-scale learning due to its computational burden, our method is feasible for "small-scale" learning with "expensive" data (e.g. in bioinformatics or finance), especially when an accurate estimate of the expected risk is required.

The LOO error is known to provide an unbiased estimator of the generalization error (Lunts & Brailovskiy, 1967). The naive computation of the LOO error, i.e. by explicit re-learning after exclusion of each single example, is in all but the simplest cases impractical. The problem of speeding up the computation of the LOO error has received significant attention. The following approaches exist:

· LOO bounds provide an estimate of the LOO error given an optimal solution of the SVM training problem (Joachims, 2000; Vapnik & Chapelle, 2000; Jaakkola & Haussler, 1999; Zhang, 2001). These bounds are computationally efficient but imprecise. In practice, if an accurate estimate of the classification accuracy is needed, exact computation of the LOO error is unavoidable.

· Incremental SVM (Cauwenberghs & Poggio, 2000) allows one to determine exactly, for each candidate training point, once the optimal SVM solution has been obtained, whether or not it will be a LOO error. This approach avoids explicit re-training, but incremental unlearning of points is complicated and requires a special organization of matrix operations (Laskov et al., 2007).

· Loose stopping conditions based on the $\varepsilon$-KKT conditions allow one to speed up the LOO computation before an optimal solution is obtained. Such methods (e.g. (Lee & Lin, 2000; Martin et al., 2004)) use fairly simple heuristics, but they lack a theoretical justification that would connect $\varepsilon$ to the precision of the LOO computation. As illustrated by the examples in Section 2, these methods can also be imprecise.

In this contribution we propose a new stopping condition for an SVM solver which precisely reflects the objective of the LOO error computation. Our main result, given in Theorem 1, provides a sufficient condition under which the output of an intermediate SVM solution is identical to the output of the optimal SVM solution with one data point excluded from the training set. Although this sufficient condition cannot be evaluated directly, we propose a simple augmentation of a general SVM training algorithm which allows one to use a stopping criterion equivalent to the proposed sufficient condition.
2. Leave-One-Out Error Estimate for the Support Vector Machine Classifier

Let $X$ be a set of inputs and $Y = \{-1, +1\}$ a set of labels of an analyzed object. Let further $T_{XY} = \{(x_1, y_1), \dots, (x_m, y_m)\} \in (X \times Y)^m$ be a finite training set sampled i.i.d. from an unknown distribution $P(x, y)$. The goal is to learn a classifier $f\colon X \to Y$ minimizing the probability of misclassification $R[f] = \int V(y, f(x))\, dP(x, y)$, where $V(y, y') = 1$ for $y \neq y'$ and $V(y, y') = 0$ otherwise.

The SVMs represent the inputs in a Reproducing Kernel Hilbert Space (RKHS) $H$ via a map $\phi\colon X \to H$ which is implicitly defined by a kernel function $k\colon X \times X \to R$ (Schölkopf & Smola, 2002). The classifier is assumed to be linear, i.e., $f(x; w, b) = \langle w, \phi(x) \rangle + b$, where $w \in H$, $b \in R$ are unknown parameters and $\langle \cdot, \cdot \rangle$ denotes the inner product in the RKHS.

Because $R[f]$ cannot be minimized directly due to the unknown $P(x, y)$, the SVMs replace $R[f]$ by a regularized risk whose minimization leads to

$$ (w^*, b^*) = \mathop{\mathrm{argmin}}_{w \in H,\, b \in R} \Big( \tfrac{1}{2} \|w\|_H^2 + C \sum_{i \in I} \hat V(y_i, f(x_i; w, b)) \Big), \qquad (1) $$

where $C \in R_+$ is a regularization constant, $\hat V(y_i, f(x_i; w, b)) = \max(0,\, 1 - y_i f(x_i; w, b))$ is a convex piece-wise linear approximation of $V(y_i, f(x_i; w, b))$, and $I = \{1, \dots, m\}$. By the Representer theorem (Schölkopf & Smola, 2002), the optimal SVM classifier $f(x; w^*, b^*)$ can be expressed in the form

$$ f(x; \alpha, b) = \left\{ \begin{array}{rl} +1 & \text{if } \sum_{i \in I} \alpha_i y_i k(x, x_i) + b \geq 0, \\ -1 & \text{if } \sum_{i \in I} \alpha_i y_i k(x, x_i) + b < 0, \end{array} \right. \qquad (2) $$

where $\alpha = (\alpha_1, \dots, \alpha_m)^T \in R^m$ and $b \in R$. Substituting (2) into (1) allows one to find the optimal SVM classifier by solving the convex QP task

$$ (\alpha^*, b^*, \xi^*) = \mathop{\mathrm{argmin}}_{(\alpha, b, \xi) \in A} F(\alpha, b, \xi), \qquad (3) $$

where the convex objective function reads

$$ F(\alpha, b, \xi) = \tfrac{1}{2} \sum_{i \in I} \sum_{j \in I} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + C \sum_{i \in I} \xi_i, $$

and the convex feasible set $A$ contains all $(\alpha \in R^m,\, b \in R,\, \xi \in R^m)$ satisfying

$$ y_i \Big( \sum_{j \in I} \alpha_j y_j k(x_i, x_j) + b \Big) \geq 1 - \xi_i, \quad i \in I, \qquad \xi_i \geq 0, \quad i \in I. $$

Minimizing the regularized risk (1) (or the QP task (3), respectively) allows one to find the parameters $(\alpha, b)$ of the SVM classifier provided the hyper-parameters, i.e., the kernel function $k$ and the regularization constant $C$, are known. This is not the case in practice and the hyper-parameters $(C, k)$ must be optimized as well. A common approach is to select the best $(C, k)$ from a given finite set $\Theta$ by minimizing some performance measure. The set $\Theta$ is usually created by reasonably discretizing the hyper-parameter space. As the performance measure, the LOO error $R_{LOO}[f]$ is a common choice.

Let $(\alpha^*(r), b^*(r), \xi^*(r))$ denote the optimal solution of the primal QP task (3) with the $r$-th example removed from the training set, which is equivalent to solving the task

$$ (\alpha^*(r), b^*(r), \xi^*(r)) = \mathop{\mathrm{argmin}}_{(\alpha, b, \xi) \in A(r)} F(\alpha, b, \xi), \qquad (4) $$

where $A(r) = A \cap \{(\alpha, b, \xi) \mid \alpha_r = 0\}$, i.e., $A(r)$ denotes the original feasible set $A$ enriched by the additional constraint $\alpha_r = 0$. The LOO error estimator is defined as

$$ R_{LOO}[f(\cdot; \alpha^*, b^*)] = \frac{1}{m} \sum_{r \in I} V\big(y_r,\, f(x_r; \alpha^*(r), b^*(r))\big). \qquad (5) $$

The major practical disadvantage of the LOO error is its high computational cost: a naive approach to computing (5) requires solving $m$ different QP tasks (4).
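To make the cost of the naive approach concrete, the estimator (5) can be written as a plain re-training loop. The following Python sketch is only an illustration; the solver train_svm_excluding is a hypothetical placeholder standing for any exact method for the task (4), it is not part of the paper.

    import numpy as np

    def naive_loo_error(K, y, C, train_svm_excluding):
        """Naive LOO estimate (5): re-train m times, once per held-out example.

        K -- m x m kernel matrix with K[i, j] = k(x_i, x_j)
        y -- labels in {-1, +1}
        C -- regularization constant
        train_svm_excluding(K, y, C, r) -- hypothetical exact solver of task (4),
            i.e. the SVM trained with alpha_r forced to zero; returns (alpha, b)
        """
        m = len(y)
        errors = 0
        for r in range(m):
            alpha, b = train_svm_excluding(K, y, C, r)         # solve QP task (4)
            f_r = 1 if K[r].dot(alpha * y) + b >= 0 else -1    # response (2) on x_r
            errors += (f_r != y[r])
        return errors / m

Each pass solves a full QP, which is exactly what the sufficiency checks and the stopping condition developed below try to avoid.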
In some cases, however, the value $f(x_r; \alpha^*(r), b^*(r))$ can be derived immediately from the optimal solution $(\alpha^*, b^*, \xi^*)$ computed from the entire training set. Table 1 summarizes the known sufficiency checks; implication 1 is generally known and implications 2 and 3 are due to (Joachims, 2000).

1. If $\alpha^*_r = 0$ then $y_r = f(x_r; \alpha^*(r), b^*(r))$.

2. If $y_r \neq f(x_r; \alpha^*, b^*)$ then $y_r \neq f(x_r; \alpha^*(r), b^*(r))$.

3. Let $R^2$ be an upper bound on $k(x, x) - k(x, x')$, $x, x' \in X$, and let $(\alpha^*, b^*, \xi^*)$ be a stable solution, which means that there exists at least one $i \in I$ such that $0 < \alpha^*_i < C$. In this case, if $2 \alpha^*_r R^2 + \xi^*_r < 1$ then $y_r = f(x_r; \alpha^*(r), b^*(r))$.

Table 1. Sufficiency checks for computing $f(x_r; \alpha^*(r), b^*(r))$ directly from $(\alpha^*, b^*, \xi^*)$.

The portion of the training examples for which the sufficiency checks apply depends on the problem at hand (for an empirical study see (Martin et al., 2004)). (Joachims, 2000) proposed using the sufficiency checks to compute an upper bound on the LOO error called the $\xi\alpha$-estimator. It has been shown empirically that, in general, the $\xi\alpha$-estimator is not sufficiently precise for hyper-parameter tuning (Duan et al., 2003). Algorithm 1 shows the standard procedure for computing the LOO error exactly, with the sufficiency checks used to reduce the number of cases in which the solution of the QP task (4) is required.

Algorithm 1 Computation of the LOO Error
1: Solve the QP task (3) to obtain $(\alpha^*, b^*, \xi^*)$.
2: Apply the sufficiency checks from Table 1 to compute $f(x_r; \alpha^*(r), b^*(r))$ from $(\alpha^*, b^*, \xi^*)$.
3: For the examples unresolved in Step 2, solve the QP task (4) to obtain $(\alpha^*(r), b^*(r))$.
4: Compute the LOO error by (5) using $f(x_r; \alpha^*(r), b^*(r))$, $r \in I$, obtained in Steps 2 and 3.
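Step 2 of Algorithm 1 can be sketched as follows. This Python fragment is our illustration of the checks of Table 1 (the function name and interface are ours); it assumes the optimal solution of (3) and the bound $R^2$ are available.

    import numpy as np

    def sufficiency_checks(K, y, alpha, b, xi, C, R2):
        """Apply the checks of Table 1. Returns the resolved LOO responses
        (entries of 'pred' in {-1, +1}) and a boolean mask 'resolved'."""
        m = len(y)
        f = K.dot(alpha * y) + b                     # decision values of the full solution
        pred = np.zeros(m)
        resolved = np.zeros(m, dtype=bool)
        stable = np.any((alpha > 0) & (alpha < C))   # check 3 needs a stable solution
        for r in range(m):
            if alpha[r] == 0:                                  # check 1: non-support vector
                pred[r], resolved[r] = y[r], True
            elif y[r] * f[r] < 0:                              # check 2: a training error stays a LOO error
                pred[r], resolved[r] = -y[r], True
            elif stable and 2 * alpha[r] * R2 + xi[r] < 1:     # check 3 (Joachims, 2000)
                pred[r], resolved[r] = y[r], True
        return pred, resolved

Only the examples left unresolved by these checks require a call to the QP solver in Step 3.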
Algorithm 1 requires an optimization method that solves the QP tasks (3) and (4) exactly, i.e., one that produces the optimal solutions. Although such optimization methods exist, they are applicable only to very small problems. In practice, the QP tasks are solved only approximately via their dual representation, which is more suitable for optimization due to its simpler feasible set. In particular, the minimizer of the primal QP task (3) can be equivalently computed by solving the dual QP task

$$ \alpha^* = \mathop{\mathrm{argmax}}_{\alpha \in B} \Big( \sum_{i \in I} \alpha_i - \tfrac{1}{2} \sum_{i \in I} \sum_{j \in I} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \Big), \qquad (6) $$

where $B$ is a convex feasible set which contains all $\alpha \in R^m$ satisfying

$$ \sum_{i \in I} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad i \in I. $$

We will use $G(\alpha)$ to denote the objective function of (6). Having $\alpha^*$ computed, the remaining primal variables $(b^*, \xi^*)$ can be obtained easily from the Karush-Kuhn-Tucker (KKT) optimality conditions (e.g. (Boyd & Vandenberghe, 2004)). Similarly, the minimizer $\alpha^*(r)$ of the QP task (4) is obtained by solving the dual

$$ \alpha^*(r) = \mathop{\mathrm{argmax}}_{\alpha \in B(r)} G(\alpha), \qquad (7) $$

where $B(r) = B \cap \{\alpha \mid \alpha_r = 0\}$, and the primal variables $(b^*(r), \xi^*(r))$ can again be obtained from the KKT conditions. From the optimization point of view, the QP tasks (6) and (7) are equivalent, since the latter can be converted to the former simply by excluding the $r$-th variable. Thus we now concentrate only on the optimization of the QP task (6).

A framework of a commonly used QP solver optimizing (6) is described by Algorithm 2. Among the most popular methods which fit this framework are the Sequential Minimal Optimizer (SMO) (Platt, 1998), SVMlight (Joachims, 1998) and other decomposition methods (e.g. (Vapnik, 1995; Osuna et al., 1997)). All these solvers iteratively increase the dual criterion $G(\alpha)$ until the solution satisfies a stopping condition.

Algorithm 2 Commonly used iterative QP solver
1: Initialize $t := 0$ and $\alpha^{(t)} \in B$.
2: $t := t + 1$.
3: Update $\alpha^{(t-1)} \to \alpha^{(t)}$, i.e., find $\alpha^{(t)} \in B$ such that $G(\alpha^{(t-1)}) < G(\alpha^{(t)})$.
4: If $\alpha^{(t)}$ satisfies the $\varepsilon$-KKT conditions (8), halt; otherwise go to Step 2.

A relaxed version of the KKT optimality conditions is the most frequently used stopping criterion: let $\varepsilon \geq 0$ be a prescribed number and $\pi_i(\alpha) = 1 - y_i \sum_{j \in I} \alpha_j y_j k(x_i, x_j)$; then a vector $\alpha \in B$ satisfies the relaxed KKT conditions (e.g. (Keerthi et al., 2001)) if there exists $b \in R$ such that

$$ \begin{array}{ll} \pi_i(\alpha) - b y_i \leq \varepsilon, & \text{if } \alpha_i = 0, \\ -\pi_i(\alpha) + b y_i \leq \varepsilon, & \text{if } \alpha_i = C, \\ |\pi_i(\alpha) - b y_i| \leq \varepsilon, & \text{if } 0 < \alpha_i < C. \end{array} \qquad (8) $$

The tightness of the stopping condition (8) is controlled by $\varepsilon \geq 0$; hereafter we will refer to (8) as the $\varepsilon$-KKT conditions. An advantage of the $\varepsilon$-KKT conditions is their simplicity and low computational overhead: $O(m)$ operations, since $\pi_i(\alpha)$ is usually available during the course of the QP solver. A disadvantage of the $\varepsilon$-KKT conditions is the tricky choice of $\varepsilon$. Provided $\varepsilon = 0$, a solution satisfying the $\varepsilon$-KKT conditions is guaranteed to be optimal. The practically applicable QP solvers, however, are guaranteed to halt in a finite number of iterations only for $\varepsilon > 0$. A typically used value is $\varepsilon = 0.001$, e.g., in the software packages SVMlight (Joachims, 1998) or LIBSVM (Chang & Lin, 2001). To our knowledge, there is no theoretical result connecting $\varepsilon > 0$ to the value of $R_{LOO}[f(\cdot; \alpha^*, b^*)]$, which is the only desired outcome of the entire computation.
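The $\varepsilon$-KKT test (8) reduces to checking whether a single bias $b$ fits a set of interval constraints, which costs $O(m)$ once the values $\pi_i(\alpha)$ are known. The following Python sketch of this test is ours, not the authors' implementation:

    import numpy as np

    def eps_kkt_satisfied(K, y, alpha, C, eps):
        """Check the relaxed KKT conditions (8) for a dual-feasible alpha.

        pi_i(alpha) = 1 - y_i * sum_j alpha_j y_j k(x_i, x_j); the conditions
        hold iff some common b respects all the derived bounds on b."""
        pi = 1.0 - y * K.dot(alpha * y)
        b_lo, b_up = -np.inf, np.inf
        for i in range(len(y)):
            if alpha[i] < C:                  # pi_i - b*y_i <= eps must hold
                if y[i] > 0:
                    b_lo = max(b_lo, pi[i] - eps)
                else:
                    b_up = min(b_up, eps - pi[i])
            if alpha[i] > 0:                  # -pi_i + b*y_i <= eps must hold
                if y[i] > 0:
                    b_up = min(b_up, pi[i] + eps)
                else:
                    b_lo = max(b_lo, -(pi[i] + eps))
        return b_lo <= b_up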
We will illustrate the impact of $\varepsilon$ when the LOO error estimator is used for model selection. Let $\Theta$ be a given finite set of hyper-parameters $\theta = (C, k)$. Let $R_{LOO}(\theta, \varepsilon)$ denote the LOO error estimate computed for a given $\theta$ using Algorithm 1 with the QP solver of Algorithm 2 stopped at accuracy $\varepsilon$. Thus the estimated LOO error $R_{LOO}(\theta, \varepsilon)$ is a function of both the hyper-parameters $\theta$ and $\varepsilon$. For a fixed value $\varepsilon > 0$, the model selection produces the hyper-parameters

$$ \theta^*(\varepsilon) = \mathop{\mathrm{argmin}}_{\theta \in \Theta} R_{LOO}(\theta, \varepsilon). \qquad (9) $$

Figure 1 plots the behavior of $R_{LOO}(\theta^*(\varepsilon), \varepsilon)$ and $R_{LOO}(\theta^*(10^{-4}), \varepsilon)$, as well as the cost of the LOO error computation, as a function of $\varepsilon$ for three datasets selected from the IDA repository (cf. Section 5). The "golden truth" expected risk is given by the left-most points in the graphs (using $\varepsilon = 10^{-4}$ for both model selection and risk estimation). The dashed line, representing $R_{LOO}(\theta^*(10^{-4}), \varepsilon)$, shows that the expected risk is only slightly overestimated provided we use high accuracy for model selection and variable accuracy for risk estimation. The solid line, representing $R_{LOO}(\theta^*(\varepsilon), \varepsilon)$, shows that a low-accuracy LOO computation used in model selection eventually results in overfitting, as a model is selected that grossly underestimates the expected risk. Interestingly, both plots coincide until a certain breakdown point, beyond which the low-accuracy LOO estimation runs aground. The breakdown point varies between 0.001 and 0.1 depending on the dataset. This suggests that the commonly used $\varepsilon = 0.001$ is a reasonable setting to obtain an accurate estimate. It is, however, clear from the timing plots that knowing the right accuracy could significantly lower the computational cost.

[Figure 1: three panels (Banana, German, Image) plotting $R_{LOO}$ and the CPU time in seconds against $\log_{10}(\varepsilon)$; legend: $R_{LOO}(\theta^*(\varepsilon), \varepsilon)$, $R_{LOO}(\theta^*(10^{-4}), \varepsilon)$, CPU time.]

Figure 1. Influence of the parameter $\varepsilon$ of the $\varepsilon$-KKT conditions on the LOO error estimate and the required computational time for three data sets (Banana, German and Image) selected from the IDA repository.
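In code, the model selection (9) is a plain grid search; the sketch below (ours) assumes a function loo_error(theta, eps) that implements Algorithm 1 with a solver stopped by the $\varepsilon$-KKT conditions at accuracy eps.

    def select_model(Theta, eps, loo_error):
        """Model selection (9): return the theta = (C, k) in Theta minimizing
        the LOO estimate R_LOO(theta, eps), together with that estimate."""
        best_theta, best_risk = None, float("inf")
        for theta in Theta:
            risk = loo_error(theta, eps)      # R_LOO(theta, eps) via Algorithm 1
            if risk < best_risk:
                best_theta, best_risk = theta, risk
        return best_theta, best_risk

The two curves in Figure 1 correspond to selecting $\theta$ either at accuracy $\varepsilon$ or at the fixed high accuracy $10^{-4}$, and then evaluating $R_{LOO}$ at accuracy $\varepsilon$.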
It is easy to show, that if the condition (14) is not satisfiable then the error estimate V (yr ; f (xr ; (r), b (r)) is unstable anyway since there exists an optimal classifier f (x; (r), b (r)) its separating hyperplane passes through the tested point xr . min F (, b, ) , (1 1 ) ^b the equation f (xr ; , ^) = f (xr ; (r), b (r)) holds. Proof 1 We proof Theorem 1 by transposition: we show ^b that f (xr ; , ^) = f (xr ; (r), b (r)) implies the assumption (11) is violated. Without loss of generality let ^b f (xr ; , ^) = +1 and f (xr ; (r), b (r)) = -1. Let us define three vectors ^ (r) 0 ^ b = ^ , (r) = b (r) , 0 = b0 . ^ (r) 0 With a slight abuse of notation, we will handle F as a function of a single argument R2m+1 . Then the assump^b tions f (xr ; , ^) = +1 and f (xr ; (r), b (r)) = -1 is ^ equivalent to A+ (r) A0 (r) and (r) A- (r). ^ From (10) it follows that for any A+ (r) A0 (r) and (r) A- (r) there exists [0, 1] such that ^ 0 = ((1 - ) + (r)) A0 (r). Since F is convex, ^ [0, 1] and F ( ) F () we can write F (0 ) ^ (1 - )F ( ) + F ( (r)) F ^ max (), F ( (r)) = ^ F () , 4. Algorithm A direct application of Corollary 1 would require solving a mixed set of one quadratic and many linear inequalities. We are not aware of any simple and efficient algorithm to solve such task. Instead, we show how to use Corollary 1 to derive a novel stopping condition for a standard iterative QP solver (cf. Algorithm 2). Algorithm 3 Proposed QP solver 1: Initialize t := 0, (t) B (r) and (t) B0 (r). 2: t := t + 1. 3: Update (t-1) (t) , i.e., find (t) A(r) such 4: 5: 6: 7: which shows that there exist 0 A0 (r) such that ^ F ( 0 ) F ( ). Using the original notation, this is equiv^b^ alent to F (0 , b0 , 0 ) F (, ^, ). However, this contradicts the assumption (11) which was to be shown. ^b^ By Theorem 1, any triplet (, ^, ) A(r) satisfying the ^b condition (11) determines a classifier f (x; , ^) which has the same response on the input xr as the optimal classifier f (x; (r), b (r)). From a practical point of view, this result cannot be used directly due to the unknown value of the right hand side of the inequality (11), i.e., (,b,)A0 (r ) that G((t-1) ) < G((t) ). If (t) satisfies the -KKT conditions then halt otherwise continue to Step 5. For fixed (t) compute feasible b(t) and (t) minimizing F ((t) , b, ). Update (t-1) (t) , i.e., find (t) B0 (r) such that H ( (t-1) ) < H ( (t) ). If F ((t) , b(t) , (t) ) < H ( (t) ) holds then halt otherwise go to Step 2. min F (, b, ) . (1 2 ) The proposed method is described by Algorithm 3 which, compared to a standard QP solver, involves three additional Steps 5, 6 and 7. In Step 5, the algorithm computes the Stopping Conditions for Exact Computation of Leave-One-Out Error in Support Vector Machines primal variables ( (t) , b(t) ) minimizing F ((t) , b(t) , (t) ) which amounts to a simple optimization problem since (t) is known. In Step 6, the algorithm maximizes the auxiliary criterion H ( (t) ) w.r.t. (t) . Finally, in Step 7, the algorithm checks whether H ((t) ) has become greater than F ((t) , b(t) , (t) ); as soon as this occurs the algorithm halts since f (xr ; (t) , b(t) ) = f (xr ; (r), b (r)) is guaranteed according to Corollary 1. The -KKT conditions are retained in Step 3 of Algorithm 3 since the condition (14) need not be satisfiable in general. The proposed Algorithm 3 is intended to be used for computation of f (xr ; (r), b (r)), i.e., it is called in Step 4 of Algorithm 1 calculating the LOO error, as a replacement for the standard QP solver. 
4. Algorithm

A direct application of Corollary 1 would require solving a mixed set of one quadratic and many linear inequalities. We are not aware of any simple and efficient algorithm for such a task. Instead, we show how to use Corollary 1 to derive a novel stopping condition for a standard iterative QP solver (cf. Algorithm 2).

Algorithm 3 Proposed QP solver
1: Initialize $t := 0$, $\alpha^{(t)} \in B(r)$ and $\beta^{(t)} \in B_0(r)$.
2: $t := t + 1$.
3: Update $\alpha^{(t-1)} \to \alpha^{(t)}$, i.e., find $\alpha^{(t)} \in B(r)$ such that $G(\alpha^{(t-1)}) < G(\alpha^{(t)})$.
4: If $\alpha^{(t)}$ satisfies the $\varepsilon$-KKT conditions, halt; otherwise continue to Step 5.
5: For fixed $\alpha^{(t)}$ compute feasible $b^{(t)}$ and $\xi^{(t)}$ minimizing $F(\alpha^{(t)}, b, \xi)$.
6: Update $\beta^{(t-1)} \to \beta^{(t)}$, i.e., find $\beta^{(t)} \in B_0(r)$ such that $H(\beta^{(t-1)}) < H(\beta^{(t)})$.
7: If $F(\alpha^{(t)}, b^{(t)}, \xi^{(t)}) < H(\beta^{(t)})$ holds, halt; otherwise go to Step 2.

The proposed method is described by Algorithm 3 which, compared to a standard QP solver, involves three additional steps, Steps 5, 6 and 7. In Step 5, the algorithm computes the primal variables $(b^{(t)}, \xi^{(t)})$ minimizing $F(\alpha^{(t)}, b, \xi)$, which amounts to a simple optimization problem since $\alpha^{(t)}$ is known. In Step 6, the algorithm maximizes the auxiliary criterion $H(\beta)$ with respect to $\beta^{(t)}$. Finally, in Step 7, the algorithm checks whether $H(\beta^{(t)})$ has become greater than $F(\alpha^{(t)}, b^{(t)}, \xi^{(t)})$; as soon as this occurs the algorithm halts, since $f(x_r; \alpha^{(t)}, b^{(t)}) = f(x_r; \alpha^*(r), b^*(r))$ is guaranteed according to Corollary 1. The $\varepsilon$-KKT check is retained in Step 4 of Algorithm 3 since the condition (14) need not be satisfiable in general. The proposed Algorithm 3 is intended for the computation of $f(x_r; \alpha^*(r), b^*(r))$, i.e., it is called in Algorithm 1 whenever the QP task (4) must be solved for an unresolved example, as a replacement for the standard QP solver.

In terms of the accuracy of the computed LOO error, the proposed algorithm cannot perform worse than the standard one. If the $\varepsilon$-KKT conditions are satisfied earlier than the proposed stopping condition, both solvers find an identical classifier. In the opposite case, however, the response of the classifier found by the proposed algorithm is guaranteed to be optimal. Although the proposed algorithm provides a theoretical guarantee that the found response is optimal, from a practical point of view both algorithms will produce an identical LOO error estimate for a sufficiently low $\varepsilon$. We will show empirically, however, that the proposed algorithm is numerically more efficient, even though it optimizes two QP tasks simultaneously compared to the standard approach. The higher efficiency is achieved because the proposed stopping condition is often satisfied earlier than the $\varepsilon$-KKT conditions.

To increase the numerical performance we also implemented the following simple efficiency test. The proposed algorithm is not applied to a single example but rather to the whole set of examples which cannot be resolved by the sufficiency checks. We observed experimentally that the efficiency of the proposed algorithm can be reliably estimated from a few examples. This allows us to switch to the standard QP solver when the efficiency of the proposed algorithm is low. The efficiency test, used when Algorithm 1 solves the QP task (4) for the unresolved examples, works as follows (see also the sketch below): we apply the proposed Algorithm 3 to the first $M$ examples. Let $M_{Prec}$ denote the number of examples for which Algorithm 3 halted in Step 7 (i.e., the proposed stopping condition applied). If $M_{Prec} / M < 0.5$, we switch from Algorithm 3 to the standard Algorithm 2. We empirically found $M = 10$ to be a good choice in all our experiments.
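The efficiency test can be written as a thin wrapper around the solver calls for the unresolved examples. The sketch below follows the stated choice M = 10; solve_proposed (Algorithm 3, also reporting whether it halted in Step 7) and solve_standard (Algorithm 2) are hypothetical callbacks, not the authors' code.

    def solve_unresolved(unresolved, solve_proposed, solve_standard, M=10):
        """Try Algorithm 3 on the first M unresolved examples; if it halts via the
        new condition (Step 7) in fewer than half of them, fall back to Algorithm 2."""
        responses = {}
        trial = unresolved[:M]
        hits = 0
        for r in trial:
            f_r, stopped_in_step7 = solve_proposed(r)
            responses[r] = f_r
            hits += int(stopped_in_step7)
        use_proposed = len(trial) > 0 and hits / len(trial) >= 0.5
        for r in unresolved[M:]:
            responses[r] = solve_proposed(r)[0] if use_proposed else solve_standard(r)
        return responses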
5. Experiments

In this section we experimentally evaluate the proposed method for computing the LOO error against the standard approach on the datasets from the IDA benchmark repository (http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm).

The standard approach computes the LOO error using the procedure described by Algorithm 1. An iterative QP solver with the $\varepsilon$-KKT conditions (Algorithm 2) is called whenever the solution of a QP task is required. In particular, we used the Improved SMO algorithm (Keerthi et al., 2001) to implement the QP solver. In addition, we implemented the $\alpha$-seeding approach (DeCoste & Wagstaff, 2000), which re-uses the solution $\alpha^*$ (obtained in Step 1 of Algorithm 1) to efficiently set up the initial solution of the QP solver (the initialization of $\alpha^{(0)}$ in Step 1 of Algorithm 2).

The proposed approach uses the same procedure for computing the LOO error except for the QP solver applied to the unresolved examples, where we applied the proposed Algorithm 3. Algorithm 3 involves the optimization of $G(\alpha)$ w.r.t. $\alpha \in B(r)$ and of $H(\beta)$ w.r.t. $\beta \in B_0(r)$, required in Step 3 and Step 6, respectively. We again used the Improved SMO algorithm to optimize $G(\alpha)$ w.r.t. $\alpha \in B(r)$, and a straightforward modification of it to optimize $H(\beta)$ w.r.t. $\beta \in B_0(r)$ (since $B_0(r)$ does not contain the equality constraint, a single variable can be updated at a time). The experiments were carried out in a Matlab 6 environment running on a Linux machine with an AMD K8 2.2 GHz processor. Algorithms 1, 2 and 3 were implemented in C.

The IDA repository consists of 13 artificial and real-world binary classification problems collected from the UCI, DELVE and STATLOG repositories (cf. (Rätsch et al., 2001)). For each dataset there are 100 random realizations of the training and testing sets (except for the Image and Splice sets, where there are 20). The training parts of the first 5 realizations are used for model selection. The best hyper-parameters $(C, k)$ were selected from a finite set $\Theta$ by minimizing the average LOO error $\hat R_{LOO}$, computed over the 5 realizations. We considered two separate model selection problems, for the linear and for the RBF kernel. In the case of the linear kernel, the model was selected from

$$ \Theta = \{C \mid C = 10^i,\; i = -2, \dots, 2\} \times \{k \mid k(x, x') = \langle x, x' \rangle\}, $$

and in the case of the RBF kernel from

$$ \Theta = \{C \mid C = 10^{\frac{i}{7} \log(500)},\; i = 0, \dots, 7\} \times \{k \mid k(x, x') = \exp(-2^{\frac{11 i}{7}} \|x - x'\|^2),\; i = 0, \dots, 7\}. $$

Having the model selected, the classifier is trained on all 100 realizations of the training sets and the testing error is computed on the corresponding testing sets. The reported testing errors $\hat R_{TST}$ are averages, accompanied by the standard deviations, computed over the 100 realizations.
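The hyper-parameter grids $\Theta$ can be reproduced as follows. The exact spacing of the RBF widths is our reading of the formulas above (with the logarithm taken to base 10), so it should be treated as an assumption rather than as the authors' code.

    import numpy as np

    def linear_grid():
        """Theta for the linear kernel: C in {10^-2, ..., 10^2}, k(x, x') = <x, x'>."""
        Cs = [10.0 ** i for i in range(-2, 3)]
        kernels = [lambda X, Z: X.dot(Z.T)]
        return [(C, k) for C in Cs for k in kernels]

    def rbf_grid():
        """Theta for the RBF kernel: 8 values of C spaced as 10^{(i/7) log10(500)}
        and 8 widths gamma = 2^{11 i / 7} (our reading of the grid in Section 5)."""
        Cs = [10.0 ** (i / 7.0 * np.log10(500.0)) for i in range(8)]
        gammas = [2.0 ** (11.0 * i / 7.0) for i in range(8)]

        def make_rbf(g):
            return lambda X, Z: np.exp(-g * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

        kernels = [make_rbf(g) for g in gammas]
        return [(C, k) for C in Cs for k in kernels]

Under this reading, the RBF grid contains 8 x 8 = 64 hyper-parameter pairs, so the model selection over 5 realizations requires 5 x 64 = 320 LOO computations.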
Table 2 shows the average LOO errors $\hat R_{LOO}$ and the testing errors $\hat R_{TST}$ for the best selected models. We verified experimentally that the standard and the proposed approach yielded identical LOO errors, since the low value $\varepsilon = 0.001$ was used in the $\varepsilon$-KKT conditions; the errors listed in Table 2 therefore apply to both approaches. We also found that the classification errors for the RBF kernel are very similar to the errors reported in (Rätsch et al., 2001) for the SVM classifier with the RBF kernel tuned by 5-fold cross-validation. Interestingly, the linear kernel achieves in some cases a performance comparable to the more complex RBF kernel. We can also observe that the average LOO errors $\hat R_{LOO}$ for the linear kernel are very good estimators of the testing errors $\hat R_{TST}$.

Table 2. Classification performance (in %) of the best models selected by minimizing the LOO error estimate.

             Linear kernel                RBF kernel
Dataset      LOO      TST                 LOO      TST
Banana       41.40    47.80 (±4.58)        8.55    10.43 (±0.44)
Breast       27.20    29.00 (±4.83)       23.00    26.06 (±4.91)
Diabetis     22.05    23.44 (±1.70)       21.50    23.27 (±1.65)
Flare        32.85    32.33 (±1.82)       32.31    34.04 (±2.04)
German       24.91    24.06 (±2.22)       23.71    23.61 (±2.23)
Heart        14.24    15.22 (±3.22)       13.53    15.55 (±3.36)
Image        15.48    15.34 (±0.84)        2.94     3.15 (±0.63)
Ringnorm     23.45    24.59 (±0.67)        1.10     1.60 (±0.11)
Splice       15.36    16.20 (±0.59)       10.60    10.95 (±0.64)
Thyroid       8.43    10.16 (±2.60)        2.29     4.87 (±2.28)
Titanic      21.20    23.01 (±4.62)       15.47    23.99 (±3.47)
Twonorm       2.50     2.90 (±0.27)        2.25     2.59 (±0.18)
Waveform     10.90    12.95 (±0.54)        8.85    10.50 (±0.43)

Table 3 summarizes the numerical efficiency of the proposed and the standard approach. The efficiency was measured in terms of the computational time and the number of kernel evaluations. The reported Time is the overall computational time spent by a given algorithm to calculate all the LOO errors needed for the model selection; e.g., in the case of the RBF kernel it was necessary to compute 5 x 64 = 320 LOO error estimates (5 is the number of training set realizations and 64 is the cardinality of $\Theta$). Similarly, the number of kernel evaluations KerEval is the overall value normalized by the number of training data $m$, i.e., KerEval is the number of computed columns of the kernel matrix. For the standard approach we list the absolute values of Time and KerEval. For the proposed approach we list the obtained speedup, computed as the ratio Standard/Proposed. The last column of each block of Table 3 contains the value Prec, the percentage of cases in which the proposed stopping condition was satisfied earlier than the $\varepsilon$-KKT conditions, i.e., in Prec percent of the cases the computed LOO response is theoretically guaranteed to be optimal.

It can be seen that the proposed method was never slower (up to the rounding error in computing the speedup) than the standard algorithm, both in terms of the computational time and of the kernel evaluations. A higher performance gain was achieved for the linear kernel than for the RBF kernel. For the linear kernel, the proposed approach was on average 4 times faster than the standard approach. The best performance was achieved for the Image dataset, where the speedup was nearly 13. For the RBF kernel the average speedup was slightly above 2, and in the best case the speedup was about 5, for the Banana dataset. Thus, while the efficiency gained for the RBF kernel is only moderate, in the case of the linear kernel it is much more appealing.

Table 3. Efficiency of computing the LOO error for the standard and the proposed approach. For the standard approach, Time [s] and KerEval (x10^6) are absolute values; for the proposed approach, the speedups Standard/Proposed are given; Prec [%] is the percentage of cases resolved by the proposed stopping condition.

Linear kernel:
Dataset      Time [s]   KerEval x10^6   Time speedup   KerEval speedup   Prec [%]
Banana          728.0      109.5            9.9            12.3            56.9
Breast          128.7       34.6            2.0             2.1            64.2
Diabetis       3705.3      432.1            6.3             7.2            93.1
Flare            61.5        4.7            1.0             1.1            59.3
German        47603.8     3685.7            4.2             4.4            82.0
Heart           280.7       90.1            1.9             2.1            83.1
Image        114373.9     4793.8           12.9            14.9            98.3
Ringnorm      11632.8     1536.0            2.8             3.0            91.8
Splice        41072.8     2222.1            1.5             1.5            74.5
Thyroid          21.6        8.2            3.2             5.2            66.0
Titanic           0.8        0.1            1.0             1.0            17.6
Twonorm          52.8        8.3            1.9             2.2            89.1
Waveform       3422.8      503.3            2.4             2.6            90.7

RBF kernel:
Dataset      Time [s]   KerEval x10^6   Time speedup   KerEval speedup   Prec [%]
Banana         1565.1      222.4            5.1             6.1            81.9
Breast          265.9       74.0            1.5             1.6            55.6
Diabetis       3163.6      382.3            1.4             1.5            54.8
Flare          2263.0      189.1            2.2             2.4            50.3
German         6513.7      549.4            1.1             1.1            31.3
Heart            59.5       20.7            1.1             1.2            50.3
Image         10852.7      514.5            2.9             3.1            75.3
Ringnorm        277.2       47.4            1.4             1.4            39.1
Splice         6360.3      453.8            1.0             1.0            14.6
Thyroid          10.3        3.3            1.5             2.4            80.3
Titanic          25.5        7.9            5.7             2.7            50.2
Twonorm         175.3       36.8            1.0             1.0            22.2
Waveform        345.6       59.4            1.1             1.1            30.9

6. Conclusions

The new stopping conditions for an SVM solver proposed in this contribution allow one to determine the solution accuracy needed for exact computation of the LOO error. Our new algorithm significantly reduces the complexity of the LOO error computation without a risk of overfitting due to imprecise optimization. Our experiments on 13 datasets from the IDA repository achieved an average speedup of 2 to 4 times and a maximal speedup of up to a factor of 13. These results demonstrate the importance of investigating the relationships between optimization accuracy and expected risk estimation in machine learning, as suggested by recent work (Bartlett & Mendelson, 2006; Bottou & Bousquet, 2008). To our knowledge, the new algorithm is the first theoretically justified constructive instrument to guide an optimization algorithm, for the particular case of the SVM QP solver and the LOO error, towards achieving the best attainable expected risk at optimal computational cost. Future work should explore more general mechanisms for relating the parameters of optimization algorithms deployed in machine learning to the estimation of the expected risk.

Acknowledgements

VF was supported by the Marie Curie Intra-European Fellowship grant SCOLES (MEIF-CT-2006-042107). This work was also supported by the Bundesministerium für Bildung und Forschung under the project REMIND (FKZ 01IS07007A).
References

Bartlett, P., & Mendelson, S. (2006). Empirical minimization. Probability Theory and Related Fields, 135, 311-334.

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, vol. 20. Cambridge, MA: MIT Press. To appear.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Cauwenberghs, G., & Poggio, T. (2000). Incremental and decremental support vector machine learning. NIPS (pp. 409-415). MIT Press.

Chang, C., & Lin, C. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. Int. Conf. on Knowledge Discovery and Data Mining.

Duan, K., Keerthi, S., & Poo, A. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 15, 487-507.

Jaakkola, T., & Haussler, D. (1999). Probabilistic kernel regression models. Conference on AI and Statistics. Morgan Kaufmann.

Joachims, T. (1998). Making large-scale support vector machine learning practical. In B. Schölkopf et al. (Eds.), Advances in kernel methods: Support vector learning. MIT Press, Cambridge, MA.

Joachims, T. (2000). Estimating the generalization performance of a SVM efficiently. ICML. San Francisco, CA.

Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637-649.

Laskov, P., Gehl, C., Krüger, S., & Müller, K.-R. (2007). Incremental support vector learning: analysis, implementation and applications. Journal of Machine Learning Research, 7, 1900-1936.
Lee, J., & Lin, C. (2000). Automatic model selection for support vector machines (Technical Report). Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Lunts, A., & Brailovskiy, V. (1967). Evaluation of attributes obtained in statistical decision rules. Engineering Cybernetics, 3, 98-109.

Martin, M., Keerthi, S., Ong, C., & DeCoste, D. (2004). An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels. IEEE Transactions on Neural Networks, 15, 750-757.

Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. Neural Networks for Signal Processing VII, Proceedings of the 1997 IEEE Workshop (pp. 276-285). IEEE.

Platt, J. (1998). Fast training of SVMs using sequential minimal optimization. In B. Schölkopf et al. (Eds.), Advances in kernel methods: Support vector learning. MIT Press.

Rätsch, G., Onoda, T., & Müller, K. (2001). Soft margins for AdaBoost. Machine Learning, 43, 287-320.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press.

Vapnik, V. (1995). The nature of statistical learning theory. Springer.

Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. Smola et al. (Eds.), Advances in large margin classifiers, 261-280. MIT Press.

Zhang, T. (2001). A leave-one-out cross validation bound for kernel methods with application in learning. COLT. Springer.