Optimized Cutting Plane Algorithm for Support Vector Machines

Vojtech Franc    vojtech.franc@first.fraunhofer.de
Soeren Sonnenburg    soeren.sonnenburg@first.fraunhofer.de
Fraunhofer Institute FIRST, Kekulestr. 7, 12489 Berlin, Germany

Abstract

We have developed a new Linear Support Vector Machine (SVM) training algorithm called OCAS. Its computational effort scales linearly with the sample size. In an extensive empirical evaluation OCAS significantly outperforms current state-of-the-art SVM solvers such as SVMlight, SVMperf and BMRM, achieving speedups of over 1,000 over SVMlight and 20 over SVMperf on some datasets, while obtaining the same precise Support Vector solution. Even in the early optimization steps OCAS often shows faster convergence than the approximative methods SGD and Pegasos that have so far prevailed in this domain. By effectively parallelizing OCAS we were able to train on a dataset of 15 million examples (itself about 32GB in size) in just 671 seconds; a competing string kernel SVM required 97,484 seconds to train on 10 million examples subsampled from this dataset.

1. Introduction

Many applications in, e.g., Bioinformatics, IT-Security and Text-Classification come with huge amounts (e.g. millions) of data points, which are indeed needed to obtain state-of-the-art results. They therefore require computationally extremely efficient methods capable of dealing with ever growing data sizes. Support Vector Machines (SVM), e.g. (Cortes & Vapnik, 1995; Cristianini & Shawe-Taylor, 2000), have proven to be powerful tools for a wide range of different data analysis problems.

Given labeled training examples {(x_1, y_1), ..., (x_m, y_m)} in (R^n x {-1, +1})^m and a regularization constant C > 0, they learn a linear classification rule h(x) = sgn(\langle w, x \rangle + b) by solving the quadratic SVM primal optimization problem (P) or its dual formulation (D), which allows the use of kernels:

(P)   \min_{w \in R^n, \xi \in R^m_+, b \in R} \; \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m}\xi_i
      \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \dots, m

(D)   \max_{\alpha \in R^m} \; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
      \text{s.t.} \quad \sum_{i=1}^{m}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le \frac{C}{m}, \quad i = 1, \dots, m

Due to the central importance of SVMs, many techniques have been proposed to solve the SVM problem. As in practice only limited precision solutions to (P) and (D) can be obtained, they may be categorized into approximative and accurate.

Approximative Solvers make use of heuristics (e.g. learning rate, number of iterations) to obtain (often crude) approximations to the QP solution. They have very low per-iteration cost and low total training time. Especially for large-scale problems they are claimed to be sufficiently precise while delivering the best performance vs. training time trade-off (Bottou & Bousquet, 2008), which may be attributed to the robust nature of large-margin SVM solutions. However, while they are fast in the beginning, they often fail to achieve a precise solution. Among the to-date most efficient solvers are Pegasos (Shwartz et al., 2007) and SGD (Bottou & Bousquet, 2008), which are based on stochastic (sub-)gradient descent.

Accurate Solvers In contrast to approximative solvers, accurate methods solve the QP up to a given precision ε, where ε commonly denotes the violation of the relaxed KKT conditions (Joachims, 1999) or the (relative) duality gap. Accurate methods often have good asymptotic convergence properties, and thus for small ε converge to very precise solutions, limited only by numerical precision. Classical examples are off-the-shelf optimizers (e.g. MINOS, CPLEX, LOQO). However, it is usually infeasible to use standard optimization tools for solving the SVM training problem (D) on datasets containing more than a few thousand examples.
So-called decomposition techniques such as chunking (e.g. used in (Joachims, 1999)) or SMO (used in (Chang & Lin, 2001)) overcome this limitation by exploiting the special structure of the SVM problem. The key idea of decomposition is to freeze all but a small number of optimization variables (the working set) and to solve a sequence of constant-size problems (subproblems of the SVM dual). While decomposition-based solvers are very flexible, as they work in the dual and thus allow the use of kernels, they become computationally intractable with a few hundred thousand examples. This limitation can be explained as follows: decomposition methods exploit the fact that the optimal solution of (P) does not change if constraints inactive at the optimum are removed, and they are therefore only efficient if the number of active constraints is reasonably small. Unfortunately, the number of active constraints is lower bounded by the portion of misclassified examples, which is proportional to the number of examples m (empirically, about 10%-30% of the training points become active constraints). Thus decomposition methods are computationally prohibitive for large-scale problems. This poses a challenging task even for current state-of-the-art SVM solvers such as SVMlight (Joachims, 1999), the Gradient Projection-based Decomposition Technique SVM (GPDT-SVM) (Zanni et al., 2006) and LibSVM (Chang & Lin, 2001).

As improving training times using the dual formulation is hard, the research focus has shifted back to the original SVM primal problem. The importance of being able to efficiently solve the primal problem for large datasets is documented by a number of very recently developed methods, e.g. SVMLin (Sindhwani & Keerthi, 2007; Chapelle, 2007), LibLinear (Lin et al., 2007), SVMperf (Joachims, 2006) and BMRM (Teo et al., 2007). In the following we focus on finding accurate solutions of the unconstrained linear SVM primal problem^1

w^* = \arg\min_{w} F(w) := \frac{1}{2}\|w\|^2 + C\, R(w),   (1)

where

R(w) = \frac{1}{m}\sum_{i=1}^{m}\max\{0, 1 - y_i \langle w, x_i \rangle\}   (2)

is a convex risk approximating the training error.

^1 Note that we focus on the linear rule without a bias. The bias can be included by adding a constant feature to each training example x_i.
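For concreteness, the following minimal numpy sketch evaluates the risk (2) and the objective (1). It only illustrates the notation used below; it is not part of the authors' implementation, and the function names are ours.

```python
import numpy as np

def risk(w, X, y):
    """Average hinge loss R(w) of Eq. (2); X holds one example per row."""
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def objective(w, X, y, C):
    """Regularized unconstrained objective F(w) of Eq. (1)."""
    return 0.5 * np.dot(w, w) + C * risk(w, X, y)
```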
Among the to-date most efficient accurate solvers of the SVM primal problem (1) are the Cutting Plane Algorithm (CPA) based methods put forward in (Joachims, 2006; Teo et al., 2007) and implemented in SVMperf and BMRM. The idea of CPAs is to approximate the risk R by a piece-wise linear function defined as the maximum over a set of linear underestimators, in CPA terminology called cutting planes. In (Joachims, 2006; Teo et al., 2007) it was shown that the number of required cutting planes does not depend on the number of training examples m and that very few such cutting planes are needed in practice to sufficiently approximate (1).

In this work we propose a new method, called the Optimized Cutting Plane Algorithm for SVMs (OCAS). We empirically show that OCAS converges on a wide variety of large-scale datasets considerably faster than SVMperf, BMRM and SVMlight, achieving speedups of several orders of magnitude on some problems. We also demonstrate that OCAS, even in the early optimization steps, shows faster convergence than the approximative methods that have so far dominated this domain. Finally, we critically analyze all solvers w.r.t. classification performance in an extensive model selection study.

The paper is organized as follows. CPA is described in Section 2. In Section 3 we point out a source of inefficiency of CPA, propose a new method, OCAS, to alleviate the problem, and prove its convergence. An extensive empirical evaluation is given in Section 4, and Section 5 concludes the paper.

2. Cutting Plane Algorithm

Recently, the Cutting Plane Algorithm (CPA) based large-scale solvers SVMperf (Joachims, 2006) and BMRM (Teo et al., 2007) have been proposed. SVMperf implements CPA specifically for the linear SVM problem (1). Decoupling regularizer and loss function, BMRM generalizes SVMperf to a wide range of losses and regularizers, making it applicable to many machine learning problems such as classification, regression and structure learning. It should be noted that BMRM with the two-norm regularizer \|\cdot\|^2 and the hinge loss (i.e. the SVM problem (1)) coincides with SVMperf. It was shown that SVMperf and BMRM by far outperform decomposition methods like SVMlight on large-scale problems. The rest of this section describes the idea behind CPA for the standard SVM setting (1) in more detail.

In CPA terminology, the original problem (1) is called the master problem. Using the approach of (Teo et al., 2007) one may define a reduced problem of (1), which reads

w_t = \arg\min_{w} F_t(w) := \frac{1}{2}\|w\|^2 + C\, R_t(w).   (3)

Problem (3) is obtained from the master problem (1) by substituting a piece-wise linear approximation R_t for the risk R while leaving the regularization term unchanged, i.e. only the complex part of the objective F is approximated.

The approximation R_t is derived as follows. Since the risk R is a convex function, it can be approximated at any point w' by a linear underestimator

R(w) \ge R(w') + \langle a', w - w' \rangle, \quad \forall w \in R^n,   (4)

where a' is any subgradient of R at the point w'. We will use the shortcut b' = R(w') - \langle a', w' \rangle to abbreviate (4) as R(w) \ge \langle a', w \rangle + b'. In CPA terminology, \langle a', w \rangle + b' = 0 is called a cutting plane. A subgradient a' of R at the point w' can be obtained as

a' = -\frac{1}{m}\sum_{i=1}^{m}\pi_i y_i x_i, \quad \text{where } \pi_i = 1 \text{ if } y_i\langle w', x_i \rangle \le 1 \text{ and } \pi_i = 0 \text{ if } y_i\langle w', x_i \rangle > 1.   (5)

To get a better approximation of the risk R than a single cutting plane, one may use a collection of cutting planes {\langle a_i, w \rangle + b_i = 0 | i = 1, ..., t} computed at t distinct points {w_1, ..., w_t} and take their point-wise maximum

R_t(w) = \max\{0, \max_{i=1,\dots,t}(\langle a_i, w \rangle + b_i)\}.   (6)

The zero cutting plane is added to the maximization as the risk R is always greater than or equal to zero. It follows directly from (4) that the approximation R_t lower bounds R and thus F_t also lower bounds F.

To select the cutting planes, CPA starts from t = 0 (no cutting plane) and then iterates two steps:

1. Compute w_t by solving the reduced problem (3), which can be cast as a standard QP with t variables.
2. Add a new cutting plane (a_{t+1}, b_{t+1}) approximating the risk R at the current solution w_t.

A natural stopping condition for CPA is based on evaluating the ε-optimality condition F(w_t) - F_t(w_t) \le ε which, if satisfied, guarantees that F(w_t) - F(w^*) \le ε holds.^2 (Joachims, 2006) proved that for arbitrary ε > 0 CPA converges to the ε-optimal solution after O(1/ε^2) iterations, i.e. the number of iterations does not depend on the number of examples m. An improved analysis of CPA published recently (Teo et al., 2007) shows that the number of iterations scales only with O(1/ε). More importantly, in practice CPA usually requires only tens of iterations to reach a sufficiently precise solution.

^2 An alternative stopping condition advocated in (Joachims, 2006) halts the algorithm when R(w_t) - R_t(w_t) \le ε̂. It can be seen that both stopping conditions become equivalent if we set ε = C ε̂.
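A cutting plane of the risk R is cheap to compute from (4) and (5). The sketch below (ours, not the authors' code; it assumes the dense numpy notation introduced earlier) returns the pair (a, b) for a given point w'; the model (6) is then the point-wise maximum over the collected planes and zero.

```python
import numpy as np

def cutting_plane(w_prime, X, y):
    """Cutting plane (a, b) of the risk R at w_prime, following Eqs. (4)-(5)."""
    margins = y * (X @ w_prime)
    pi = (margins <= 1.0).astype(float)      # indicator from Eq. (5)
    a = -(X.T @ (pi * y)) / X.shape[0]       # subgradient of R at w_prime
    b = np.maximum(0.0, 1.0 - margins).mean() - a @ w_prime
    return a, b
```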
3. Optimized Cutting Plane Algorithm for SVMs (OCAS)

We first point out a source of inefficiency appearing in CPA and then propose a new method to alleviate the problem. CPA selects a new cutting plane such that the reduced problem objective F_t(w_t) monotonically increases with the number of iterations t. However, there is no such guarantee for the master problem objective F(w_t). Even though it ultimately converges to the minimum F(w^*), its value can heavily fluctuate between iterations. The reason for these fluctuations is the following. At each iteration t, CPA selects the cutting plane which perfectly approximates the master objective F at the current solution w_t. However, there is no guarantee that such a cutting plane will be an active constraint in the vicinity of the optimum w^*, nor must the new solution w_{t+1} of the reduced problem improve the master objective. In fact, it often occurs that F(w_{t+1}) > F(w_t).

To speed up the convergence of CPA, we propose a new method which we call the Optimized Cutting Plane Algorithm for SVMs (OCAS). Unlike standard CPA, OCAS aims at simultaneously optimizing the master and reduced problems' objective functions F and F_t, respectively. In addition, OCAS tries to select cutting planes that have a higher chance of actively contributing to the approximation of the master objective F around the optimum w^*. In particular, we propose the following three changes to CPA.

Change 1 We maintain the best-so-far solution w_t^b obtained during the first t iterations, i.e. F(w_1^b), ..., F(w_t^b) forms a monotonically decreasing sequence.

Change 2 The new best-so-far solution w_t^b is found by searching along the line starting at the previous best solution w_{t-1}^b and crossing the reduced problem's solution w_t, i.e.

w_t^b = w_{t-1}^b(1 - k^*) + w_t k^*, \quad \text{where } k^* = \arg\min_{k \ge 0} F(w_{t-1}^b(1 - k) + w_t k),   (7)

which can be solved exactly in O(m log m) time (see Appendix A).

Change 3 The new cutting plane is selected to approximate the master objective F at a point w_t^c which lies in the vicinity of the best-so-far solution w_t^b. In particular, the point w_t^c is computed as

w_t^c = w_t^b(1 - λ) + w_t λ,   (8)

where λ ∈ (0, 1] is a prescribed parameter. Having the point w_t^c, the new cutting plane is computed using Equation (5) such that F(w_t^c) = F_{t+1}(w_t^c). Note that although the theoretical bound on the number of iterations (see Theorem 1) does not depend on λ, its value has an impact on the convergence speed in practice. We found that the value λ = 0.1 works consistently well in all experiments. Algorithm 1 describes the proposed OCAS.
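To make Changes 2 and 3 concrete, the sketch below builds the one-dimensional line-search objective of (7) and the interpolated point (8). All names are ours, and a generic bounded scalar minimizer is used only as a naive stand-in for the exact O(m log m) procedure of Appendix A, which solves the unconstrained k >= 0 problem exactly.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_and_wc(w_best, w_t, X, y, C, lam=0.1):
    """Return the updated best-so-far point (Eq. 7) and the plane point (Eq. 8)."""
    def F(w):  # master objective (1)
        return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - y * (X @ w)).mean()
    G = lambda k: F((1.0 - k) * w_best + k * w_t)   # objective along the line
    k = minimize_scalar(G, bounds=(0.0, 10.0), method='bounded').x  # crude stand-in
    if G(k) < F(w_best):                            # keep the best-so-far monotone (Change 1)
        w_best = (1.0 - k) * w_best + k * w_t
    w_c = (1.0 - lam) * w_best + lam * w_t          # Eq. (8)
    return w_best, w_c
```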
Figure 1 shows the impact of the proposed changes on the convergence. OCAS generates a monotonically decreasing sequence of master objective values F(w_1^b), ..., F(w_t^b) and a monotonically strictly increasing sequence of reduced objective values F_1(w_1), ..., F_t(w_t). Similar to CPA, a natural stopping condition for OCAS reads

F(w_t^b) - F_t(w_t) \le ε,   (9)

where ε > 0 is a prescribed precision parameter. Satisfying condition (9) guarantees that F(w_t^b) - F(w^*) \le ε holds.

Algorithm 1 Optimized Cutting Plane Algorithm
1: Set t = 0 (i.e. there is no cutting plane at the beginning) and w_0^b = 0.
2: repeat
3:   Compute w_t by solving the reduced problem (3).
4:   Compute a new best-so-far solution w_t^b using the line-search (7).
5:   Add a new cutting plane approximating the risk R at the point w_t^c given by (8), i.e.
       a_{t+1} = -\frac{1}{m}\sum_{i=1}^{m}\pi_i y_i x_i,  b_{t+1} = R(w_t^c) - \langle a_{t+1}, w_t^c \rangle,
       where \pi_i = 1 if y_i\langle w_t^c, x_i \rangle \le 1 and \pi_i = 0 if y_i\langle w_t^c, x_i \rangle > 1.
6:   t := t + 1
7: until a stopping condition is satisfied

Theorem 1 For any ε > 0, C > 0, λ ∈ (0, 1] and any training set {(x_1, y_1), ..., (x_m, y_m)}, Algorithm 1 satisfies the stopping condition (9) after at most

\max\left\{ \frac{2C}{ε}, \frac{8 C^3 Q^2}{ε^2} \right\}   (10)

iterations, where Q = \max_{i=1,\dots,m}\|x_i\|.

Proof The proof is along the lines of the convergence analysis of standard CPA (Joachims, 2006). First, it can be shown that a violated condition (9) guarantees that adding a new cutting plane (a_t, b_t) leads to an improvement of the reduced objective δ_t = F_{t+1}(w_{t+1}) - F_t(w_t) which is not less than \min\{ε/2, ε^2/(8 C^2 Q^2)\}. Second, by exploiting that 0 \le F_t(w_t) \le F(w^*) and F(w^*) \le F(0) = C, one can conclude that the sum of improvements \sum_t δ_t cannot be greater than C. Combining these two results immediately gives the bound (10). For more details we refer to our technical report (Franc & Sonnenburg, 2007).

The bound (10) on the maximal number of iterations of OCAS coincides with the bound for CPA given in (Joachims, 2006). Despite the same theoretical bounds, in practice OCAS converges significantly faster than CPA (cf. Table 2 in the experiments section).

Figure 1. Convergence behaviour of the standard CPA vs. OCAS: master objective (F(w_t) for CPA, F(w_t^b) for OCAS) and reduced objective F_t(w_t), plotted on a logarithmic scale against the iteration number t.

3.1. Time Complexity and Parallelization

By Theorem 1 the number of iterations of OCAS does not depend on the number of examples m. Hence the overall time complexity is given by the effort required per iteration, which is O(mn + m log m), in practice approximately O(mn) since log(m) is much smaller than the data dimensionality n. The per-iteration complexity of the subtasks, and how they can be effectively parallelized, is detailed below.

Output computation involves computing the dot products \langle w_t, x_i \rangle, i = 1, ..., m, which requires O(s) time, where s equals the number of non-zero elements in the training examples. Distributing the computation equally over p processors leads to O(s/p) time.

Line-search The dominant part is sorting |K| numbers (|K| \le m, see Appendix A for details), which can be done in O(|K| log |K|). A speedup can be achieved by parallelizing the sorting over p processors, reducing the complexity to O((|K| log |K|)/p). Note that our implementation of OCAS uses quicksort, whose worst-case complexity is O(|K|^2), although its expected run-time is O(|K| log |K|).

Cutting plane computation The dominant part is computing the sum -\frac{1}{m}\sum_{i=1}^{m}\pi_i y_i x_i, which can be done in O(s') time, where s' is the number of non-zero elements in the training examples for which \pi_i is non-zero. Using p processors leads to O(s'/p) time.

Reduced problem The size of the reduced problem (3) is upper bounded by the number of iterations, which is invariant to the dataset size; hence it requires O(1) time. Though solving the reduced problem cannot easily be parallelized, it does not constitute the bottleneck, as the number of iterations required in practice is small (cf. Table 2).
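Putting the pieces together, the following self-contained numpy/scipy sketch mirrors the control flow of Algorithm 1. It is an illustrative prototype under the stated assumptions, not the authors' optimized parallel C implementation: the reduced problem (3) is solved through its small dual QP with a generic SLSQP call instead of the SMO solver used in the paper, and the exact line-search of Appendix A is replaced by a bounded scalar minimizer.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def ocas_sketch(X, y, C, eps=1e-3, lam=0.1, max_iter=200):
    """Illustrative prototype of Algorithm 1 for the unbiased linear SVM (1)."""
    m, n = X.shape
    w_b = np.zeros(n)                  # best-so-far solution, w_0^b = 0
    planes_a, planes_b = [], []        # cutting planes (a_i, b_i) of the risk R

    def F(w):                          # master objective (1)
        return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - y * (X @ w)).mean()

    for t in range(max_iter):
        # Step 3: reduced problem (3).  With the planes collected so far,
        # F_t(w) = 0.5||w||^2 + C max{0, max_i <a_i,w>+b_i}; its dual is a QP over
        # beta >= 0, sum(beta) <= 1, with the primal solution w = -C A^T beta.
        if planes_a:
            A, b = np.array(planes_a), np.array(planes_b)
            K = A @ A.T
            res = minimize(lambda be: 0.5 * C**2 * be @ K @ be - C * b @ be,
                           np.ones(len(b)) / len(b),
                           bounds=[(0.0, None)] * len(b),
                           constraints=[{'type': 'ineq',
                                         'fun': lambda be: 1.0 - be.sum()}],
                           method='SLSQP')
            w_t = -C * (A.T @ res.x)
            F_t = 0.5 * w_t @ w_t + C * max(0.0, np.max(A @ w_t + b))
        else:
            w_t, F_t = np.zeros(n), 0.0

        # Step 4: line-search (7); bounded minimizer as a stand-in for Appendix A.
        G = lambda k: F((1.0 - k) * w_b + k * w_t)
        k = minimize_scalar(G, bounds=(0.0, 10.0), method='bounded').x
        if G(k) < F(w_b):
            w_b = (1.0 - k) * w_b + k * w_t

        # Stopping condition (9).
        if F(w_b) - F_t <= eps:
            break

        # Step 5: add a cutting plane of R at w_t^c, Eqs. (8) and (5).
        w_c = (1.0 - lam) * w_b + lam * w_t
        margins = y * (X @ w_c)
        pi = (margins <= 1.0).astype(float)
        a_new = -(X.T @ (pi * y)) / m
        planes_a.append(a_new)
        planes_b.append(np.maximum(0.0, 1.0 - margins).mean() - a_new @ w_c)

    return w_b
```

For datasets of the size used in the paper one would of course use sparse-matrix routines, the exact line-search and a dedicated QP solver; the point of the sketch is only the structure of the iteration.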
4. Experiments

We now compare current state-of-the-art SVM solvers (SGD, Pegasos, SVMlight, SVMperf, BMRM)^3 with the proposed method (OCAS) on a variety of datasets, using five carefully crafted experiments measuring:

1. Training time and objective for the optimal C
2. Speed of convergence (time vs. objective)
3. Time to perform a full model selection
4. Scalability w.r.t. dataset size
5. Effects of parallelization

To this end we implemented OCAS and the standard CPA^4 in C. We use the very general compressed sparse column (CSC) representation to store the data; each element is represented by an index and a value (each 64 bit). To solve the reduced problem (3), we use our implementation of improved SMO (Fan et al., 2005).

^3 SGD version 1.1 (svmsgd2) http://leon.bottou.org/projects/sgd, SVMlight 6.01 and SVMperf 2.1 http://svmlight.joachims.org, Pegasos http://ttic.uchicago.edu/shai/code/, BMRM version 0.01 http://users.rsise.anu.edu.au/chteo/BMRM.html.
^4 So as not to measure implementation-specific effects (solver, dot-product computation, etc.).

4.1. Experimental Setup

The datasets used throughout the experiments are summarized in Table 1. We augmented the Cov1, CCAT and Astro datasets from (Joachims, 2006) by MNIST, an artificial dense dataset and two larger bioinformatics splice datasets for worm and human. The artificial dataset was generated from two Gaussians with different diagonal covariance matrices at multiple scales.

Table 1. Datasets used in the experimental evaluation. Sp denotes the average number of non-zero elements of a dataset in percent. Split describes the size of the train/validation/test sets in percent. Datasets are available from the following URLs: MNIST http://yann.lecun.com/exdb/mnist/, Cov1 http://kdd.ics.uci.edu/databases/covertype/covertype.html, CCAT http://www.daviddlewis.com/resources/testcollections/rcv1/, Worm and Human http://www.fml.tuebingen.mpg.de/raetsch/projects/lsmkl

Dataset      Examples      Dim      Sp (%)   Split (%)
MNIST        70,000        784      19       77/09/14
Astro        99,757        62,369   0.08     43/05/52
Artificial   150,000       500      100      33/33/33
Cov1         581,012       54       22       81/09/10
CCAT         804,414       47,236   0.16     87/10/03
Worm         1,026,036     804      25       80/05/15
Human        15,028,326    564      25       -

If not otherwise stated, experiments were performed on a 2.4GHz AMD Opteron Linux machine. We disabled the bias term in the comparison. As stopping conditions we use the defaults: ε_light = ε_gpdt = 0.001, ε_perf = 0.1 and ε_bmrm = 0.001. For OCAS we used the same stopping condition as implemented in SVMperf, i.e., (F(w) - F_t(w))/C \le ε_perf/100 = 10^{-3}. Note that these tolerances have very different meanings, denoting the maximum KKT violation for SVMlight, the maximum tolerated violation of constraints for SVMperf, and the relative duality gap for BMRM.
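Because these tolerances are not directly comparable, it is worth spelling out the relative gap used for OCAS and SVMperf. The tiny helper below is our own restatement of the criterion quoted above, not code from either package.

```python
def ocas_converged(F_master, F_reduced, C, tol=1e-3):
    """Relative-gap stopping rule used for OCAS in the experiments:
    (F(w_t^b) - F_t(w_t)) / C <= tol, with tol = eps_perf / 100 = 1e-3."""
    return (F_master - F_reduced) / C <= tol
```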
For SGD we fix the number of iterations to 10, and for Pegasos we use 100/λ iterations, as suggested in (Shwartz et al., 2007). For the regularization parameter C and for λ we use the following relations: λ = 1/C, C_perf = C/100, C_bmrm = C and C_light = C/m. Throughout the experiments we use C as a shortcut for C_light.^5

^5 The exact command lines are: svm_perf_learn -l 2 -m 0 -t 0 --b 0 -e 0.1 -c Cperf; pegasos -lambda λ -iter 100/λ -k 1; svm_learn -m 0 -t 0 -b 0 -e 1e-3 -c Clight; bmrm-train -r 1 -m 10000 -i 999999 -e 1e-3 -c Cbmrm; svmsgd2 -lambda λ -epochs 10.

Figure 2. Objective value vs. training time of CPA (red), SGD (green) and OCAS (blue), measured on datasets of different sizes (panels: Astro, CCAT, Covertype, MNIST, Worm, Artificial; axes: training time [s] vs. relative objective (obj - min)/obj). The dashed line shows the time required to run SGD for 10 iterations. OCAS was stopped when the precision fell below 10^{-6} or when the training time of CPA was reached. In all cases OCAS achieves the minimal objective value, and on half of the datasets it outperforms all other methods, including SGD, already from the beginning.
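The relations above can be collected in a small helper. This is our own convenience function, not part of any of the packages; it simply restates the conversions given in the text, with C here denoting the C of problem (1).

```python
def solver_parameters(C, m):
    """Map the C of problem (1) to the parameters passed to the individual solvers."""
    return {
        'lambda_pegasos_sgd': 1.0 / C,   # lambda = 1/C
        'C_perf': C / 100.0,             # SVMperf
        'C_bmrm': C,                     # BMRM
        'C_light': C / m,                # SVMlight; the C reported in the tables
    }
```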
4.2. Evaluation

In the following paragraphs we run and evaluate the aforementioned experiments 1-5.

Training time and objective for optimal C We trained all methods on all but the human splice dataset using the training data, measured the training time (in seconds) and computed the unconstrained objective value F(w). The obtained results are displayed in Table 2. The proposed method, OCAS, consistently outperforms all its competitors in the accurate solver category on all benchmark datasets in terms of training time while obtaining a comparable (often the best) objective value. BMRM and SVMperf implement the same CPA algorithm, but due to implementation-specific details results can differ. Our implementation of CPA gives very similar results (not shown).^6 Note that for SGD, Pegasos (and SVMperf 2.0, not shown) the objective value sometimes deviates significantly from the true objective. As a result, the learned classifier may differ substantially from the optimal parameter w^*. However, as training times for SGD are significantly below all others, it remains unclear whether SGD achieves the same precision using less time when run for further iterations. An answer to this question is given in the next paragraph.

^6 In contrast to SVMperf, BMRM and our implementation of CPA did not converge for large C on worm even after 5000 iterations. Most likely the core solver of SVMperf is more robust.

Table 2. Comparison of OCAS against other SVM solvers. "-" means not converged, a blank cell means not attempted. The first number in each cell is the unconstrained SVM objective value of Eq. (1); the numbers in parentheses denote the number of iterations and the training time in seconds. Lower timing and objective values mean "better". All methods solve the unbiased problem; as convergence criteria the standard settings described in Section 4.1 are used. On MNIST Pegasos ran into numerical problems. OCAS clearly outperforms all of its competitors in the accurate solver category by a large margin, achieving a similar and often the lowest objective value. The objective value obtained by SGD and Pegasos is often far away from the optimal solution, cf. the text for further discussion.

           astro                  ccat                     cov1                        mnist                     worm                    artificial
SVMlight   2.0939e+03 (2972/22)   8.1235e+04 (77429/5295)  2.5044e+06 (1027310/41531)  6.7118e+05 (622391/2719)
SVMperf    2.1180e+03 (38/2)      8.1744e+04 (228/228)     2.5063e+06 (520/152)        6.7245e+05 (1295/228)     3.2224e+04 (2029/4436)  1.3186e+02 (709/162)
BMRM       2.1152e+03 (42/2)      8.1682e+04 (327/248)     2.5060e+06 (678/225)        6.7250e+05 (2318/4327)    -
OCAS       2.1103e+03 (21/1)      8.1462e+04 (48/25)       2.5045e+06 (80/10)          6.7158e+05 (137/10)       3.1920e+04 (125/258)    1.3172e+02 (76/13)
Pegasos    2.1090e+03 (2689K/4)   8.1564e+04 (70M/127)     2.5060e+06 (470M/460)       Error (270M/647)          4.6212e+04 (82M/213)    1.3120e+03 (25K/1)
SGD        2.2377e+03 (10/1)      8.2963e+04 (10/4)        2.6490e+06 (10/1)           1.3254e+06 (10/1)         2.1299e+05 (10/9)       1.8097e+02 (10/2)

Speed of convergence (time vs. objective) To address this question we re-ran the best methods, CPA, OCAS and SGD, recording intermediate progress, i.e. time and objective at several time points during optimization. The results are shown in Figure 2. OCAS was stopped when reaching the maximum time or a precision of 1 - F(w^*)/F(w) \le 10^{-6}, and in all cases achieved the minimum objective. On three of the six datasets OCAS achieves the best objective not only, as expected, at a later time point, but already from the very beginning. Further analysis made clear that OCAS wins over SGD in cases where a large C is used and the optimization problem is thus more difficult. Still, plain SGD outcompetes even CPA.

One may argue that in practice the true objective is not the unconstrained SVM primal value (1) but the performance on a validation set, i.e. optimization is stopped when the validation error no longer changes. One should, however, note that in this case one does not obtain an SVM but some other classifier, and a comparison should then not be limited to SVM solvers but should be open to any other large-scale approach, such as on-line algorithms (e.g. perceptrons), too. We argue that to compare SVM solvers in a fair way one needs to compare objective values. As it is still interesting to see how the methods perform w.r.t. classification performance, we analyze them under this criterion in the next paragraph.

Time to perform a full model selection When using SVMs in practice, their C parameter needs to be tuned via model selection. We therefore train all methods with different settings^7 of C on the training part of all datasets, evaluate them on the validation set and choose the best model to do predictions on the test set. As performance measure we use the area under the receiver operating characteristic curve (auROC) (Fawcett, 2003). Again, among the accurate methods OCAS outperforms its competitors by a large margin, followed by SVMperf. Note that for all accurate methods the performance is very similar and has little variance. Except for the artificial dataset, plain SGD is clearly fastest while achieving similar accuracy. However, the optimal parameter settings for accurate SVMs and SGD are different: accurate SVM solvers use a larger C constant than SGD. For lower C the objective function is dominated by the regularization term \|w\|. A potential explanation is that SGD's update rule puts more emphasis on the regularization term, and SGD, when not run for a large number of iterations, implies early stopping.

^7 For Worm and Artificial we used C = 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5; for CCAT, Astro and Cov1 C = 0.1, 0.5, 1, 5, 10; and for MNIST C = 1, 5, 10, 50, 100.

Table 3. Comparison of OCAS against other SVM solvers in model selection. "-" means not converged, a blank cell means not attempted. Reported is the area under the receiver operating characteristic curve (auROC) for the best model obtained via model selection over a wide range of regularization constants C. As there is little variance, for the accurate SVM solvers only the mean and standard deviation of the auROC are shown (first row); for each solver the optimal C and the training time in seconds for the whole model selection are given. SGD is clearly fastest, achieving similar performance on all but the artificial dataset; however, it often selects a smaller C than the accurate SVMs, an indication that the learned decision function is only remotely SVM-like. Among the accurate solvers OCAS clearly outperforms its competitors. It should be noted that training times for all accurate methods are dominated by training for large C (see Table 2 for training times at the optimal C). For further discussion see the text.

                              astro           ccat             cov1             mnist            worm               artificial
accurate SVMs (auROC)         98.15 ± 0.00    98.51 ± 0.01     83.92 ± 0.01     95.86 ± 0.01     99.45 ± 0.00       86.38 ± 0.02
SVMlight  (C / time s)        1 / 152         1 / 124700       10 / 282703      10 / 9247
SVMperf   (C / time s)        1 / 13          1 / 1750         5 / 781          10 / 887         1 / 22983          0.005 / 24520
BMRM      (C / time s)        1 / 17          1 / 2735         10 / 1562        10 / 20278
OCAS      (C / time s)        1 / 4           1 / 163          50 / 51          10 / 35          0.1 / 1438         0.005 / 6740
Pegasos (auROC; C / time s)   98.15; 1 / 59   98.51; 1 / 2031  83.89; 5 / 731   95.84; 5 / 2125  99.27; 5 / 1438    78.35; 5 / 201
SGD     (auROC; C / time s)   98.13; 0.5 / 1  98.52; 1 / 20    83.88; 1 / 5     95.71; 1 / 3     99.43; 0.005 / 69  80.88; 0.005 / 7
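The model-selection protocol behind Table 3 can be sketched as follows. Here train_linear_svm is a hypothetical stand-in for any of the solvers above (assumed to return a weight vector); only the selection logic is meant to be illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def model_selection(train_linear_svm, X_tr, y_tr, X_val, y_val, X_te, y_te, C_grid):
    """Pick C on the validation set by auROC, then report the test auROC."""
    best_auc, best_C, best_w = -np.inf, None, None
    for C in C_grid:
        w = train_linear_svm(X_tr, y_tr, C)       # hypothetical solver call
        auc = roc_auc_score(y_val, X_val @ w)     # rank by the real-valued output <w, x>
        if auc > best_auc:
            best_auc, best_C, best_w = auc, C, w
    return best_C, roc_auc_score(y_te, X_te @ best_w)

# usage, e.g. with the grid used for CCAT/Astro/Cov1:
# best_C, test_auc = model_selection(train_linear_svm, X_tr, y_tr,
#                                    X_val, y_val, X_te, y_te,
#                                    C_grid=[0.1, 0.5, 1, 5, 10])
```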
Scalability w.r.t. Dataset Size In this section we investigate how the computational time of OCAS, CPA and SGD scales with the number of examples on the worm splice dataset, for sizes from 100 to 1,026,036. We again use our implementation of CPA, which shares essential sub-routines with OCAS. Results are shown and discussed in Figure 3.

Figure 3. Scaling of the methods with dataset size on the worm splice dataset (log-log plot of training time [s] vs. dataset size for CPA, SGD and OCAS, with a linear reference line). The slope of the "lines" in this figure denotes the exponent e in O(m^e), where the black line denotes linear effort O(m). Both OCAS and SGD scale about linearly. Note that SGD is much faster, as it runs for a fixed number of iterations and thus does early stopping.

Effects of Parallelization As OCAS training times are very low on the above datasets, we also apply OCAS to the 15 million example human splice dataset. Using a 2.4GHz 16-core AMD Opteron Linux machine we run OCAS with C = 0.0001 on 1 to 16 CPUs and show, for each of the subtasks, the time accumulated over all threads, together with the total training time and the achieved speedup w.r.t. the single-CPU algorithm, in Table 4. As can be seen, the computations distribute nicely, except for the line-search. Using 8 CPU cores the speedup saturates at a factor of 4.5, most likely because memory access becomes the bottleneck (for 8 CPUs the output computation alone creates a load of 28GB/s just on memory reads).

Table 4. Speedups due to parallelizing OCAS, achieved on the 15 million example human splice dataset. Times are accumulated per subtask, in seconds.

CPUs   speedup   line-search (s)   a_t (s)   output (s)   total (s)
1      1.00      238               270       2476         3087
2      1.77      184               155       1300         1742
4      3.09      178               80        640          998
8      4.50      139               49        397          684
16     4.60      117               45        410          671
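As a small illustration (ours, not the authors' code) of how the dominant per-iteration subtask, computing the outputs <w, x_i> for all training examples, can be split across CPUs as described in Section 3.1, the sketch below evaluates X @ w in row blocks with a thread pool. For large dense blocks numpy's BLAS-backed product releases the GIL, so threads can give a real speedup, up to the memory-bandwidth limit observed in Table 4.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_outputs(X, w, n_threads=4):
    """Compute X @ w by splitting the rows of X across n_threads workers."""
    blocks = np.array_split(np.arange(X.shape[0]), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda idx: X[idx] @ w, blocks)
    return np.concatenate(list(parts))

# usage: outputs = parallel_outputs(X, w_t, n_threads=8)
```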
5. Conclusions

We have developed a new linear SVM solver called OCAS, which outperforms current state-of-the-art SVM solvers by several orders of magnitude. Even in the early optimization steps, OCAS often shows faster convergence than the approximative methods that have so far dominated this domain. By parallelizing the subtasks of the algorithm, OCAS gained additional speedups of factors up to 4.6 on a multi-core multiprocessor machine. Using OCAS we were able to train on a dataset of 15 million examples (about 32GB in size) in just 671 seconds. As extensions to one-class and multi-class SVMs are straightforward, we plan to implement them in the near future. Furthermore, OCAS can be extended to work with a bias term. Finally, it will be future work to investigate how the kernel framework can be incorporated into OCAS and how the O(1/ε) result of (Teo et al., 2007) can be applied to OCAS. An implementation of OCAS is available within the shogun toolbox http://www.shogun-toolbox.org and as a separate library from http://ida.first.fraunhofer.de/franc/ocas.

Acknowledgements

The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU 506778). VF was supported by Marie Curie Intra-European Fellowship grant SCOLES (MEIF-CT-2006-042107). We thank A. Zien, G. Rätsch and G. Blanchard for great discussions.

References

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS 20. MIT Press.
Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, UK: Cambridge University Press.
Fan, R.-E., Chen, P.-H., & Lin, C.-J. (2005). Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6, 1889-1918.
Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Laboratories, Palo Alto, CA, USA.
Franc, V., & Sonnenburg, S. (2007). Optimized cutting plane algorithm for SVMs. Research Report, Electronic Publication 1, Fraunhofer Institute FIRST.
Joachims, T. (1999). Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning (pp. 169-184). Cambridge, MA, USA: MIT Press.
Joachims, T. (2006). Training linear SVMs in linear time. KDD '06.
Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2007). Trust region Newton methods for large-scale logistic regression. ICML '07 (pp. 561-568). ACM, New York.
Shwartz, S.-S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. ICML '07 (pp. 807-814). ACM Press.
Sindhwani, V., & Keerthi, S.-S. (2007). Newton methods for fast solution of semi-supervised linear SVMs. In Large Scale Kernel Machines. MIT Press.
Teo, C. H., Le, Q., Smola, A., & Vishwanathan, S. (2007). A scalable modular convex solver for regularized risk minimization. KDD '07.
Zanni, L., Serafini, T., & Zanghirati, G. (2006). Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7, 1467-1492.
A. Computing the Line-search Efficiently

The line-search (7) is an essential procedure of OCAS which is called at every iteration. We show that the line-search can be solved exactly in O(m log m) time. First, we introduce a more compact notation for the objective function of the line-search problem (7), F(w_{t-1}^b(1-k) + w_t k), by

G(k) = g_0(k) + \sum_{i=1}^{m} g_i(k), \quad \text{where } g_0(k) = \frac{1}{2}k^2 A_0 + k B_0 + C_0 \text{ and } g_i(k) = \max\{0, k B_i + C_i\},

with A_0 = \|w_{t-1}^b - w_t\|^2, B_0 = \langle w_{t-1}^b, w_t - w_{t-1}^b \rangle, C_0 = \frac{1}{2}\|w_{t-1}^b\|^2, B_i = \frac{C}{m} y_i \langle x_i, w_{t-1}^b - w_t \rangle and C_i = \frac{C}{m}(1 - y_i \langle x_i, w_{t-1}^b \rangle).

Hence the line-search (7) amounts to solving k^* = \arg\min_{k \ge 0} G(k) and computing w_t^b = w_{t-1}^b(1 - k^*) + w_t k^*. As the function G is convex, the unconstrained minimum of G is attained at a point k at which the subdifferential \partial G(k) contains zero, i.e. 0 \in \partial G(k). The subdifferential of G is

\partial G(k) = k A_0 + B_0 + \sum_{i=1}^{m}\partial g_i(k), \quad \text{where } \partial g_i(k) = 0 \text{ if } kB_i + C_i < 0, \; \partial g_i(k) = B_i \text{ if } kB_i + C_i > 0, \; \partial g_i(k) = [0, B_i] \text{ if } kB_i + C_i = 0.

Note that the subdifferential is not a function, as there exist k for which \partial G(k) is an interval.

Figure 4. Illustration of the subdifferential \partial G(k) of the objective function G(k) minimized in the line-search: a non-decreasing step function of k with jumps of size |B_i| at the points k_i; the optimum k^* is where it crosses zero.

The first term of the subdifferential \partial G(k), the linear function k A_0 + B_0, is ascending, since A_0 must be greater than zero (A_0 is zero only if the algorithm has converged, in which case the line-search is not invoked). The term \partial g_i(k) is either constantly zero, if B_i = 0, or it is a step-like jump whose value changes at the point k_i = -C_i/B_i. The value of \partial g_i(k) w.r.t. k is summarized in Table 5. Hence the subdifferential \partial G(k) is monotonically increasing in k, as illustrated in Figure 4.

Table 5. The value of \partial g_i(k) with respect to k.

           k < k_i    k = k_i     k > k_i
B_i = 0    0          0           0
B_i < 0    B_i        [B_i, 0]    0
B_i > 0    0          [0, B_i]    B_i

To solve k^* = \arg\min_{k \ge 0} G(k) we proceed as follows. If max(\partial G(0)) is strictly greater than zero, then the unconstrained minimum \arg\min_k G(k) is attained at a point less than or equal to 0; thus the constrained minimum is attained at the point k^* = 0. If max(\partial G(0)) is less than zero, then the optimum k^* corresponds to the unconstrained optimum \arg\min_k G(k), attained at the intersection of the graph of \partial G(k) with the x-axis. This point can be found efficiently by sorting K = {k_i | k_i > 0, i = 1, ..., m} and checking the condition 0 \in \partial G(k) for k \in K and for k in the intervals into which the points of K split the domain (0, \infty). These computations are dominated by sorting the numbers in K, which takes O(|K| log |K|) time.
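The procedure above translates almost directly into code. The following numpy sketch (ours, not the authors' C implementation) computes k^* exactly by sorting the positive kink points and walking along the non-decreasing subdifferential until it crosses zero; ties with C_i = 0 are resolved by treating the corresponding terms as active for k slightly greater than zero.

```python
import numpy as np

def exact_line_search(w_prev, w_t, X, y, C):
    """Exact minimization of G(k) = F(w_prev (1-k) + w_t k) over k >= 0 (Appendix A)."""
    m = X.shape[0]
    d = w_t - w_prev
    A0 = d @ d                                   # A_0 = ||w_prev - w_t||^2
    B0 = w_prev @ d                              # B_0 = <w_prev, w_t - w_prev>
    Bi = (C / m) * y * (X @ (-d))                # B_i = (C/m) y_i <x_i, w_prev - w_t>
    Ci = (C / m) * (1.0 - y * (X @ w_prev))      # C_i = (C/m)(1 - y_i <x_i, w_prev>)

    # subgradient just to the right of k = 0: terms with k*B_i + C_i > 0 are active
    active = (Ci > 0) | ((Ci == 0) & (Bi > 0))
    S = Bi[active].sum()
    if B0 + S >= 0:                              # max dG(0) >= 0  =>  k* = 0
        return 0.0

    # positive kink points k_i = -C_i/B_i, each adding a jump of size |B_i|
    nz = Bi != 0
    k_i = -Ci[nz] / Bi[nz]
    jumps = np.abs(Bi[nz])
    pos = k_i > 0
    order = np.argsort(k_i[pos])                 # the O(|K| log |K|) sort
    k_sorted, j_sorted = k_i[pos][order], jumps[pos][order]

    for kk, jump in zip(k_sorted, j_sorted):
        k_zero = -(B0 + S) / A0                  # zero of the linear piece A0*k + B0 + S
        if k_zero <= kk:                         # crossing happens before the next kink
            return k_zero
        S += jump                                # cross the kink: subgradient jumps by |B_i|
        if A0 * kk + B0 + S >= 0:                # zero is contained in the jump at kk
            return kk
    return -(B0 + S) / A0                        # crossing lies beyond the last kink

# w_best_new = (1 - k) * w_prev + k * w_t, with k = exact_line_search(w_prev, w_t, X, y, C)
```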