A Randomized Algorithm for Large Scale Support Vector Learning

Anonymous Author(s)
Affiliation
Address
email

Abstract

We propose a randomized algorithm for large scale SVM learning which solves the problem by iterating over random subsets of the data. Crucial to the scalability of the algorithm is the size of the subsets chosen. In the context of text classification we show that, by using ideas from random projections, a sample size of $O(\log n)$ can be used to obtain a solution which is close to the optimal with a high probability. Experiments done on synthetic and real life data sets demonstrate that the algorithm scales up SVM learners, without loss in accuracy.

1 Introduction

Consider a training data set $D = \{(x_i, y_i)\}$, $i = 1,\ldots,n$, $y_i \in \{+1,-1\}$, where the $x_i \in \mathbb{R}^d$ are data points and the $y_i$ specify the class labels. The problem of learning the classifier $y = \mathrm{sign}(w^T x + b)$ can be narrowed down to computing $\{w, b\}$ such that it has good generalization ability. The SVM formulation for classification, which will be called C-SVM, for determining $\{w, b\}$ is given by [1]:

C-SVM-1:
$$ \underset{(w,b,\xi)}{\text{Minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i $$
$$ \text{Subject to: } \quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\ldots,n $$

At optimality $w$ is given by
$$ w = \sum_{i:\alpha_i>0} \alpha_i y_i x_i, \qquad 0 \le \alpha_i \le C. $$

Consider the set $S = \{x_i \mid \alpha_i > 0\}$; the elements of this set are called the Support Vectors. Note that $S$ completely determines the solution of C-SVM. The set $S$ may not be unique, though $w$ is. Define a parameter $\psi$ to be the minimum cardinality over all such sets $S$. Note that $\psi$ does not change with the number of examples $n$, and is often much less than $n$.

More generally, the C-SVM problem can be seen as an instance of an Abstract Optimization Problem (AOP) [2, 3, 4]. An AOP is defined as follows: an AOP is a triple $(H, <, \Phi)$, where $H$ is a finite set, $<$ a total ordering on $2^H$, and $\Phi$ an oracle that, for a given $F \subseteq G \subseteq H$, either reports $F = \min_<\{F' \mid F' \subseteq G\}$ or returns a set $F' \subseteq G$ with $F' < F$. Every AOP has a combinatorial dimension associated with it; the combinatorial dimension captures the notion of the number of free variables of that AOP. An AOP can be solved by a randomized algorithm which iterates over subsets of size greater than the combinatorial dimension of the problem [2]. For SVM, $\psi$ is the combinatorial dimension of the problem; by iterating over subsets of size greater than $\psi$, the subsets chosen using random sampling, the problem can be solved efficiently [3, 4]; this algorithm was called RandSVM by the authors. A priori the value of $\psi$ is not known, but for linearly separable classification problems the following holds: $\psi \le 2(d+1)$. This follows from the fact that the dual problem is the minimum distance between two non-overlapping convex hulls [5]. When the problem is not linearly separable, the authors use the reduced convex hull formulation [5] to come up with an estimate of the combinatorial dimension; this estimate is not very clear and is much higher than $d$ (details of this calculation are given in the supporting material). The algorithm RandSVM (presented in the supporting material) iterates over subsets of size proportional to $\psi^2$.

RandSVM is not practical for the following reasons: the sample size is too large in the case of high dimensional data sets; the dimension of the feature space is usually unknown when using kernels; and the reduced convex hull method used to calculate the combinatorial dimension, when the data is not separable in the feature space, is not really useful, as the number obtained is very large. This work overcomes the above problems using ideas from random projections [6, 7, 8] and randomized algorithms [9, 10, 2, 11].
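As a quick illustration of the observation that the number of support vectors is typically much smaller than $n$, the following sketch (ours, not part of the original algorithm; it assumes scikit-learn is available and uses an artificial, well-separated problem) trains a standard SVM on increasingly large samples and reports the support vector count.

```python
# Illustrative sketch (not from the paper): on a well-separated two-class problem,
# the number of support vectors stays nearly constant as the number of training
# examples n grows, which is the observation that motivates sampling-based training.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d = 20
for n in [1_000, 5_000, 25_000]:
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, d)) + 1.0 * y[:, None]   # two well-separated Gaussians
    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print(n, clf.n_support_.sum())                   # support-vector count grows very slowly
```

On such data the count stays nearly flat as $n$ grows, which is exactly the behaviour that sampling-based methods like RandSVM exploit.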
As mentioned by the authors of RandSVM, the biggest bottleneck in their algorithm is the value of $\psi$, as it is too large. The main contribution of this work is, using ideas from random projections, the conjecture that if RandSVM is solved using $\psi$ equal to $O(\log n)$, then the solution obtained is close to optimal with high probability (Theorem 3), particularly for linearly separable and almost separable data sets. Almost separable data sets are those which become linearly separable when a small number of properly chosen data points are deleted from them. The second contribution is an algorithm which, using ideas from randomized algorithms for Linear Programming (LP), solves the SVM problem by using samples of size linear in $\psi$. This work also shows that the theory can be applied to non-linear kernels.

2 A New Randomized Algorithm for Classification

This section uses results from random projections and randomized algorithms for linear programming to develop a new algorithm for solving large scale SVM classification problems. In Section 2.1, we discuss the case of linearly separable data, estimate the number of support vectors required such that the margin is preserved with high probability, and show that this number is much smaller than the data dimension $d$, using ideas from random projections. In Section 2.2, we look at how the analysis applies to almost separable data and present the main result of the paper (Theorem 3). The section ends with a discussion on the application of the theory to non-linear kernels. In Section 2.3, we present the randomized algorithm for SVM learning.

2.1 Linearly separable data

We start by determining the dimension $k$ of the target space such that, on performing a random projection to that space, the Euclidean distances and dot products are preserved. The appendix contains a few results from random projections which will be used in this section.

For a linearly separable data set $D = \{(x_i, y_i), i = 1,\ldots,n\}$, $x_i \in \mathbb{R}^d$, $y_i \in \{+1,-1\}$, the C-SVM formulation is the same as C-SVM-1 with $\xi_i = 0$, $i = 1,\ldots,n$. By dividing all the constraints by $\|w\|$, the problem can be reformulated as follows:

C-SVM-2a:
$$ \underset{(\hat{w},\hat{b},l)}{\text{Maximize}} \quad l $$
$$ \text{Subject to: } \quad y_i(\hat{w}\cdot x_i + \hat{b}) \ge l,\quad i = 1,\ldots,n, \qquad \|\hat{w}\| = 1 $$

where $\hat{w} = w/\|w\|$, $\hat{b} = b/\|w\|$, and $l = 1/\|w\|$. Here $l$ is the margin induced by the separating hyperplane, that is, it is the distance between the two supporting hyperplanes.

The determination of $k$ proceeds as follows. First, for any given value of $k$, we show the change in the margin as a function of $k$ when the data points are projected onto a $k$-dimensional subspace and the problem is solved there. From this, we determine the value of $k$ ($k \ll d$) which will preserve the margin with a very high probability. In a $k$-dimensional subspace, there are at most $k+1$ support vectors. Using the idea of orthogonal extensions (the definition appears later in this section), we prove that, when the problem is solved in the original space using an estimate of $k+1$ on the number of support vectors, the margin is preserved with a very high probability.
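Throughout this section the random projection has the explicit form $x_i' = R^T x_i/\sqrt{k}$ with $R$ a random Gaussian matrix. The following short sketch (illustrative only; the dimensions are arbitrary) shows that such a projection approximately preserves dot products, which is the property the analysis below relies on.

```python
# A minimal sketch of the random projection used in this section: entries of R are
# i.i.d. N(0,1) and the projected point is x' = (R^T x) / sqrt(k).  Dot products
# (and hence margins) are approximately preserved, as formalised by the lemmas in
# the Appendix.  The dimensions below are only illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 5000, 300
X = rng.normal(size=(n, d))
R = rng.normal(size=(d, k))           # random projection matrix
Xp = X @ R / np.sqrt(k)               # projected points x'_i = R^T x_i / sqrt(k)

G, Gp = X @ X.T, Xp @ Xp.T            # Gram matrices before and after projection
err = np.abs(G - Gp).max() / np.abs(G).max()
print(f"max relative distortion of dot products: {err:.3f}")
```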
Let $w'$ and $x_i'$, $i = 1,\ldots,n$, be the projections of $w$ and $x_i$, $i = 1,\ldots,n$, respectively, onto a $k$-dimensional subspace (as in Lemma 2, Appendix). The classification problem in the projected space, with data set $D' = \{(x_i', y_i), i = 1,\ldots,n\}$, $x_i' \in \mathbb{R}^k$, $y_i \in \{+1,-1\}$, can be written as follows:

C-SVM-2b:
$$ \underset{(w',\hat{b}',l')}{\text{Maximize}} \quad l' $$
$$ \text{Subject to: } \quad y_i(w'\cdot x_i' + \hat{b}') \ge l',\quad i = 1,\ldots,n, \qquad \|w'\| \le 1 $$

where $l' = l(1-\epsilon)$, $\epsilon$ is the distortion, and $0 < \epsilon < 1$. The following theorem predicts, for a given value of $\epsilon$, the $k$ such that the margin is preserved with a high probability upon projection.

Theorem 1. Let $L = \max_i \|x_i\|$, and let $(w^*, b^*, l^*)$ be the optimal solution of C-SVM-2a. Let $R$ be a random $d \times k$ matrix as given in Lemma 2 (Appendix). Let $w' = \frac{R^T w^*}{\sqrt{k}}$ and $x_i' = \frac{R^T x_i}{\sqrt{k}}$, $i = 1,\ldots,n$. If
$$ k \ge \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2l^*}\Big)^2 \log\frac{4n}{\delta}, \qquad 0 < \epsilon < 1,\; 0 < \delta < 1, $$
then the following bound holds on the optimal margin $l_P$ obtained by solving the problem C-SVM-2b:
$$ P\big(l_P \ge l^*(1-\epsilon)\big) \ge 1 - \delta. $$

Proof. From Corollary 1 of Lemma 2 (Appendix), we have
$$ w^*\cdot x_i - \frac{\beta}{2}(1+L^2) \;\le\; w'\cdot x_i' \;\le\; w^*\cdot x_i + \frac{\beta}{2}(1+L^2), $$
which holds with probability at least $1 - 4e^{-\beta^2 k/8}$, for some $\beta > 0$. Consider some example $x_i$ with $y_i = 1$. Then the following holds with probability at least $1 - 2e^{-\beta^2 k/8}$:
$$ w'\cdot x_i' + b^* \;\ge\; w^*\cdot x_i - \frac{\beta}{2}(1+L^2) + b^* \;\ge\; l^* - \frac{\beta}{2}(1+L^2). $$
Note that from Lemma 1 (Appendix) we have $\sqrt{1-\beta}\,\|w^*\| \le \|w'\| \le \sqrt{1+\beta}\,\|w^*\|$ with probability at least $1 - 2e^{-\beta^2 k/8}$. Since $\|w^*\| = 1$, we have $\|w'\| \le \sqrt{1+\beta}$. Hence, dividing the above by $\|w'\|$, we obtain
$$ \frac{w'\cdot x_i' + b^*}{\|w'\|} \;\ge\; \frac{l^* - \frac{\beta}{2}(1+L^2)}{\sqrt{1+\beta}} \;\ge\; \Big(l^* - \frac{\beta}{2}(1+L^2)\Big)\Big(1 - \frac{\beta}{2}\Big) \;\ge\; l^*\Big(1 - \frac{\beta}{2} - \frac{\beta(1+L^2)}{2l^*}\Big) \;\ge\; l^*\Big(1 - \beta\Big(1 + \frac{1+L^2}{2l^*}\Big)\Big). $$
This holds with probability at least $1 - 4e^{-\beta^2 k/8}$. A similar result can be derived for a point $x_j$ for which $y_j = -1$. The above analysis guarantees that, by projecting onto a $k$-dimensional space, there exists at least one hyperplane $\big(\frac{w'}{\|w'\|}, \frac{b^*}{\|w'\|}\big)$ which guarantees a margin of $l^*(1-\epsilon)$, where
$$ \epsilon = \beta\Big(1 + \frac{1+L^2}{2l^*}\Big), \tag{1} $$
with probability at least $1 - 4ne^{-\beta^2 k/8}$. The margin $l_P$ obtained by solving the problem C-SVM-2b can only be better than this. So the value of $k$ is given by requiring $4ne^{-\beta^2 k/8} \le \delta$, that is,
$$ k \;\ge\; \frac{8}{\beta^2}\log\frac{4n}{\delta} \;=\; \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2l^*}\Big)^2\log\frac{4n}{\delta}. \tag{2} $$

So by randomly projecting the points onto a $k$-dimensional subspace, the margin is preserved with a high probability. This result is similar to the results in large scale learning using random projections [8, 12]. But there are fundamental differences between the method proposed in this paper and the previous methods: no random projection is actually done here, and no black box access to the data distribution is required. We use Theorem 1 to determine an estimate of the number of support vectors such that the margin is preserved with a high probability when the problem is solved in the original space. This is given in Theorem 2 and is the main contribution of this section. The theorem is based on the following fact: in a $k$-dimensional space, the number of support vectors is upper bounded by $k+1$. We show that this $k+1$ can be used as an estimate of the number of support vectors in the original space such that the solution obtained preserves the margin with a high probability.
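To get a feel for the size of this estimate, the following sketch evaluates the bound of Equation (2) for some representative values of $L$, $l^*$, $\epsilon$, $\delta$ and $n$; the values are hypothetical and only meant to show the order of magnitude of $k$ (and hence of the estimate $k+1$ on the number of support vectors).

```python
# Illustrative evaluation of the bound in Equation (2); the parameter values are
# arbitrary and not taken from the paper.
import math

def k_bound(L, l_star, eps, delta, n):
    """k >= (8/eps^2) * (1 + (1+L^2)/(2*l_star))^2 * log(4n/delta)."""
    return (8.0 / eps**2) * (1.0 + (1.0 + L**2) / (2.0 * l_star))**2 * math.log(4 * n / delta)

print(k_bound(L=1.0, l_star=0.5, eps=0.9, delta=0.2, n=1_000_000))  # roughly 1.5e3
print(k_bound(L=1.0, l_star=0.5, eps=0.2, delta=0.2, n=1_000_000))  # roughly 3e4
```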
We start with the following definition.

Definition. An orthogonal extension of a $(k-1)$-dimensional flat (a $(k-1)$-dimensional flat is a $(k-1)$-dimensional affine space) $h_p = (w_p, b)$, where $w_p = (w_1,\ldots,w_k)$, in a subspace $S_k$ of dimension $k$, to a $(d-1)$-dimensional hyperplane $h = (w, b)$ in the $d$-dimensional space, is defined as follows. Let $R \in \mathbb{R}^{d\times d}$ be a random projection matrix as in Lemma 2. Let $\hat{R} \in \mathbb{R}^{d\times k}$ be another random projection matrix which consists of only the first $k$ columns of $R$. Let $\hat{x}_i = R^T x_i$ and $x_i' = \frac{\hat{R}^T x_i}{\sqrt{k}}$. Let $w_p = (w_1,\ldots,w_k)$ be the optimal hyperplane classifier with margin $l_P$ for the points $x_1',\ldots,x_n'$ in the $k$-dimensional subspace. Now define $w$ to be all zeros in the last $d-k$ coordinates and identical to $w_p$ in the first $k$ coordinates, that is, $w = (w_1,\ldots,w_k,0,\ldots,0)$.

Orthogonal extensions have the following key property: if $(w_p, b)$ is a separator with margin $l_p$ for the projected points, then its orthogonal extension $(w, b)$ is a separator with margin $l_p$ for the original points, that is, if $y_i(w_p\cdot x_i' + b) \ge l$, $i = 1,\ldots,n$, then $y_i(w\cdot \hat{x}_i + b) \ge l$, $i = 1,\ldots,n$. An important point to note, which will be required when extending orthogonal extensions to non-linear kernels, is that dot products between the points are preserved, that is, $\hat{x}_i^T \hat{x}_j = x_i^T x_j$.

Let $L$, $l^*$, $\epsilon$, $\delta$ and $n$ be as defined in Theorem 1. The following is the main result of this section.

Theorem 2. Given $k \ge \frac{8}{\epsilon^2}\big(1 + \frac{1+L^2}{2l^*}\big)^2\log\frac{4n}{\delta}$ and $n$ training points with maximum norm $L$ in $d$-dimensional space, separable by a hyperplane with margin $l^*$, there exists a subset of $k'$ training points $x_1,\ldots,x_{k'}$, where $k' \le k$, and a hyperplane $h$ satisfying the following conditions:
1. $h$ has margin at least $l^*(1-\epsilon)$ with probability at least $1-\delta$;
2. $x_1,\ldots,x_{k'}$ are the only training points which lie either on $h_1$ or on $h_2$, the two supporting hyperplanes of $h$.

Proof. Let $(w^*, b^*)$ denote a separating hyperplane with margin $l^*$, that is, $y_i(w^*\cdot x_i + b^*) \ge l^*$ for all $x_i$ and $\|w^*\| = 1$. Consider a random projection of $x_1,\ldots,x_n$ to a $k$-dimensional space, and let $w', z_1,\ldots,z_n$ be the projections of $w^*, x_1,\ldots,x_n$, respectively, scaled by $1/\sqrt{k}$. By Theorem 1, $y_i(w'\cdot z_i + b^*/\|w'\|) \ge l^*(1-\epsilon)$ holds for all $z_i$ with probability at least $1-\delta$. Let $h$ be the orthogonal extension of $(w', b^*/\|w'\|)$ to the full $d$-dimensional space. Then $h$ has margin at least $l^*(1-\epsilon)$, as required. This shows the first part of the claim. To prove the second part, consider the projected training points which lie on either of the two supporting hyperplanes. Barring degeneracies, there are at most $k$ such points. Clearly, these will be the only points which lie on the supporting hyperplanes of the orthogonal extension $h$, by definition.

From the above analysis, it is seen that if $k \ll d$, then we can estimate the number of support vectors to be $k+1$, and the algorithm RandSVM would take on average $O(k\log n)$ iterations to solve the problem [3, 4].
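The key property of orthogonal extensions used above can also be checked numerically: once the normal vector is padded with zeros, the classifier ignores the remaining $d-k$ coordinates, so the margins computed in the subspace carry over unchanged. The following toy sketch (ours; it omits the $1/\sqrt{k}$ scaling of the formal definition) verifies this.

```python
# A toy check of the key property of orthogonal extensions: if the last d-k
# coordinates of w are zero, the functional margins depend only on the first k
# coordinates of each point, so a separator found in the k-dimensional subspace
# carries over unchanged when the points are extended back to d dimensions.
# (The 1/sqrt(k) scaling used in the paper's definition is omitted for simplicity.)
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200, 50, 5
y = rng.choice([-1, 1], size=n)
Z = rng.normal(size=(n, k)) + y[:, None]             # projected points, roughly separable
w_p, b = np.ones(k) / np.sqrt(k), 0.0                # some separator in the k-dim subspace

X_hat = np.hstack([Z, rng.normal(size=(n, d - k))])  # add back d-k extra coordinates
w = np.concatenate([w_p, np.zeros(d - k)])           # orthogonal extension: pad with zeros

margins_low  = y * (Z @ w_p + b)
margins_high = y * (X_hat @ w + b)
assert np.allclose(margins_low, margins_high)        # identical functional margins
```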
2.2 Almost separable data

In this section, we look at how the above analysis can be applied to almost separable data sets. We call a data set almost separable if, by removing a fraction $\eta = O(\frac{\log n}{n})$ of the points, the data set becomes linearly separable.

The C-SVM formulation when the data is not linearly separable (and almost separable) was given in C-SVM-1. This problem can be reformulated as follows:
$$ \underset{(w,b,\xi)}{\text{Minimize}} \quad \sum_{i=1}^{n}\xi_i $$
$$ \text{Subject to: } \quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\ldots,n; \qquad \|w\| \le \frac{1}{\tilde{l}} $$
This formulation is known as the Generalized Optimal Hyperplane formulation. Here $\tilde{l}$ depends on the value of $C$ in the C-formulation. At optimality, the margin is $l^* = 1/\|w^*\| \ge \tilde{l}$. The following theorem proves a result for almost separable data similar to the one proved in Theorem 2 for separable data.

Theorem 3. Given
$$ k \ge \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2\tilde{l}}\Big)^2\log\frac{4n}{\delta} + \eta n, $$
with $l^*$ being the margin at optimality, $\tilde{l}$ the lower bound on the margin as in the Generalized Optimal Hyperplane formulation, and $\eta = O(\frac{\log n}{n})$, there exists a subset of $k'$ training points $x_1,\ldots,x_{k'}$, $k' \le k$, and a hyperplane $h$ satisfying the following conditions:
1. $h$ has margin at least $l^*(1-\epsilon)$ with probability at least $1-\delta$;
2. at most $\frac{8}{\epsilon^2}\big(1 + \frac{1+L^2}{2\tilde{l}}\big)^2\log\frac{4n}{\delta}$ points lie on the supporting hyperplanes $h_1$ or $h_2$;
3. $x_1,\ldots,x_{k'}$ are the only points which define the hyperplane $h$, that is, they are the support vectors of $h$.

Proof. Let the optimal solution of the Generalized Optimal Hyperplane formulation be $(w^*, b^*, \xi^*)$. As mentioned before, $w^* = \sum_{i:\alpha_i>0}\alpha_i y_i x_i$ and $l^* = 1/\|w^*\|$. The set of support vectors can be split into two disjoint sets, $SV_1 = \{x_i : \alpha_i > 0 \text{ and } \xi_i = 0\}$ (unbounded SVs) and $SV_2 = \{x_i : \alpha_i > 0 \text{ and } \xi_i > 0\}$ (bounded SVs).

Now, consider removing the points in $SV_2$ from the data set. Then the data set becomes linearly separable with margin $l^*$. Using an analysis similar to Theorem 1, and the fact that $l^* \ge \tilde{l}$, we have the proof of the first two conditions. When all the points in $SV_2$ are added back, at most all of these points are added to the set of support vectors and the margin does not change; this is guaranteed by the fact that we have assumed the worst possible margin when proving conditions 1 and 2, and any value lower than this would violate the constraints of the problem. This proves condition 3. Hence the number of support vectors, such that the margin is preserved with high probability, is
$$ k + 1 = \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2\tilde{l}}\Big)^2\log\frac{4n}{\delta} + \eta n + 1 = \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2\tilde{l}}\Big)^2\log\frac{4n}{\delta} + O(\log n). \tag{3} $$

Using a non-linear kernel. Consider a mapping function $\Phi : \mathbb{R}^d \to \mathbb{R}^{d'}$, $d' > d$, which maps a point $x_i \in \mathbb{R}^d$ to a point $z_i \in \mathbb{R}^{d'}$, where $\mathbb{R}^{d'}$ is a Euclidean space. Let the points $z_1,\ldots,z_n$ be projected onto a random $k$-dimensional subspace as before. The lemmas in the appendix are applicable to these random projections [12]. The orthogonal extension can be considered as a projection from the $k$-dimensional space to the $\Phi$-space such that the kernel function values are preserved. Then it can be shown that Theorem 3 applies when using non-linear kernels also.

2.3 A Randomized Algorithm

The reduction in the sample size from $6d^2$ to $6k^2$ is not enough to make RandSVM useful in practice, as $6k^2$ is still a large number. This section presents another randomized algorithm which only requires that the sample size be greater than the number of support vectors; hence a sample size linear in $k$ can be used in the algorithm. This algorithm was first proposed to solve large scale LP problems [11]; it has been adapted here for solving large scale SVM problems.

Algorithm 1 RandSVM-1(D, k, r)
Require: D - the data set.
Require: k - the estimate of the number of support vectors.
Require: r - sample size, r = ck, c > 0.
1: S = randomsubset(D, r);                     // pick a random subset S of size r from the data set D
2: SV = svmlearn({}, S);                       // SV - set of support vectors obtained by solving the problem on S
3: V = {x ∈ D − S | violates(x, SV)};          // violator - a nonsampled point not satisfying the KKT conditions
4: while (|V| > 0 and |SV| < k) do
5:    R = randomsubset(V, r − |SV|);           // pick a random subset from the set of violators
6:    SV' = svmlearn(SV, R);                   // SV' - set of support vectors obtained by solving the problem on SV ∪ R
7:    SV = SV';
8:    V = {x ∈ D − (SV ∪ R) | violates(x, SV)};  // determine violators from the nonsampled set
9: end while
10: return SV
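The following is a minimal Python sketch of Algorithm 1, using scikit-learn's SVC with a linear kernel as the inner solver svmlearn, and the simplified violation test $y_i f(x_i) < 1$ in place of a full KKT check; the helper names, defaults and the assumption that labels are in $\{-1,+1\}$ are ours, not the paper's.

```python
# A minimal sketch of Algorithm 1 (RandSVM-1).  Labels are assumed to be +-1 and
# the violation test is simplified to the margin condition y_i f(x_i) < 1.
import numpy as np
from sklearn.svm import SVC

def randsvm1(X, y, k, c=3, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    r = c * k                                           # sample size, r = ck
    S = rng.choice(n, size=min(r, n), replace=False)    # step 1: random subset S
    clf = SVC(kernel="linear", C=C).fit(X[S], y[S])     # step 2: solve on S
    SV = S[clf.support_]                                # global indices of support vectors
    while True:
        rest = np.setdiff1d(np.arange(n), SV)
        f = clf.decision_function(X[rest])
        V = rest[y[rest] * f < 1]                       # steps 3/8: violators among non-SVs
        if len(V) == 0 or len(SV) >= k:                 # step 4: stopping conditions
            return clf, SV
        R = rng.choice(V, size=min(r - len(SV), len(V)), replace=False)  # step 5
        work = np.concatenate([SV, R])                  # step 6: solve on SV u R
        clf = SVC(kernel="linear", C=C).fit(X[work], y[work])
        SV = work[clf.support_]                         # step 7: new support vector set
```

In practice the inner solver, the constant c and the violation test would be chosen to match the kernel and SVM package actually used.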
Proof of Convergence: Let $SV$ be the current set of support vectors. The condition $|SV| < k$ comes from Theorem 3; hence, if this condition is violated, the algorithm terminates with a solution which is near optimal with a very high probability. Now consider the case where $|SV| < k$ and $|V| > 0$. Let $x_i$ be a violator ($x_i$ is a non-sampled point such that $y_i(w^T x_i + b) < 1$). Solving the problem with the set of constraints $SV \cup \{x_i\}$ will only result, since SVM is an instance of an AOP, in an increase (decrease) of the objective function of the primal (dual). As there are only a finite number of bases for an AOP, the algorithm is bound to terminate; also, if termination happens with the number of violators equal to zero, then the solution obtained is optimal.

Determination of k: The value of $k$ depends on the margin $\tilde{l}$, which is not available in the case of C-SVM. This can be handled only by solving for $k$ as a function of $\beta$, where $\beta$ is the distortion parameter defined in the appendix and in the proof of Theorem 1. This can be done by combining Equation (1) with Equation (3):
$$ k \;\ge\; \frac{8}{\epsilon^2}\Big(1 + \frac{1+L^2}{2\tilde{l}}\Big)^2\log\frac{4n}{\delta} + O(\log n) \;=\; \frac{8}{\beta^2}\log\frac{4n}{\delta} + O(\log n) \;\le\; \frac{16}{\beta^2}\log\frac{4n}{\delta}. \tag{4} $$

3 Experiments

This section discusses the performance of RandSVM in practice. The experiments were performed on 4 data sets: 3 synthetic and 1 real world. RandSVM was used with LibSVM as the solver when using a non-linear kernel, and with SVMLight for a linear kernel. RandSVM has been compared with state of the art SVM solvers: LibSVM for non-linear kernels, and SVMPerf (http://svmlight.joachims.org/) and SVMLin (http://people.cs.uchicago.edu/vikass/svmlin.html) for linear kernels.

Synthetic data sets. The twonorm data set is a 2-class problem where each class is drawn from a multivariate normal distribution with unit variance. Each vector is 20-dimensional. One class has mean $(a, a, \ldots, a)$ and the other class has mean $(-a, -a, \ldots, -a)$, where $a = 2/\sqrt{20}$.

The ringnorm data set is a 2-class problem with each vector consisting of 20 dimensions. Each class is drawn from a multivariate normal distribution. One class has mean 1 and covariance 4 times the identity. The other class has mean $(a, a, \ldots, a)$ and unit covariance, where $a = 2/\sqrt{20}$.

The checkerboard data set consists of vectors in a 2-dimensional space. The points are generated on a $4\times 4$ grid. Both classes are generated from a multivariate uniform distribution; each point is $(x_1 = U(0,4),\, x_2 = U(0,4))$. The points are labeled as follows: if $x_1\,\%\,2 = x_2\,\%\,2$, then the point is labeled negative, else the point is labeled positive.

For each of the synthetic data sets, a training set of 1,000,000 points and a test set of 10,000 points was generated. A smaller subset of 100,000 points was chosen from the training set for parameter tuning. From now on, the smaller training set will carry a subscript of 1 and the larger training set a subscript of 2; for example, ringnorm_1 and ringnorm_2.
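For reference, the following sketch generates the three synthetic data sets following the descriptions above; the random seeds, the assignment of labels to the two ringnorm classes, and the use of the integer parts of the coordinates in the checkerboard labelling are our own choices, not taken from the paper.

```python
# Sketch of generators for the three synthetic data sets described above.
import numpy as np

def twonorm(n, d=20, seed=0):
    rng = np.random.default_rng(seed)
    a = 2.0 / np.sqrt(d)
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, d)) + a * y[:, None]       # means (+-a, ..., +-a), unit variance
    return X, y

def ringnorm(n, d=20, seed=0):
    rng = np.random.default_rng(seed)
    a = 2.0 / np.sqrt(d)
    y = rng.choice([-1, 1], size=n)
    X = np.where(y[:, None] == 1,
                 2.0 * rng.normal(size=(n, d)) + 1.0,  # mean 1, covariance 4*I (as described)
                 rng.normal(size=(n, d)) + a)          # mean (a, ..., a), unit covariance
    return X, y

def checkerboard(n, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 4.0, size=(n, 2))             # uniform points on a 4 x 4 grid
    same = np.floor(X[:, 0]) % 2 == np.floor(X[:, 1]) % 2
    y = np.where(same, -1, 1)                          # same parity -> negative label
    return X, y
```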
Real world data set. The RCV1 data set consists of 804,414 documents, with each document consisting of 47,236 features. Experiments were performed using 2 categories of the data set, CCAT and C11. The data set was split into a training set of 700,000 documents and a test set of 104,414 documents.

| Category       | Kernel   | RandSVM        | LibSVM           | SVMPerf       | SVMLin        |
|----------------|----------|----------------|------------------|---------------|---------------|
| twonorm_1      | Gaussian | 300 (94.98%)   | 8542 (96.48%)    | X             | X             |
| twonorm_2      | Gaussian | 437 (94.71%)   | -                | X             | X             |
| ringnorm_1     | Gaussian | 2637 (70.66%)  | 256 (70.31%)     | X             | X             |
| ringnorm_2     | Gaussian | 4982 (65.74%)  | 85124 (65.34%)   | X             | X             |
| checkerboard_1 | Gaussian | 406 (93.70%)   | 1568.93 (96.90%) | X             | X             |
| checkerboard_2 | Gaussian | 814 (94.10%)   | -                | X             | X             |
| CCAT           | Linear   | 345* (94.37%)  | X                | 148 (94.38%)  | 429 (95.19%)  |
| C11            | Linear   | 449* (96.57%)  | X                | 120 (97.53%)  | 295 (97.71%)  |

Table 1: Execution time (in seconds) and classification accuracy (in brackets). The subscripts 1 and 2 indicate that the corresponding training set sizes are 10^5 and 10^6 respectively. A '-' indicates that the solver did not finish execution even after running for a day. An 'X' indicates that the experiment is not applicable for the corresponding solver. A '*' indicates that the solver used with RandSVM was SVMLight; otherwise it was LibSVM.

Table 1 shows the kernels which were used for each of the data sets. The parameters used (the Gaussian width parameter $\gamma$ and $C$ for Gaussian kernels, and $C$ for linear kernels) were obtained by tuning using grid search.

Selection of k for RandSVM: The values of $\beta$ and $\delta$ were fixed to 0.2 and 0.9 respectively, for all the data sets. For linearly separable data sets, $k$ was set to $16\log(4n/\delta)/\beta^2$. For the others, $k$ was set to $32\log(4n/\delta)/\beta^2$.

Discussion of results: Table 1, which gives the timing and classification accuracy comparisons, shows that RandSVM can scale up SVM solvers for very large data sets. Using just a small wrapper around the solvers, RandSVM has scaled up SVMLight so that its performance is comparable to that of state of the art solvers such as SVMPerf and SVMLin. Similarly, LibSVM has been made capable of quickly solving problems which it could not solve before, even after executing for a day. In the case of the ringnorm_1 data set, the time taken by LibSVM is very small; hence not much advantage is gained by solving smaller sub-problems, and this, combined with the overheads involved in RandSVM, resulted in slower execution. Hence RandSVM may not always be suited to small data sets. It is clear from the experiments on the synthetic data sets that the execution times taken by RandSVM for training with 10^5 examples and with 10^6 examples are not too far apart; this is a clear indication that the algorithm scales well with the increase in the training set size. All the runs of RandSVM except ringnorm_1 terminated with the condition |SV| < k being violated. Since the classification accuracies obtained by RandSVM and the baseline solvers are very close, it is clear that Theorem 3 holds in practice.

4 Conclusion

It is clear from the experimental evaluations that randomized algorithms can be used to scale up SVM solvers to large scale classification problems. If an estimate of the number of support vectors is available, then algorithm RandSVM-1 can be used for other SVM learning problems also, as they are usually instances of an AOP. Future work will be to apply the ideas developed here to such problems.

A Some Results from Random Projections

Here we review a few lemmas from random projections [8]. The following lemma discusses how the L2 norm of a vector is preserved when it is projected onto a random subspace.

Lemma 1. Let $R = (r_{ij})$ be a random $d\times k$ matrix, such that each entry $r_{ij}$ is chosen independently according to $N(0,1)$. For any fixed vector $u \in \mathbb{R}^d$ and any $\epsilon > 0$, let $u' = \frac{R^T u}{\sqrt{k}}$. Then $E[\|u'\|^2] = \|u\|^2$ and the following bound holds:
$$ P\big((1-\epsilon)\|u\|^2 \le \|u'\|^2 \le (1+\epsilon)\|u\|^2\big) \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}. $$

The following lemma and its corollary show the change in the Euclidean distance between two points, and in their dot product, when they are projected onto a lower dimensional space [8].

Lemma 2. Let $u, v \in \mathbb{R}^d$. Let $u' = \frac{R^T u}{\sqrt{k}}$ and $v' = \frac{R^T v}{\sqrt{k}}$ be the projections of $u$ and $v$ to $\mathbb{R}^k$ via a random matrix $R$ whose entries are chosen independently from $N(0,1)$ or $U(-1,1)$. Then for any $\epsilon > 0$ the following bounds hold:
$$ P\big((1-\epsilon)\|u-v\|^2 \le \|u'-v'\|^2\big) \ge 1 - e^{-(\epsilon^2-\epsilon^3)k/4} $$
and
$$ P\big(\|u'-v'\|^2 \le (1+\epsilon)\|u-v\|^2\big) \ge 1 - e^{-(\epsilon^2-\epsilon^3)k/4}. $$
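Before stating the corollary on dot products, the following small experiment (ours, purely illustrative) checks the concentration of Lemma 1 empirically for a Gaussian random matrix $R$.

```python
# Illustrative empirical check of Lemma 1 (not part of the paper): for a fixed u
# and u' = R^T u / sqrt(k) with R having i.i.d. N(0,1) entries, ||u'||^2 lies
# within a (1 +/- eps) factor of ||u||^2 with high probability.
import numpy as np

rng = np.random.default_rng(3)
d, k, eps, trials = 500, 300, 0.25, 500
u = rng.normal(size=d)

ratios = np.empty(trials)
for t in range(trials):
    R = rng.normal(size=(d, k))
    u_proj = R.T @ u / np.sqrt(k)
    ratios[t] = np.dot(u_proj, u_proj) / np.dot(u, u)

inside = np.mean((ratios >= 1 - eps) & (ratios <= 1 + eps))
bound = 1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(f"empirical fraction within (1 +/- {eps}): {inside:.3f}, lemma bound: {bound:.3f}")
```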
Corollary 1. Let $u, v$ be vectors in $\mathbb{R}^d$ such that $\|u\| \le L_1$ and $\|v\| \le L_2$. Let $R$ be a random matrix whose entries are chosen independently from either $N(0,1)$ or $U(-1,1)$. Define $u' = \frac{R^T u}{\sqrt{k}}$ and $v' = \frac{R^T v}{\sqrt{k}}$. Then for any $\epsilon > 0$ the following holds:
$$ P\Big(u\cdot v - \frac{\epsilon}{2}(L_1^2 + L_2^2) \le u'\cdot v' \le u\cdot v + \frac{\epsilon}{2}(L_1^2 + L_2^2)\Big) \ge 1 - 4e^{-\epsilon^2 k/8}. $$

References

[1] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[2] Bernd Gärtner. A subexponential algorithm for abstract optimization problems. In Proceedings of the 33rd Symposium on Foundations of Computer Science. IEEE CS Press, 1992.
[3] José L. Balcázar, Yang Dai, and Osamu Watanabe. A random sampling technique for training support vector machines. In ALT. Springer, 2001.
[4] José L. Balcázar, Yang Dai, and Osamu Watanabe. Provably fast training algorithms for support vector machines. In ICDM, pages 43-50, 2001.
[5] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In Proceedings of the International Conference on Machine Learning, 2000.
[6] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 1984.
[7] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report TR-99-006, Berkeley, CA, 1999.
[8] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proceedings of the 40th Foundations of Computer Science, 1999.
[9] Kenneth L. Clarkson. Las Vegas algorithms for linear and integer programming when the dimension is small. Journal of the ACM, 42(2):488-499, 1995.
[10] B. Gärtner and E. Welzl. A simple sampling lemma: analysis and application in geometric optimization. In Proceedings of the 16th Annual ACM Symposium on Computational Geometry, 2000.
[11] M. Pellegrini. Randomizing combinatorial algorithms for linear programming when the dimension is moderately high. In SODA '01: Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, 2001.
[12] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. On kernels, margins and low-dimensional mappings. In Proceedings of the 15th Conference on Algorithmic Learning Theory, 2004.