Multi-Class Pegasos on a Budget

Zhuang Wang (ZHUANG@TEMPLE.EDU)
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122 USA

Koby Crammer (KOBY@EE.TECHNION.AC.IL)
Department of Electronic Engineering, The Technion, Haifa, 32000 Israel

Slobodan Vucetic (VUCETIC@IST.TEMPLE.EDU)
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122 USA

Abstract

When equipped with kernel functions, online learning algorithms are susceptible to the "curse of kernelization" that causes unbounded growth in the model size. To address this issue, we present a family of budgeted online learning algorithms for multi-class classification which have constant space and time complexity per update. Our approach is based on the multi-class version of the popular Pegasos algorithm. It keeps the number of support vectors bounded during learning through budget maintenance. By treating budget maintenance as a source of gradient error, we prove that the gap between the budgeted Pegasos and the optimal solution directly depends on the average model degradation due to budget maintenance. To minimize the model degradation, we study greedy multi-class budget maintenance methods based on removal, projection, and merging of support vectors. Empirical results show that the proposed budgeted online algorithms achieve accuracy comparable to the non-budgeted multi-class kernelized Pegasos while being extremely computationally efficient.

----------
Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).
1. Introduction

Online learning is an important framework in which a potentially unlimited sequence of training examples is presented one example at a time and can only be seen in a single pass. This is opposed to offline learning, where the whole collection of training examples is at hand. The objective is to learn an accurate prediction model from the training stream. Online algorithms typically update the prediction model they maintain upon observing a new training example. Considering the potentially high stream rates, it becomes very important to update the model in a computationally efficient manner.

The invention of Support Vector Machines (SVMs) (Cortes & Vapnik, 1995) inspired a lot of interest in applying kernel methods to online learning. A large number of online algorithms (e.g., the perceptron by Rosenblatt (1958)) can be easily kernelized, resulting in prediction models that require storage of a subset of the observed examples, called the Support Vectors (SVs). While kernelization allows solving highly nonlinear problems, it also introduces a heavy computational burden. The main reason is that on noisy data the number of support vectors tends to grow without limit as the algorithm progresses. In addition to the danger of exceeding the physical memory, this also implies an unlimited growth in update and prediction time. We call this phenomenon the curse of kernelization. To solve the problem, budgeted online algorithms have been proposed to bound the number of SVs through budget maintenance. In practice, the assigned budget depends on the specific application requirements (e.g., memory limitations, processing speed, or data throughput).

In this paper, we address the problem of online multi-class classification on a budget. The basis of our work is the popular SVM training algorithm Pegasos (Shalev-Shwartz et al., 2007). Pegasos runs iteratively and alternates between a stochastic sub-gradient descent step and a projection step. If only a single example is used in the stochastic sub-gradient descent step, the algorithm can be naturally used for online learning. It was shown that Pegasos converges toward the optimal SVM solution as the number of examples grows. Although in its original, non-kernelized form it has constant update time and constant space, when combined with kernels Pegasos suffers from the same computational problems as other kernelized online algorithms.

The main contributions of this paper are as follows. First, we develop a multi-class version of Pegasos based on the multi-class SVM formulation by Crammer & Singer (2001). The resulting multi-class Pegasos has a similar algorithmic structure to its binary version. The second contribution is a budgeted version of the kernelized multi-class Pegasos that has constant update time and constant space. This is achieved by controlling the number of support vectors with one of several budget maintenance strategies. We analyze the properties of the budgeted multi-class Pegasos in the framework of online convex learning with errors in the gradients proposed by Sutskever (2009). We show that, in the limit, the gap between the loss of the budgeted algorithm and the loss of the optimal solution is upper-bounded by the average model degradation induced by budget maintenance. In the absence of budget maintenance, the multi-class Pegasos inherits the convergence properties of its binary predecessor.

Having shown that the quality of the budgeted multi-class Pegasos directly depends on the quality of budget maintenance, our final contribution is exploring computationally efficient methods to maintain a classifier with a low budget. This problem has been the subject of much recent work. The most popular strategy consists of removing a single support vector when the budget is exceeded. For example, Crammer et al. (2004) proposed to learn budgeted perceptrons by removing the SV that would be predicted with the largest confidence after its own removal. Other ideas include removal of the oldest SV (Dekel et al., 2008), of a random SV (Cesa-Bianchi & Gentile, 2006; Vucetic et al., 2009), of the SV with the smallest coefficient (Cheng et al., 2007), or removal guided by a validation data set (Weston et al., 2005; Wang & Vucetic, 2009). Recently studied alternatives to removal are projecting an SV prior to its removal (Orabona et al., 2009) and merging two SVs into a new SV (Nguyen & Ho, 2005; Wang & Vucetic, 2009). Instead of considering budget maintenance and model update as separate steps, Wang & Vucetic (2010) proposed to define them as a joint optimization problem. It is worth mentioning that much work has been done on the related problem of reduction of trained kernel machines (Schölkopf et al., 1999).

In this work we consider three major budget maintenance strategies: removal, projection, and merging. In the case of removal, we show that it is optimal to remove the smallest support vector. Then, we show that optimal projection of one SV onto the remaining ones is achieved by minimizing the accumulated loss of multiple sub-problems, one for each class, which extends the results of Csató & Opper (2001), Engel et al. (2002), and Orabona et al. (2009) to the multi-class setting. In the case of merging, when the Gaussian kernel is used, we show that the new SV always lies on the line connecting the two merged SVs, which generalizes the result of Nguyen & Ho (2005) to the multi-class setting.
Both the space and the update time of the budgeted Pegasos scale quadratically with the budget size when projection is used, and linearly when merging or removal is used.

2. Budgeted Multi-class Pegasos (BPegasosM)

In this paper we focus on the problem of multi-class online learning on a budget. We are given a sequence of examples $S = \{(\mathbf{x}_t, y_t),\ t = 1, \ldots, N\}$, where instance $\mathbf{x}_t \in R^d$ is a $d$-dimensional input vector and the label $y_t$ belongs to the multi-class set $Y = \{1, \ldots, c\}$. In the online learning setting, examples arrive sequentially. The objective of an online algorithm is to incrementally learn a function $f(\mathbf{x})$ that accurately maps instances from $R^d$ to one of the possible labels in $Y$.

2.1 Multi-class Pegasos

Our algorithm is an extension of the recently proposed SVM training algorithm called Pegasos (Shalev-Shwartz et al., 2007). Pegasos is an iterative algorithm which alternates between stochastic sub-gradient descent steps and projection steps. Pegasos can be naturally used as an online learning algorithm when only a single example is used in the calculation of the stochastic sub-gradient. To develop the multi-class Pegasos, we consider the multi-class SVM formulation by Crammer & Singer (2001). Let us define the multi-class model $f(\mathbf{x})$ as

  $f(\mathbf{x}) = \arg\max_{i \in Y} f^{(i)}(\mathbf{x}) = \arg\max_{i \in Y} (\mathbf{w}^{(i)})^T \mathbf{x}$,

where $f^{(i)}$ is the prototype of the $i$-th class and $\mathbf{w}^{(i)}$ is its corresponding weight vector. By stacking all class-specific weight vectors, we construct $\mathbf{w} = [\mathbf{w}^{(1)} \ldots \mathbf{w}^{(c)}]$ as the $d \times c$ weight matrix of $f(\mathbf{x})$. The predicted label of $\mathbf{x}$ is the class whose weight vector achieves the maximum value $(\mathbf{w}^{(i)})^T \mathbf{x}$. Under this setup, Crammer and Singer (2001) defined the multi-class SVM objective function as

  $P(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{N}\sum_{t=1}^{N} l(\mathbf{w}; (\mathbf{x}_t, y_t))$,   (1)

where $\lambda$ is the slack parameter, the norm of the weight matrix $\mathbf{w}$ is calculated as $\|\mathbf{w}\|^2 = \sum_{i \in Y} \|\mathbf{w}^{(i)}\|^2$, and the multi-class hinge loss is defined as

  $l(\mathbf{w}; (\mathbf{x}_t, y_t)) = \max\left(0,\ 1 + f^{(r_t)}(\mathbf{x}_t) - f^{(y_t)}(\mathbf{x}_t)\right)$, where $r_t = \arg\max_{i \in Y,\, i \neq y_t} f^{(i)}(\mathbf{x}_t)$.   (2)

In multi-class Pegasos, instead of the objective function (1), we use the instantaneous objective function $P_t(\mathbf{w})$ upon receiving the $t$-th example,

  $P_t(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + l(\mathbf{w}; (\mathbf{x}_t, y_t))$.   (3)

Similarly to its binary predecessor, the multi-class Pegasos works by iteratively executing a two-step update. At the $t$-th round, it first updates the current weight $\mathbf{w}_t$ using the sub-gradient $\nabla_t$ of (3) as $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla_t$, where $\eta_t = 1/(\lambda t)$ is the learning rate. The sub-gradient matrix $\nabla_t$ is defined as $\nabla_t = [\nabla_t^{(1)} \ldots \nabla_t^{(c)}]$, where $\nabla_t^{(i)} = \partial_{\mathbf{w}^{(i)}} P_t(\mathbf{w})$ is a column vector. If loss (2) is equal to zero, then $\nabla_t^{(i)} = \lambda \mathbf{w}_t^{(i)}$. If loss (2) is above zero, then

  $\nabla_t^{(i)} = \begin{cases} \lambda \mathbf{w}_t^{(i)} - \mathbf{x}_t, & \text{if } i = y_t \\ \lambda \mathbf{w}_t^{(i)} + \mathbf{x}_t, & \text{if } i = r_t \\ \lambda \mathbf{w}_t^{(i)}, & \text{otherwise.} \end{cases}$

This step is followed by projecting the weight $\mathbf{w}_{t+1}$ onto the closed convex set $C = \{\mathbf{w} : \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}$. The above two steps are summarized as $\mathbf{w}_{t+1} = \Pi_C(\mathbf{w}_t - \eta_t \nabla_t)$, where $\Pi_C(\mathbf{u})$ denotes the closest point in $C$ to $\mathbf{u}$. We can rewrite the update rule of the multi-class Pegasos as

  $\mathbf{w}_{t+1} = \tau_t (\mathbf{w}_t - \eta_t \nabla_t) = \tau_t \left((1 - \eta_t \lambda)\, \mathbf{w}_t + \mathbf{x}_t \beta_t\right)$,

where $\tau_t = \min\{1,\ 1/(\sqrt{\lambda}\, \|\mathbf{w}_t - \eta_t \nabla_t\|)\}$ and $\beta_t$ is a row vector, $\beta_t = [\beta_t^{(1)} \ldots \beta_t^{(c)}]$. If loss (2) is equal to zero, then $\beta_t = \mathbf{0}$; otherwise,

  $\beta_t^{(i)} = \begin{cases} \eta_t, & \text{if } i = y_t \\ -\eta_t, & \text{if } i = r_t \\ 0, & \text{otherwise.} \end{cases}$
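To make the update above concrete, here is a minimal sketch of one round of the linear (non-kernelized) multi-class Pegasos update in Python/NumPy. It is our illustration rather than the authors' MATLAB implementation; the array layout (classes as rows) and all names are our own.

```python
import numpy as np

def multiclass_pegasos_step(W, x, y, lam, t):
    """One round of linear multi-class Pegasos (Section 2.1).

    W   : (c, d) weight matrix; row i plays the role of w^(i).
    x   : (d,) input vector.
    y   : true label in {0, ..., c-1}.
    lam : slack parameter lambda.
    t   : 1-based round counter, so the learning rate is eta_t = 1/(lam*t).
    """
    eta = 1.0 / (lam * t)
    scores = W @ x                              # f^(i)(x) = (w^(i))^T x
    runner_up = scores.copy()
    runner_up[y] = -np.inf
    r = int(np.argmax(runner_up))               # r_t in the hinge loss (2)
    loss = max(0.0, 1.0 + scores[r] - scores[y])

    W_new = (1.0 - eta * lam) * W               # gradient step on the regularizer
    if loss > 0:                                # beta_t from the combined update rule
        W_new[y] += eta * x                     # beta_t^(y_t) = +eta_t
        W_new[r] -= eta * x                     # beta_t^(r_t) = -eta_t
    # projection onto C = {w : ||w|| <= 1/sqrt(lam)}
    norm = np.linalg.norm(W_new)
    tau = min(1.0, 1.0 / (np.sqrt(lam) * norm)) if norm > 0 else 1.0
    return tau * W_new
```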
2.2 Kernelization

Multi-class Pegasos can learn non-linear problems when the kernel trick is used. After introducing a nonlinear function $\Phi$ that maps $\mathbf{x}$ from the input space to the feature space and replacing $\mathbf{x}$ with $\Phi(\mathbf{x})$, $\mathbf{w}_t^{(i)}$ can be written as

  $\mathbf{w}_t^{(i)} = \sum_{j=1}^{t} \alpha_j^{(i)} \Phi(\mathbf{x}_j)$, where $\alpha_j^{(i)} = \beta_j^{(i)} \prod_{k=j}^{t} \tau_k \prod_{k=j+1}^{t} (1 - \eta_k \lambda)$.   (4)

We denote the row vector $\alpha_j = [\alpha_j^{(1)} \ldots \alpha_j^{(c)}]$ as the $c$ class-specific coefficients of the $j$-th SV. From (4) it can be seen that an example whose loss (2) is zero has all zero coefficients and can therefore be ignored. An input example with positive loss has one positive and one negative coefficient, while the remaining coefficients are zero. We call such examples Support Vectors (SVs). We can represent $f_t^{(i)}(\mathbf{x})$ as the kernel expansion

  $f_t^{(i)}(\mathbf{x}) = (\mathbf{w}_t^{(i)})^T \Phi(\mathbf{x}) = \sum_{j \in I_t} \alpha_j^{(i)} k(\mathbf{x}_j, \mathbf{x})$,

where $k$ is the Mercer kernel induced by $\Phi$ and $I_t$ is the set of indexes of all the SVs of $\mathbf{w}_t$. From now on, we will use the $\mathbf{w}$ and $\alpha$ representations of a model interchangeably.

2.3 Budgeted Multi-Class Pegasos

To maintain a fixed number of SVs, the algorithm executes a budget maintenance step whenever the number of SVs exceeds a pre-defined budget $B$ (i.e., $|I_{t+1}| > B$), reducing the size of $I_{t+1}$ by one such that $\mathbf{w}_{t+1}$ is spanned by only $B$ SVs. We summarize the proposed Budgeted Multi-class Pegasos (BPegasosM) in Algorithm 1. We denote the weight degradation caused by the budget maintenance step at the $t$-th round as $\Delta\mathbf{w}_t$, defined as the difference between the model weights before and after Line 7 of Algorithm 1. The budget maintenance step (Line 7) is a critical design issue. All budget maintenance strategies mentioned in the Introduction are compatible with Algorithm 1 and can be implemented. We describe several budget maintenance strategies for BPegasosM in Section 4. In the next section we theoretically analyze how the budget maintenance step influences the performance of BPegasosM.

Algorithm 1: Budgeted multi-class Pegasos (BPegasosM)
Input: data $S$, kernel $k$, slack parameter $\lambda$, budget $B$
Initialize: $b = 0$, $\mathbf{w}_1 = 0$
1.  for $t = 1, 2, \ldots$
2.    receive $(\mathbf{x}_t, y_t)$;
3.    $\mathbf{w}_{t+1} \leftarrow (1 - \eta_t \lambda)\, \mathbf{w}_t$;
4.    if $l(\mathbf{w}_t; (\mathbf{x}_t, y_t)) > 0$
5.      $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_{t+1} + \Phi(\mathbf{x}_t)\beta_t$;  $b \leftarrow b + 1$;   // add an SV
6.      if $b > B$
7.        $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_{t+1} - \Delta\mathbf{w}_t$;  $b \leftarrow b - 1$;   // budget maintenance
8.    $\mathbf{w}_{t+1} \leftarrow \tau_t\, \mathbf{w}_{t+1}$;
Output: $f_{t+1}(\mathbf{x})$
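The following Python sketch shows how Algorithm 1 can be realized in the kernel ($\alpha$) representation of Section 2.2, with the simplest budget maintenance rule of Section 4.1 (dropping the SV with the smallest $\|\alpha_j\|^2 k(\mathbf{x}_j, \mathbf{x}_j)$) plugged in at Line 7. It is a compact illustration under our own naming and interface conventions, not the authors' code; in particular, the $O(B^2)$ recomputation of $\|\mathbf{w}\|$ for the scaling step is deliberately naive for readability.

```python
import numpy as np

class BPegasosMSketch:
    """Kernelized BPegasosM (Algorithm 1) with removal-based budget maintenance."""

    def __init__(self, kernel, lam, budget, n_classes):
        self.kernel, self.lam, self.B, self.c = kernel, lam, budget, n_classes
        self.sv_x, self.sv_alpha = [], []   # support vectors and their (c,) coefficient rows
        self.t = 0

    def scores(self, x):
        """f^(i)(x) = sum_j alpha_j^(i) k(x_j, x) for every class i."""
        f = np.zeros(self.c)
        for xj, aj in zip(self.sv_x, self.sv_alpha):
            f += aj * self.kernel(xj, x)
        return f

    def predict(self, x):
        return int(np.argmax(self.scores(x)))

    def partial_fit(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)
        f = self.scores(x)                                   # uses w_t, before any update
        runner_up = f.copy(); runner_up[y] = -np.inf
        r = int(np.argmax(runner_up))
        loss = max(0.0, 1.0 + f[r] - f[y])

        decay = 1.0 - eta * self.lam                         # Line 3
        self.sv_alpha = [a * decay for a in self.sv_alpha]
        if loss > 0:                                         # Lines 4-7
            beta = np.zeros(self.c); beta[y], beta[r] = eta, -eta
            self.sv_x.append(np.asarray(x, dtype=float))
            self.sv_alpha.append(beta)
            if len(self.sv_x) > self.B:                      # Line 7: budget maintenance
                norms = [a @ a * self.kernel(xj, xj)
                         for xj, a in zip(self.sv_x, self.sv_alpha)]
                p = int(np.argmin(norms))
                del self.sv_x[p]; del self.sv_alpha[p]

        # Line 8: scale by tau_t so that ||w_{t+1}|| <= 1/sqrt(lam)
        w_norm_sq = 0.0
        for xi, ai in zip(self.sv_x, self.sv_alpha):
            for xj, aj in zip(self.sv_x, self.sv_alpha):
                w_norm_sq += (ai @ aj) * self.kernel(xi, xj)
        tau = min(1.0, 1.0 / (np.sqrt(self.lam) * np.sqrt(max(w_norm_sq, 1e-12))))
        self.sv_alpha = [a * tau for a in self.sv_alpha]
```

A Gaussian kernel can be supplied as, e.g., `kernel = lambda a, b: np.exp(-sigma * np.sum((a - b) ** 2))`, where `sigma` is a kernel width chosen by the user.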
3. Analysis

We now analyze the convergence properties of BPegasosM under the framework of online convex programming with errors in the gradients by Sutskever (2009), which is a variant of the classical online convex programming framework by Zinkevich (2003). We start by showing that the averaged instantaneous objective of Algorithm 1 converges toward the optimal solution and that the gap between the two is upper-bounded by the average gradient error (i.e., the weighted average weight degradation) induced by budget maintenance. We first introduce the following lemma, which generalizes a result of Shalev-Shwartz et al. (2007). Without loss of generality we assume that $\|\Phi(\mathbf{x})\| \le 1$.

Lemma 1. The optimal solution of the multi-class SVM problem (1), $\mathbf{w}^* = \arg\min_{\mathbf{w}} P(\mathbf{w})$, is in a closed convex set with radius $1/\sqrt{\lambda}$.

Proof: The dual of the multi-class SVM problem (1) (Crammer & Singer, 2001) is to maximize

  $-\frac{1}{2\lambda} \sum_{t=1}^{N} \sum_{j=1}^{N} k(\mathbf{x}_t, \mathbf{x}_j) \sum_{i \in Y} \big(\delta_{y_j}^{(i)} - \alpha_j^{(i)}\big)\big(\delta_{y_t}^{(i)} - \alpha_t^{(i)}\big) \;-\; \sum_{j=1}^{N} \sum_{i \in Y} \alpha_j^{(i)} \delta_{y_j}^{(i)} \;+\; 1$
  s.t. $\forall j \in \{1, \ldots, N\},\ \forall i \in Y:\ \alpha_j^{(i)} \ge 0$ and $\sum_{i \in Y} \alpha_j^{(i)} = 1/N$,   (5)

where $\delta_{y_j}^{(i)} = 1$ if $i = y_j$ and $\delta_{y_j}^{(i)} = 0$ otherwise, and the $\alpha_j^{(i)}$ are the optimization variables. Let us denote the optimal value of $\alpha_j^{(i)}$ as $(\alpha_j^{(i)})^*$. The optimal solution of (1) can then be written as (Crammer & Singer, 2001)

  $(\mathbf{w}^{(i)})^* = \frac{1}{\lambda} \sum_{j=1}^{N} \big(\delta_{y_j}^{(i)} - (\alpha_j^{(i)})^*\big) \Phi(\mathbf{x}_j)$.   (6)

Since strong duality holds, the optimal primal and dual objectives coincide. Plugging (6) into (5) we get

  $\frac{\lambda}{2}\|\mathbf{w}^*\|^2 + \frac{1}{N}\sum_{j=1}^{N} l(\mathbf{w}^*; (\mathbf{x}_j, y_j)) = -\frac{\lambda}{2}\|\mathbf{w}^*\|^2 - \sum_{j=1}^{N}\sum_{i \in Y} (\alpha_j^{(i)})^* \delta_{y_j}^{(i)} + 1$.   (7)

Rearranging (7) and applying $l(\mathbf{w}^*; (\mathbf{x}_j, y_j)) \ge 0$ and $(\alpha_j^{(i)})^* \delta_{y_j}^{(i)} \ge 0$, we get

  $\lambda \|\mathbf{w}^*\|^2 = 1 - \sum_{j=1}^{N}\sum_{i \in Y} (\alpha_j^{(i)})^* \delta_{y_j}^{(i)} - \frac{1}{N}\sum_{j=1}^{N} l(\mathbf{w}^*; (\mathbf{x}_j, y_j)) \le 1$,

which gives $\|\mathbf{w}^*\| \le 1/\sqrt{\lambda}$. ∎

With Lemma 1, we are ready to prove the following theorem, which is a variant of Theorem 1 in Shalev-Shwartz et al. (2007) under the budgeted multi-class setting.

Theorem 1 (Bounding the average instantaneous objective of BPegasosM). Let BPegasosM (Algorithm 1) run on a sequence of examples $S$. Let $\Delta\mathbf{w}_t$ be the weight degradation due to budget maintenance (Line 7 in Algorithm 1). Define the gradient error as $\mathbf{E}_t = \Delta\mathbf{w}_t / \eta_t$ and $\bar{E} = \sum_{t=1}^{N} \|\mathbf{E}_t\| / N$. Let $\mathbf{w}^*$ be the optimal solution of (1). Then,

  $\frac{1}{N}\sum_{t=1}^{N} P_t(\mathbf{w}_t) \le \frac{1}{N}\sum_{t=1}^{N} P_t(\mathbf{w}^*) + \frac{G^2 (1 + \ln N)}{2\lambda N} + \frac{2\bar{E}}{\sqrt{\lambda}}$,   (8)

where $G = \sqrt{\lambda} + 2 + 2$.

Proof: First, we write the update rule of BPegasosM by treating $\mathbf{E}_t$ as an error in the gradient, $\mathbf{w}_{t+1} = \Pi_C(\mathbf{w}_t - \eta_t \tilde\nabla_t)$, where $\tilde\nabla_t = \nabla_t + \mathbf{E}_t$. Let us define the relative progress toward the optimal solution $\mathbf{w}^*$ at the $t$-th round as $D_t = \|\mathbf{w}_t - \mathbf{w}^*\|^2 - \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2$. $D_t$ can be lower bounded as

  $D_t = \|\mathbf{w}_t - \mathbf{w}^*\|^2 - \|\Pi_C(\mathbf{w}_t - \eta_t \tilde\nabla_t) - \mathbf{w}^*\|^2$
   $\ge_1 \|\mathbf{w}_t - \mathbf{w}^*\|^2 - \|\mathbf{w}_t - \eta_t \tilde\nabla_t - \mathbf{w}^*\|^2$
   $= -\eta_t^2 \|\tilde\nabla_t\|^2 + 2\eta_t \nabla_t^T (\mathbf{w}_t - \mathbf{w}^*) + 2\eta_t \mathbf{E}_t^T (\mathbf{w}_t - \mathbf{w}^*)$
   $\ge_2 -\eta_t^2 G^2 + 2\eta_t \Big(P_t(\mathbf{w}_t) - P_t(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2\Big) - \frac{4\eta_t \|\mathbf{E}_t\|}{\sqrt{\lambda}}$.   (9)

In $\ge_1$ we use the fact that $\|\Pi_C(\mathbf{a}) - \mathbf{b}\| \le \|\mathbf{a} - \mathbf{b}\|$ for all $\mathbf{b} \in C$ and any $\mathbf{a}$, since $C$ is convex. In $\ge_2$, we bound $\|\tilde\nabla_t\|$ as

  $\|\tilde\nabla_t\| \le \|\nabla_t\| + \|\mathbf{E}_t\| \le \lambda\|\mathbf{w}_t\| + 2 + 2 = \sqrt{\lambda} + 2 + 2 \triangleq G$,

where, by using (4), we observe that $\|\mathbf{E}_t\|$ is upper bounded by 2 when budget maintenance removes an arbitrary SV with index $t'$:

  $\|\mathbf{E}_t\| = \frac{\|\alpha_{t'} \Phi(\mathbf{x}_{t'})\|}{\eta_t} \le \frac{2\eta_{t'}}{\eta_t} \prod_{k=t'}^{t} \tau_k \prod_{k=t'+1}^{t} (1 - \eta_k \lambda) = 2 \prod_{k=t'}^{t} \tau_k \le 2$,   (10)

since $\eta_k = 1/(\lambda k)$ implies $\prod_{k=t'+1}^{t} (1 - \eta_k \lambda) = t'/t = \eta_t / \eta_{t'}$ and $\tau_k \le 1$. We also apply the property of strong convexity to bound

  $\nabla_t^T (\mathbf{w}_t - \mathbf{w}^*) \ge P_t(\mathbf{w}_t) - P_t(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2$,

since $P_t$ is a $\lambda$-strongly convex function with respect to $\frac{1}{2}\|\mathbf{w}\|^2$ (according to Lemma 1 in (Shalev-Shwartz & Singer, 2007)) and $\nabla_t \in \partial_{\mathbf{w}} P_t(\mathbf{w}_t)$. Finally, we bound $\|\mathbf{w}_t - \mathbf{w}^*\| \le 2/\sqrt{\lambda}$, since both $\mathbf{w}_t$ and $\mathbf{w}^*$ lie in the closed convex set with radius $1/\sqrt{\lambda}$ (Lemma 1).

Dividing both sides of inequality (9) by $2\eta_t$ and rearranging, we obtain

  $P_t(\mathbf{w}_t) - P_t(\mathbf{w}^*) \le \frac{D_t}{2\eta_t} - \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \frac{\eta_t G^2}{2} + \frac{2\|\mathbf{E}_t\|}{\sqrt{\lambda}}$.

Summing over all $t$ we get

  $\sum_{t=1}^{N} \big(P_t(\mathbf{w}_t) - P_t(\mathbf{w}^*)\big) \le \sum_{t=1}^{N} \frac{D_t}{2\eta_t} - \frac{\lambda}{2}\sum_{t=1}^{N}\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \frac{G^2}{2}\sum_{t=1}^{N} \eta_t + \frac{2}{\sqrt{\lambda}}\sum_{t=1}^{N} \|\mathbf{E}_t\|$.   (11)

We bound the first and second terms in (11) as

  $\sum_{t=1}^{N} \frac{D_t}{2\eta_t} - \frac{\lambda}{2}\sum_{t=1}^{N}\|\mathbf{w}_t - \mathbf{w}^*\|^2 = \Big(\frac{1}{2\eta_1} - \frac{\lambda}{2}\Big)\|\mathbf{w}_1 - \mathbf{w}^*\|^2 + \sum_{t=2}^{N}\Big(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} - \frac{\lambda}{2}\Big)\|\mathbf{w}_t - \mathbf{w}^*\|^2 - \frac{1}{2\eta_N}\|\mathbf{w}_{N+1} - \mathbf{w}^*\|^2$
  $=_3 -\frac{1}{2\eta_N}\|\mathbf{w}_{N+1} - \mathbf{w}^*\|^2 \le 0$.   (12)

In $=_3$, the first and second components vanish after plugging in $\eta_t = 1/(\lambda t)$. Next, we bound the third term in inequality (11) according to the divergence rate of the harmonic series,

  $\frac{G^2}{2}\sum_{t=1}^{N} \eta_t = \frac{G^2}{2\lambda}\sum_{t=1}^{N} \frac{1}{t} \le \frac{G^2}{2\lambda}(\ln N + 1)$.   (13)

Combining inequalities (12) and (13) with (11) and dividing both sides by $N$, we get the stated bound. ∎

Observe that as $N$ grows, the second term on the right side of bound (8) converges to zero. Therefore, the averaged instantaneous loss of Algorithm 1 converges toward the averaged instantaneous loss of the optimal solution, and the gap between the two is upper bounded by the averaged gradient error $\bar{E}$. This result directly indicates that an optimal budget maintenance strategy should minimize $\bar{E}$.

Corollary 1 (Mistake bound). Assume the conditions stated in Theorem 1 hold. Then the number of mistakes $M$ made by BPegasosM on the sequence $S$ satisfies

  $M \le \sum_{t=1}^{N} P_t(\mathbf{w}^*) + \frac{G^2 (1 + \ln N)}{2\lambda} + \frac{2\bar{E} N}{\sqrt{\lambda}}$,

where $G = \sqrt{\lambda} + 2 + 2$.

Proof: Using the fact that $l(\mathbf{w}_t; (\mathbf{x}_t, y_t)) \ge 1$ whenever the algorithm makes a mistake, and the fact that the accumulated multi-class hinge loss is no larger than the accumulated instantaneous objective, we get

  $M \le \sum_{t=1}^{N} l(\mathbf{w}_t; (\mathbf{x}_t, y_t)) \le \sum_{t=1}^{N} P_t(\mathbf{w}_t)$.

Combining this with the conclusion of Theorem 1 leads to the stated mistake bound. ∎

It is easy to show that the other convergence properties of Pegasos (Theorems 2 and 3 in Shalev-Shwartz et al. (2007)) are inherited by BPegasosM under the constraint of $\bar{E}$. If there is no budget maintenance step (i.e., $\bar{E} = 0$), we obtain the multi-class counterparts of Shalev-Shwartz et al.'s theorems. We omit this part due to the lack of space.
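As a sanity check on the role of the gradient error, the short script below (our own illustration, not part of the paper) plugs the coefficient formula (4) and the learning-rate schedule $\eta_t = 1/(\lambda t)$ into the removal case of (10) and verifies numerically that $\|\mathbf{E}_t\| = \|\Delta\mathbf{w}_t\|/\eta_t \le 2$, under the stated assumptions $\|\Phi(\mathbf{x})\| \le 1$ and $\tau_k \le 1$.

```python
import numpy as np

def removal_gradient_error(t_removed, t, lam, taus):
    """Upper bound on ||E_t|| when the SV inserted at round t_removed is
    removed at round t.  taus holds tau_k for k = t_removed, ..., t (all <= 1);
    ||Phi(x)|| is taken to be 1 (e.g., a Gaussian kernel)."""
    eta = lambda k: 1.0 / (lam * k)
    # ||alpha_{t'}|| <= 2 * eta_{t'} * prod_k tau_k * prod_k (1 - eta_k*lam), cf. (4)
    coeff = 2.0 * eta(t_removed) * np.prod(taus)
    coeff *= np.prod([1.0 - eta(k) * lam for k in range(t_removed + 1, t + 1)])
    return coeff / eta(t)                      # this is ||Delta w_t|| / eta_t

lam, rng = 1e-4, np.random.default_rng(0)
for _ in range(5):
    t_removed = int(rng.integers(1, 500))
    t = int(rng.integers(t_removed, 1001))
    taus = rng.uniform(0.5, 1.0, size=t - t_removed + 1)
    assert removal_gradient_error(t_removed, t, lam, taus) <= 2.0 + 1e-12
print("||E_t|| <= 2 held in all sampled cases")
```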
4. Budget Maintenance Strategies

Theorem 1 indicates that budget maintenance should attempt to minimize the gradient error $\bar{E}$. To minimize $\bar{E}$ in the setting of online learning on a budget, we propose to greedily minimize the gradient error $\|\mathbf{E}_t\|$ at each round. From the definition of $\mathbf{E}_t$ in Theorem 1, minimizing $\|\mathbf{E}_t\|$ is equivalent to minimizing the weight degradation $\|\Delta\mathbf{w}_t\|$,

  $\min \|\Delta\mathbf{w}_t\|^2$.   (14)

4.1 Budget maintenance through removal

If budget maintenance removes the $j$-th SV, then $\Delta\mathbf{w}_t = \Phi(\mathbf{x}_j)\alpha_j$. The optimal solution of (14) is therefore removal of the SV with the smallest norm,

  $p = \arg\min_{j \in I_{t+1}} \|\alpha_j\|^2 k(\mathbf{x}_j, \mathbf{x}_j)$.

Let us consider the Gaussian kernel, where $k(\mathbf{x}_j, \mathbf{x}_j) = 1$. Then, as seen from (4), the optimal removal always selects the oldest SV, and this strategy becomes similar to the Forgetron (Dekel et al., 2008).

4.2 Budget maintenance through projection

Let us consider budget maintenance through projecting the $p$-th SV onto the remaining SVs. The objective is to update the coefficients of the remaining SVs to best represent the contribution of the $p$-th SV:

  $\min_{\Delta\alpha} \sum_{i \in Y} \Big\| \alpha_p^{(i)} \Phi(\mathbf{x}_p) - \sum_{j \in I_{t+1} - p} \Delta\alpha_j^{(i)} \Phi(\mathbf{x}_j) \Big\|^2$.   (15)

After setting the gradient of (15) with respect to the class-specific column vector of coefficients $\Delta\alpha^{(i)}$ to zero, one obtains the optimal solution

  $\forall i \in Y:\quad \Delta\alpha^{(i)} = \alpha_p^{(i)} \mathbf{K}^{-1}\mathbf{k}_p$,   (16)

where $\mathbf{K} = [k_{ij}]$, $i, j \in I_{t+1} - p$, is the kernel matrix, $k_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and $\mathbf{k}_p = [k_{pj}]^T$, $j \in I_{t+1} - p$, is a column vector. It is worth observing that inverting $\mathbf{K}$ can be done in $O(B^2)$ time using the Woodbury formula (Cauwenberghs & Poggio, 2000). Finally, upon removal of the $p$-th SV, the $\Delta\alpha$ are added to the $\alpha$ of the remaining SVs.

The remaining issue is finding the best among the $B+1$ candidate SVs for projection. Plugging (16) into (15), we observe that the minimal weight degradation of projecting the $p$-th SV equals

  $\min \|\Delta\mathbf{w}_t\|^2 = \min_{p \in I_{t+1}} \|\alpha_p\|^2 \big(k_{pp} - \mathbf{k}_p^T \mathbf{K}^{-1} \mathbf{k}_p\big)$.   (17)

Considering that there are $B+1$ SVs, evaluation of (17) requires $O(B^3)$ time for each budget maintenance step. As an efficient approximation, we propose a simplified solution that always projects the smallest SV, $p = \arg\min_{j \in I_{t+1}} \|\alpha_j\|^2 k(\mathbf{x}_j, \mathbf{x}_j)$. The computation is then reduced to $O(B^2)$. We should also note that the space requirement of projection is $O(B^2)$, needed to store the kernel matrix and its inverse.

Unlike the recently proposed projection method for the multi-class perceptron (Orabona et al., 2009), which projects an SV only onto the SVs assigned to the same class, our method solves the more general case by projecting an SV onto all the remaining SVs, and thus results in smaller weight degradation.
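A small NumPy sketch of the projection step just described: it picks the smallest SV as in the approximation above and solves (16) with a direct linear solve (a one-off `numpy.linalg.solve` rather than the incrementally maintained inverse suggested by the Woodbury remark). The function name, data layout, and the tiny ridge term added for numerical stability are our own choices.

```python
import numpy as np

def project_out_smallest_sv(sv_x, sv_alpha, kernel):
    """Budget maintenance by projection (Section 4.2).

    sv_x     : list of B+1 support vector inputs.
    sv_alpha : (B+1, c) array whose rows are the coefficient vectors alpha_j.
    Returns the reduced SV list and the updated (B, c) coefficient array.
    """
    sv_alpha = np.asarray(sv_alpha, dtype=float)
    diag = np.array([kernel(x, x) for x in sv_x])
    # p: the SV with the smallest ||alpha_p||^2 k(x_p, x_p)
    p = int(np.argmin((sv_alpha ** 2).sum(axis=1) * diag))

    rest = [j for j in range(len(sv_x)) if j != p]
    K = np.array([[kernel(sv_x[i], sv_x[j]) for j in rest] for i in rest])
    k_p = np.array([kernel(sv_x[j], sv_x[p]) for j in rest])

    # (16): delta_alpha^(i) = alpha_p^(i) * K^{-1} k_p for every class i
    coeffs = np.linalg.solve(K + 1e-10 * np.eye(len(rest)), k_p)
    new_alpha = sv_alpha[rest] + np.outer(coeffs, sv_alpha[p])
    return [sv_x[j] for j in rest], new_alpha
```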
4.3 Budget maintenance through merging

Problem (14) can also be addressed by merging two SVs into a newly created one. The justification is as follows. For the $i$-th class weight, if $\Phi(\mathbf{x}_m)$ and $\Phi(\mathbf{x}_n)$ are replaced by $M^{(i)} = \big(\alpha_m^{(i)} \Phi(\mathbf{x}_m) + \alpha_n^{(i)} \Phi(\mathbf{x}_n)\big) / \big(\alpha_m^{(i)} + \alpha_n^{(i)}\big)$ (assuming $\alpha_m^{(i)} + \alpha_n^{(i)} \neq 0$) and the coefficient of $M^{(i)}$ is set to $\alpha_m^{(i)} + \alpha_n^{(i)}$, then the weight remains unchanged. The difficulty is that $M^{(i)}$ cannot be used directly because its pre-image may not exist. Therefore we approximate $M^{(i)}$ by the image $\Phi(\mathbf{z})$ of some input-space vector $\mathbf{z}$. Considering the multi-class problem, $\mathbf{z}$ can be found as

  $\min_{\mathbf{z}} \sum_{i \in Y} \| M^{(i)} - \Phi(\mathbf{z}) \|^2$.   (18)

Let us assume the Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\sigma \|\mathbf{x} - \mathbf{x}'\|^2)$ is used. Problem (18) can then be reduced to

  $\max_{\mathbf{z}} \sum_{i \in Y} (M^{(i)})^T \Phi(\mathbf{z})$.   (19)

Setting the gradient of (19) with respect to $\mathbf{z}$ to zero leads to the solution

  $\mathbf{z} = h \mathbf{x}_m + (1 - h) \mathbf{x}_n$,   (20)

where

  $h = \dfrac{\sum_{i \in Y} \frac{\alpha_m^{(i)}}{\alpha_m^{(i)} + \alpha_n^{(i)}} \exp(-\sigma \|\mathbf{x}_m - \mathbf{z}\|^2)}{\sum_{i \in Y} \sum_{k = m, n} \frac{\alpha_k^{(i)}}{\alpha_m^{(i)} + \alpha_n^{(i)}} \exp(-\sigma \|\mathbf{x}_k - \mathbf{z}\|^2)}$.

Eq. (20) indicates that $\mathbf{z}$ lies on the line connecting $\mathbf{x}_m$ and $\mathbf{x}_n$. Plugging (20) into (19), the merging problem is simplified to finding

  $\max_{h \in R} \; \sum_{i \in Y} \frac{\alpha_m^{(i)}}{\alpha_m^{(i)} + \alpha_n^{(i)}} \, k_{mn}^{(1-h)^2} + \sum_{i \in Y} \frac{\alpha_n^{(i)}}{\alpha_m^{(i)} + \alpha_n^{(i)}} \, k_{mn}^{h^2}$,   (21)

where $k_{mn} = k(\mathbf{x}_m, \mathbf{x}_n)$. We can use any efficient line search method (e.g., golden section search) to find the optimal $h$. After that, the optimal $\mathbf{z}$ can be calculated using (20). After obtaining the optimal $\mathbf{z}$, the optimal coefficients $\alpha_z = [\alpha_z^{(1)} \ldots \alpha_z^{(c)}]$ of $\mathbf{z}$ for approximating $\alpha_m^{(i)} \Phi(\mathbf{x}_m) + \alpha_n^{(i)} \Phi(\mathbf{x}_n)$ are obtained by solving the following optimization problem:

  $\min_{\alpha_z} \sum_{i \in Y} \| \alpha_m^{(i)} \Phi(\mathbf{x}_m) + \alpha_n^{(i)} \Phi(\mathbf{x}_n) - \alpha_z^{(i)} \Phi(\mathbf{z}) \|^2$.   (22)

The optimal solution of (22) is

  $(\alpha_z^{(i)})^* = \alpha_m^{(i)} k(\mathbf{x}_m, \mathbf{z}) + \alpha_n^{(i)} k(\mathbf{x}_n, \mathbf{z}), \quad \forall i \in Y$.

The total cost of finding the optimal merging of the $m$-th and $n$-th SV is $O(1)$. The remaining question is which pair of SVs leads to the smallest weight degradation. The optimal pair can be found by evaluating the merging of all $B(B-1)/2$ pairs of SVs, which would require $O(B^2)$ time. To simplify the computation, we use the same approximation as in projection (Section 4.2) and fix $m$ as the SV with the smallest value of $\|\alpha_m\|^2$. The computation is thus reduced to $O(B)$. We should observe that the space requirement is also $O(B)$, because there is no need to store the kernel matrix.
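The merging step can be sketched as follows for a Gaussian kernel: a golden-section search over $h$ maximizes (21), then $\mathbf{z}$ follows from (20) and the coefficients from the closed-form solution of (22). This is our own illustration; in particular, restricting the search to $h \in [0, 1]$ (the segment between the two SVs) is a simplifying assumption on top of the paper's unconstrained line search, and all function names are ours.

```python
import numpy as np

def golden_section_max(f, lo=0.0, hi=1.0, tol=1e-6):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    gr = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if f(c) >= f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return 0.5 * (a + b)

def merge_svs(x_m, a_m, x_n, a_n, sigma):
    """Merge two SVs (Section 4.3) under k(x, x') = exp(-sigma * ||x - x'||^2).

    a_m, a_n : (c,) coefficient rows of the two SVs being merged.
    Returns the merged input z and its (c,) coefficient row alpha_z.
    """
    s = a_m + a_n
    s = np.where(np.abs(s) < 1e-12, 1e-12, s)   # the derivation assumes a_m^(i) + a_n^(i) != 0
    k_mn = np.exp(-sigma * np.sum((x_m - x_n) ** 2))

    def objective(h):                           # the target of (21)
        return (np.sum(a_m / s) * k_mn ** ((1.0 - h) ** 2)
                + np.sum(a_n / s) * k_mn ** (h ** 2))

    h = golden_section_max(objective)
    z = h * x_m + (1.0 - h) * x_n               # (20): z is on the segment between x_m and x_n
    k_mz = np.exp(-sigma * np.sum((x_m - z) ** 2))
    k_nz = np.exp(-sigma * np.sum((x_n - z) ** 2))
    alpha_z = a_m * k_mz + a_n * k_nz           # optimal coefficients from (22)
    return z, alpha_z
```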
5. Experiments

In this section we evaluate the proposed algorithms on several large datasets whose properties are summarized in the first column of Table 1. Checkerboard is a synthetically generated dataset; Letter, USPS, Covertype, and Waveform are standard UCI Repository benchmarks. Attributes in all data sets were scaled to mean 0 and standard deviation 1.

In the experiments, we evaluated the budget maintenance methods for BPegasosM described in Section 4. We used budgets of B = 100 and 500 and set $\lambda = 10^{-4}$. In particular, we compared the proposed projection and merging methods with the random removal method (Cesa-Bianchi & Gentile, 2006) and the method that removes the SV with the smallest coefficient (Cheng et al., 2007). Results of the non-budgeted PegasosM are also reported to establish an upper bound on accuracy. Besides our BPegasosM framework, we also evaluated the Projectron++ algorithm (Orabona et al., 2009), which is the state-of-the-art budgeted kernel perceptron algorithm. In Projectron++, an SV is projected only if the model degradation is below a threshold; otherwise, the budget is increased by one SV. In our experiments, we set the Projectron++ threshold such that the number of SVs equals the budget B of BPegasosM at the end of training. All algorithms used Gaussian kernels whose width was optimized for each combination of data set, algorithm, and budget, choosing among $\{2^0/d, 2^2/d, 2^4/d, 2^6/d\}$, where $d$ is the data dimensionality. All algorithms were implemented in MATLAB and the experiments were run on a 2.1 GHz dual-core processor with 4 GB of memory under Linux.

The accuracies of the different algorithms on test data are reported in Table 1. Each result is an average of 5 repetitions on different permutations of each data set. PegasosM was stopped after 3,000 seconds.

Table 1. Testing accuracy comparison. In the PegasosM column, the first number in parentheses is the final number of SVs; where PegasosM was early-stopped, the second number is the number of examples processed before the stop.

DATA SET (N, DIM, |Y|)     PEGASOSM (#SV)               B    PRJTRN++   BPM+RAND   BPM+SMAL   BPM+PROJ   BPM+MRG
USPS (7K, 256, 10)         94.2±0.3 (4.2K)              100  81.1±3.2   78.3±1.5   78.6±4.0   90.5±0.4   92.0±0.2
                                                        500  92.0±0.5   88.5±0.6   89.4±0.6   92.4±0.4   93.9±0.3
LETTER (16K, 16, 26)       95.7±0.1 (10K)               100  46.3±1.8   39.9±1.8   41.7±0.7   76.3±0.9   72.0±1.3
                                                        500  76.6±1.1   68.1±0.8   68.5±1.1   87.3±0.6   89.5±0.3
COVERTYPE (0.5M, 54, 7)    81.1±0.1 (41K, 72K ex.)      100  62.5±3.1   58.1±0.7   57.5±2.4   70.1±0.6   72.0±0.2
                                                        500  67.3±2.9   61.6±0.7   62.3±1.1   74.9±0.2   76.8±0.2
WAVEFORM (2M, 21, 3)       86.1±0.6 (53K, 140K ex.)     100  80.7±0.8   79.1±1.1   79.1±2.9   85.0±0.4   85.9±0.6
                                                        500  83.5±0.5   82.7±0.6   82.9±1.0   86.1±0.2   86.9±0.1
CHECKERB (10M, 2, 2)       99.2±0.1 (63K, 540K ex.)     100  96.9±0.4   83.6±1.0   83.9±1.1   98.2±0.3   99.5±0.1
                                                        500  98.2±0.3   90.3±0.4   90.9±0.5   99.0±0.2   99.8±0.0

From Table 1 we observe that BPegasosM with merging and projection significantly outperformed both their removal-based cousins and Projectron++. Merging resulted in somewhat better performance than projection. On the Waveform and Checkerboard data, BPegasosM with merging achieved even higher accuracy than the non-budgeted PegasosM, which had to be stopped after 140K and 540K examples, respectively. On the Covertype and Letter data, the accuracy gap between budget B = 500 and the non-budgeted algorithm remained large, which can be explained by the complexity of these problems; for example, 60% of the Covertype examples became SVs in PegasosM, and Letter has 26 class labels. On both data sets the accuracy clearly improved from B = 100 to B = 500, which indicates that a larger budget is needed for comparable accuracy.

In Figures 1 and 2, accuracy evolution curves are plotted to illustrate the anytime prediction quality. The accuracy curves of BPegasosM with merging and projection closely followed and eventually surpassed that of PegasosM. BPegasosM with removal did not perform particularly well. Figure 3 shows a log-log plot of the running time versus the data stream length. Excluding the initial stage, the non-budgeted PegasosM had the fastest increase in training time, confirming the expected O(N^2) runtime. On the budgeted side, the runtime of BPegasosM with merging and projection increases linearly, with merging having the better constant, as expected (merging costs O(B) per maintenance step and projection O(B^2)). Considering accuracy, runtime, and memory expenditure, BPegasosM with merging is the clear winner on all 5 benchmark datasets.
[Figure 1. Accuracy evolution curve on the 10 million Checkerboard data (B = 100). Axes: accuracy vs. length of data sequence (log scale); curves: non-budgeted, Random, Smallest, Proj, Mrg; the non-budgeted PegasosM was early-stopped after 3,000 s with #SV = 63K.]

[Figure 2. Accuracy evolution curve on the 2 million Waveform data (B = 100). Axes: accuracy vs. length of data sequence (log scale); curves: non-budgeted, Random, Smallest, Proj, Mrg; the non-budgeted PegasosM was early-stopped after 3,000 s with #SV = 54K.]

[Figure 3. Training time curves on the 2 million Waveform data. Axes: training time (seconds) vs. length of data sequence (log-log); curves: Proj (B = 100), Proj (B = 500), Mrg (B = 100), Mrg (B = 500), non-budgeted.]

6. Conclusion

We proposed a family of kernel-based budgeted online algorithms for multi-class SVM training based on the Pegasos algorithm. We obtained theoretical guarantees for their performance which indicate that their success is closely tied to the model degradation caused by budget maintenance. Based on this analysis, three budget maintenance methods were studied. We experimentally evaluated the proposed methods in terms of accuracy and training time. The results indicate that highly accurate multi-class kernel classifiers can be trained on high-throughput data streams while having a very modest memory footprint.

Acknowledgements

This work was supported by the U.S. National Science Foundation Grant IIS-0546155. Koby Crammer is a Horev Fellow, supported by the Taub Foundations.
References

Cauwenberghs, G. & Poggio, T. (2000). Incremental and decremental support vector machine learning. NIPS, 13, 409-415.

Cesa-Bianchi, N. & Gentile, C. (2006). Tracking the best hyperplane with a simple budget perceptron. COLT.

Cheng, L., Vishwanathan, S. V. N., Schuurmans, D., Wang, S. & Caelli, T. (2007). Implicit online learning with kernels. NIPS, 19, 249-256.

Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Crammer, K., Kandola, J. & Singer, Y. (2004). Online classification on a budget. NIPS, 16, 225-232.

Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 262-292.

Csató, L. & Opper, M. (2001). Sparse representation for Gaussian process models. NIPS, 13, 444-450.

Dekel, O., Shalev-Shwartz, S. & Singer, Y. (2008). The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37, 1342-1372.

Engel, Y., Mannor, S. & Meir, R. (2002). Sparse online greedy support vector regression. ECML.

Hsu, C.-W. & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415-425.

Nguyen, D. & Ho, T. (2005). An efficient method for simplifying support vector machines. ICML, 617-624.

Orabona, F., Keshet, J. & Caputo, B. (2009). Bounded kernel-based online learning. JMLR, 10, 2643-2666.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-408.

Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.-R., Rätsch, G. & Smola, A. J. (1999). Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10, 1000-1017.

Shalev-Shwartz, S. & Singer, Y. (2007). Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University.

Shalev-Shwartz, S., Singer, Y. & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. ICML, 807-814.

Sutskever, I. (2009). A simpler unified analysis of budget perceptrons. ICML, 985-992.

Vucetic, S., Coric, V. & Wang, Z. (2009). Compressed kernel perceptrons. DCC, 153-162.

Wang, Z. & Vucetic, S. (2009). Twin vector machines for online training on a budget. SDM.

Wang, Z. & Vucetic, S. (2009). Tighter perceptron with improved dual use of cached data for model representation and validation. IJCNN, 2766-2771.

Wang, Z. & Vucetic, S. (2010). Online passive-aggressive algorithms on a budget. AISTATS.

Weston, J., Bordes, A. & Bottou, L. (2005). Online (and offline) on an even tighter budget. AISTATS, 413-420.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. ICML, 928-935.