Online Learning of Approximate Maximum p-Norm Margin Classifiers with Bias

Kosuke Ishibashi, Kohei Hatano and Masayuki Takeda
Department of Informatics, Kyushu University
744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
{k-ishi, hatano, takeda}@i.kyushu-u.ac.jp

Abstract

We propose a new online learning algorithm which provably approximates maximum margin classifiers with bias, where the margin is defined in terms of p-norm distance. Although learning of linear classifiers with bias can be reduced to learning of those without bias, the known reduction might lose the margin and slow down the convergence of online learning algorithms. Our algorithm, unlike previous online learning algorithms, implicitly uses a new reduction which preserves the margin and avoids such possible deficiencies. Our preliminary experiments show that our algorithm runs much faster than previous algorithms, especially when the underlying linear classifier has large bias.

1 Introduction

Large margin classification methods are quite popular in Machine Learning and related research areas. Various generalization bounds (e.g., [32, 34, 10]) guarantee that linear classifiers with large margin over training data have small generalization error with high probability. The Support Vector Machine (SVM) [5] is one of the most powerful among such methods. The central idea of the SVM is to find the maximum 2-norm margin hyperplane over linearly separable data. Further, by using kernels and soft margin formulations, it can learn large margin hyperplanes over linearly inseparable data as well. The problem of finding the maximum 2-norm margin hyperplane over data is formulated as a quadratic programming problem, so the task of the SVM can be solved in polynomial time by standard optimization methods. On the other hand, solving quadratic programming problems is time-consuming, especially for the huge datasets that are now common in many applications. This motivates much research on making the SVM more scalable.
One of the major approaches is to decompose the original quadratic programming problem into smaller subproblems which are easier to solve [28, 29, 16, 8, 17]. Another popular approach is to apply online learning algorithms. Online learning algorithms such as the Perceptron [31, 27, 26] and its variants [1, 11, 22, 13] work in iterations, where at each iteration they process only one instance and update their hypotheses successively. Online learning algorithms use less memory and are easy to implement. Many online learning algorithms that find large margin classifiers have been proposed, including, e.g., Kernel Adatron [12], Voted Perceptron [11], Max Margin Perceptron [21], ROMMA [22], ALMA [13], NORMA [19], LASVM [4], MICRA [35], and Pegasos [33]. However, most of these online learning algorithms do not fully exploit the linear separability of data. More precisely, they are designed to learn homogeneous hyperplanes, i.e., hyperplanes that pass through the origin, and they cannot learn linear classifiers with bias directly. So, in order to learn linear classifiers with bias, typical online learning algorithms map instances from the original space $\mathbb{R}^n$ to an augmented space $\mathbb{R}^{n+1}$ with an extra dimension by using the mapping $\phi: x \mapsto \tilde{x} = (x, -R)$, where $R$ is the maximum 2-norm of instances [10]. Then, a hyperplane with bias $(w, b)$ in the original space corresponds to the hyperplane $\tilde{w} = (w, -b/R)$ without bias in the augmented space, since $w \cdot x + b = \tilde{w} \cdot \tilde{x}$. So, by using this mapping, learning linear classifiers with bias can be reduced to learning those without bias. But this mapping weakens the guarantee of the margin. Suppose that for a sequence of labeled examples $(x_1, y_1), \ldots, (x_T, y_T)$ ($x_t \in \mathbb{R}^n$ and $y_t \in \{-1, +1\}$ for $t = 1, \ldots, T$), there is a hyperplane with bias $(u, b)$ that has margin

$$\gamma = \min_{t=1,\ldots,T} \frac{y_t (u \cdot x_t + b)}{\|u\|_2 R},$$

where instances are normalized by $R$ so that the maximum 2-norm of instances is one.
Then the corresponding hyperplane $\tilde{u} = (u, -b/R)$ over the augmented space, in which the maximum 2-norm of instances is bounded by $\sqrt{2}R$, has margin

$$\tilde{\gamma} = \min_{t=1,\ldots,T} \frac{y_t (\tilde{u} \cdot \tilde{x}_t)}{\sqrt{2}\,\|\tilde{u}\|_2 R} = \min_{t=1,\ldots,T} \frac{y_t (u \cdot x_t + b)}{\sqrt{2}\,\|\tilde{u}\|_2 R} \ge \frac{\gamma}{2},$$

since $\|\tilde{u}\|_2^2 = \|u\|_2^2 + b^2/R^2 \le 2\|u\|_2^2$ and $\|\tilde{x}\|_2 \le \sqrt{2}R$. Even though the loss of margin is at most a constant factor, it might cause a significant difference in prediction performance in practical applications. In this paper, we propose a new online learning algorithm that approximately maximizes the margin. Our algorithm, PUMMA (P-norm Utilizing Maximum Margin Algorithm), extends ROMMA [22] in two ways. First, PUMMA can optimize the bias directly, by using an implicit reduction from learning of linear classifiers with bias to learning of those without bias, instead of using the mapping $\phi$. Second, PUMMA can provably approximate the maximum p-norm margin classifier for $p \ge 2$. A benefit of maximizing the p-norm margin is that we can find sparse linear classifiers quickly. Technically speaking, PUMMA is a variant of the p-norm algorithm [15, 14]. It is known that, if we set $p = \infty$ or $p = O(\ln n)$, the p-norm algorithm behaves like online multiplicative update algorithms such as Winnow [23], which can converge exponentially faster than the Perceptron when the underlying linear classifier is sparse. For example, if the target concept is a $k$-disjunction over $n$ boolean variables, Winnow can find a consistent hypothesis in $O(k \ln n)$ mistakes, while the Perceptron needs $\Omega(kn)$ mistakes [20]. We show that PUMMA, given parameters $\delta$ ($0 < \delta \le 1$) and $p \ge 2$, finds a linear classifier which has p-norm margin at least $(1-\delta)\gamma$ in $O\left(\frac{(p-1)R^2}{\delta^2\gamma^2}\right)$ updates, when there exists a hyperplane with p-norm margin $\gamma$ that separates the given sequence of data. The worst-case iteration bound of PUMMA is the same as those of typical Perceptron-like algorithms for $p = 2$ and as that of ALMA [13] for $p > 2$, but PUMMA is potentially faster than these previous algorithms, especially when the underlying linear classifier has large bias.
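The margin loss caused by the classical augmentation can be checked numerically. The following is a minimal sketch (the toy data and function names are ours, not the paper's):

```python
import numpy as np

def augment(x, R):
    """Map x in R^n to (x, -R) in R^(n+1) (the classical bias reduction)."""
    return np.append(x, -R)

def margin(w, b, X, y, R):
    """2-norm geometric margin of (w, b), with instances normalized by R."""
    return min(yt * (w @ xt + b) / (np.linalg.norm(w) * R) for xt, yt in zip(X, y))

# hypothetical toy data separable by a hyperplane with nonzero bias
X = np.array([[0.9, 0.1], [0.8, 0.3], [-0.2, -0.9], [-0.4, -0.7]])
y = np.array([+1, +1, -1, -1])
R = max(np.linalg.norm(xt) for xt in X)

u, b = np.array([1.0, 1.0]), 0.5
gamma = margin(u, b, X, y, R)

# the same separator in the augmented space: u_tilde = (u, -b/R), bias 0
u_t = np.append(u, -b / R)
X_t = np.array([augment(xt, R) for xt in X])
R_t = max(np.linalg.norm(xt) for xt in X_t)   # bounded by sqrt(2) * R
gamma_t = margin(u_t, 0.0, X_t, y, R_t)

assert gamma_t >= gamma / 2   # margin shrinks by at most a factor of 2
```

On this toy data the augmented margin is strictly smaller than the original one, which is exactly the degradation PUMMA's implicit reduction avoids.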
For linearly inseparable data, PUMMA can use kernels and the 2-norm soft margin formulation [9] for $p = 2$, as previous Perceptron-like online learning algorithms can. Further, we extend PUMMA to deal with the 2-norm soft margin formulation for $p > 2$. Note that in standard implementations of the SVM [16, 8, 17], the 1-norm soft margin formulation (see, e.g., [10]) is preferred, since it often requires less computation time. However, in general, the two soft margin formulations are incomparable in terms of generalization ability, which depends on the data and the choice of kernel. For online-based implementations of the SVM with the 1-norm soft margin, see LASVM [4] and Pegasos [33]. There are other related works. For $p = 2$, previous algorithms such as Kernel Adatron [12], NPA [18], the SMO algorithm [29], Max Margin Perceptron [21], and LASVM [4] can find the bias directly as well. However, the first three algorithms are not suitable for the online setting, since they need to store past examples to compute the bias. Max Margin Perceptron finds the same solution as our algorithm, but its upper bound on the number of updates is $\ln(R/\gamma)$ times worse than that of PUMMA. For LASVM, there is no theoretical analysis of its convergence rate. For $p = \infty$, the ROME algorithm [24] is also similar to our present work. It is an online learning algorithm that finds an accurate linear classifier quickly when the margin of the underlying classifier is defined as $\infty$-norm distance. On the other hand, ROME requires prior knowledge of the margin and bias. For a more general convex optimization technique which includes ROMMA as a special case, see [3]. In our preliminary experiments, PUMMA converges faster than previous online algorithms over artificial datasets, especially when the underlying linear classifier has large bias. In particular, for $p = O(\ln n)$, PUMMA is 2 to 10 times faster than ALMA. Over real datasets, PUMMA often outperforms previous online algorithms.
2 Preliminaries

2.1 Norm

For any vector $x \in \mathbb{R}^n$ and $p > 0$, the p-norm of $x$ is given as $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$. In particular, $\|x\|_\infty$ is given as $\|x\|_\infty = \max_i |x_i|$. It can be shown that, for any fixed $x \in \mathbb{R}^n$, the p-norm $\|x\|_p$ is decreasing with respect to $p$, i.e., $\|x\|_p \ge \|x\|_{p'}$ for any $0 < p \le p'$. For $p > 1$, the q-norm is dual to the p-norm if $1/q = 1 - 1/p$. For $p \ge 1$ and $q$ such that $1/p + 1/q = 1$, it is known that $\|x\|_\infty \le \|x\|_p \le \|x\|_1$ and $\|x\|_p \le n^{1/p}\|x\|_\infty$.

2.2 Online learning

We consider the standard setting of online learning of linear classifiers, in which learning proceeds in trials. At each trial $t$, the learner receives an instance $x_t \in \mathbb{R}^n$, and it predicts a label $\hat{y}_t \in \{-1, +1\}$. Then the learner receives the true label $y_t \in \{-1, +1\}$, and it possibly updates its current hypothesis depending on the received label. In this paper, we assume that labels are determined by a linear classifier $f(x) = \mathrm{sign}(w \cdot x + b)$ for some weight vector $w \in \mathbb{R}^n$ and bias $b \in \mathbb{R}$, where $\mathrm{sign}(a) = +1$ if $a \ge 0$ and $\mathrm{sign}(a) = -1$ otherwise. In particular, if $\hat{y}_t \ne y_t$, we say that the learner makes a mistake. A typical goal of online learning is to make the number of mistakes as small as possible. Most known online algorithms are mistake-driven; that is, they update their hypotheses when they make a mistake. The p-norm distance between a hyperplane and a point is computed as follows:

Lemma 1 ([25]) Let $V = \{v \in \mathbb{R}^n \mid w \cdot v + b = 0\}$. Then, for any $x \in \mathbb{R}^n$,

$$\min_{v \in V} \|x - v\|_p = \frac{|w \cdot x + b|}{\|w\|_q},$$

where $q = 1/(1 - 1/p)$.

Based on Lemma 1, the p-norm (geometric) margin of a hyperplane $(w, b)$ over an example $(x, y)$ is defined as

$$\frac{y (w \cdot x + b)}{\|w\|_q}.$$

For any sequence of examples $S = ((x_1, y_1), \ldots, (x_T, y_T))$ ($T \ge 1$), the margin of a hyperplane $(w, b)$ over $S$ is defined as the minimum margin over examples in $S$. The algorithms we consider update their hypotheses not only when they make a mistake, but also when their hypotheses have insufficient margin.
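For concreteness, the margin formula of Lemma 1 can be computed directly with the dual q-norm (a minimal sketch; the function names are ours):

```python
import numpy as np

def dual_exponent(p):
    """q such that 1/p + 1/q = 1, i.e. q = 1/(1 - 1/p)."""
    return p / (p - 1.0)

def pnorm_margin(w, b, X, y, p):
    """Minimum p-norm geometric margin of (w, b) over the sample (Lemma 1)."""
    q = dual_exponent(p)
    wq = np.linalg.norm(w, ord=q)  # dual q-norm of the weight vector
    return min(yt * (w @ xt + b) / wq for xt, yt in zip(X, y))

X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# margin measured as 2-norm distance: here 2 / sqrt(2) = sqrt(2)
print(pnorm_margin(w, b, X, y, p=2))
```

Changing `p` changes the geometry: larger `p` measures distance in a norm closer to the ∞-norm, with the weight vector judged in the smaller dual q-norm.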
In this paper, the learner's goal is to minimize the number of updates made in order to obtain a linear classifier with approximately maximum p-norm margin over the given sequence of examples. (More generally, Lemma 1 holds for an arbitrary norm and its dual norm.)

2.3 Convex duality

We review basic results on convex analysis. Let $F: \mathbb{R}^n \to \mathbb{R}$ be a strictly convex differentiable function. The Legendre dual of $F$, denoted $F^*$, is defined by

$$F^*(\theta) = \sup_{w \in \mathbb{R}^n} (\theta \cdot w - F(w)).$$

It can be verified that $F^*$ is also strictly convex and differentiable. Then the following lemma holds:

Lemma 2 ([30, 7]) 1. $F^{**} = F$. 2. $F(w) + F^*(\theta) = \theta \cdot w$ if and only if $\theta = \nabla F(w)$. 3. $\nabla F^* = (\nabla F)^{-1}$.

In particular, we use $F(w) = \frac{1}{2}\|w\|_q^2$ throughout this paper. Let $f = \nabla F$, that is,

$$f(w)_i = \frac{\mathrm{sign}(w_i)|w_i|^{q-1}}{\|w\|_q^{q-2}}.$$

By Lemma 2 and some calculations, we obtain the following property.

Lemma 3 ([14]) 1. The inverse $f^{-1}$ of $f$ is given as

$$f^{-1}(\theta)_i = \frac{\mathrm{sign}(\theta_i)|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}},$$

where $1/p + 1/q = 1$. 2. $\|f(w)\|_p = \|w\|_q$. 3. $w \cdot f(w) = \|f(w)\|_p^2 = \|w\|_q^2$.

Finally, we will use the following bound later.

Proposition 1 ([15, 14]) Let $G(\theta) = \frac{1}{2}\|\theta\|_p^2$ with $p \ge 2$ and let $g = \nabla G$. Then it holds for any $\theta$ and $a$ that

$$G(\theta + a) \le G(\theta) + g(\theta) \cdot a + \frac{(p-1)}{2}\|a\|_p^2.$$

It can be shown that the constraints of the problem (2) are relaxed; that is, the constraints of the problem (2) are weaker than those of the problem (1) when $p = 2$ and $b_t$ is fixed to 0. In fact, the second constraint in (2) corresponds to the halfspace that contains the polyhedron representing the constraints $y_j(w \cdot x_j) \ge 1$ ($j = 1, \ldots, t-2$). Our algorithm, PUMMA, generalizes ROMMA in two ways: (i) PUMMA can maximize any p-norm margin with $p \ge 2$; (ii) PUMMA can directly learn non-homogeneous hyperplanes. PUMMA takes $\delta$ ($0 < \delta \le 1$) and $p$ ($p \ge 2$) as parameters. For initialization, it requires an initial weight vector $w_0 = 0 \in \mathbb{R}^n$ and positive and negative instances $x_1^{pos}$ and $x_1^{neg}$, respectively.
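The link function $f$ and its inverse from Lemma 3 translate directly into code; the sketch below (our own naming) checks the lemma's identities numerically:

```python
import numpy as np

def f(w, q):
    """Link function f = gradient of F(w) = 0.5 * ||w||_q^2."""
    wq = np.linalg.norm(w, ord=q)
    return np.sign(w) * np.abs(w) ** (q - 1) / wq ** (q - 2)

def f_inv(theta, p):
    """Inverse link: gradient of the Legendre dual G(theta) = 0.5 * ||theta||_p^2."""
    tp = np.linalg.norm(theta, ord=p)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / tp ** (p - 2)

p = 4.0
q = p / (p - 1.0)          # 1/p + 1/q = 1
w = np.array([0.5, -2.0, 1.0])
theta = f(w, q)

# Lemma 3: ||f(w)||_p = ||w||_q, f^{-1} inverts f, and w . f(w) = ||w||_q^2
assert abs(np.linalg.norm(theta, ord=p) - np.linalg.norm(w, ord=q)) < 1e-10
assert np.allclose(f_inv(theta, p), w)
assert abs(w @ theta - np.linalg.norm(w, ord=q) ** 2) < 1e-10
```

For $p = q = 2$ both maps reduce to the identity, which is why the $p = 2$ case of PUMMA admits a purely analytical update.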
These two examples are easily obtained by predicting $-1$ until the first positive example appears and predicting $+1$ until the first negative example appears. If either a positive or a negative example cannot be obtained, then the number of updates is at most 1. Then, given a sequence $S = ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}))$ of examples and an instance $x_t$, PUMMA predicts $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t + b_t)$, where $(w_t, b_t)$ is given as follows:

$$(w_t, b_t) = \arg\min_{w, b} \frac{1}{2}\|w\|_q^2 \quad (3)$$
$$\text{subject to: } w \cdot x_t^{pos} + b \ge 1, \quad w \cdot x_t^{neg} + b \le -1, \quad w \cdot f(w_{t-1}) \ge \|w_{t-1}\|_q^2,$$

where $q = 1/(1 - 1/p)$, and $x_t^{pos}$ and $x_t^{neg}$ are the last positive and negative examples which incurred updates, respectively. If $y_t (w_t \cdot x_t + b_t) < 1 - \delta$, PUMMA$_p(\delta)$ updates $(x_{t+1}^{pos}, x_{t+1}^{neg}) = (x_t, x_t^{neg})$ if $y_t = +1$, and $(x_{t+1}^{pos}, x_{t+1}^{neg}) = (x_t^{pos}, x_t)$ otherwise.

3.1 Solution of the optimization problem (3)

Now we show the solution of the optimization problem (3). In this subsection, for simplicity, we denote $v = w_{t-1}$, $\theta = f(w_{t-1})$, $x^{pos} = x_t^{pos}$ and $x^{neg} = x_t^{neg}$. Let $L$ be the Lagrangian, that is,

$$L(w, b, \alpha, \beta) = \frac{1}{2}\|w\|_q^2 + \sum_{\ell \in \{pos, neg\}} \alpha_\ell \{1 - y^\ell (w \cdot x^\ell + b)\}$$

3 PUMMA

We consider the learning of maximum p-norm margin classifiers in the online learning setting. By Lemma 1, the problem of finding the maximum p-norm margin hyperplane over a sequence of labeled examples $S = ((x_1, y_1), \ldots, (x_T, y_T))$ is formulated as follows:

$$\min_{w, b} \frac{1}{2}\|w\|_q^2 \quad (1)$$
$$\text{subject to: } y_t (w \cdot x_t + b) \ge 1 \quad (1 \le t \le T),$$

where $q$ is such that $1/p + 1/q = 1$. Since the problem (1) is a convex optimization problem with linear inequality constraints, it can be solved by optimization methods such as interior-point methods [6]. However, in the context of online learning, it is time-consuming to solve the problem (1) at each trial. Further, it is necessary to store all the past examples. For $p = 2$, Li and Long proposed an elegant solution of the problem (1) in the online learning setting [22].
Their algorithm, ROMMA, is an online learning algorithm that finds approximate maximum 2-norm margin hyperplanes without bias. At each trial $t$, given an instance $x_t$, ROMMA predicts $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$ such that

$$w_t = \arg\min_{w} \frac{1}{2}\|w\|_2^2 \quad (2)$$
$$\text{subject to: } y_{t-1}(w \cdot x_{t-1}) \ge 1 \text{ and } w \cdot w_{t-1} \ge \|w_{t-1}\|_2^2.$$

The Lagrangian above is completed by the term

$$\cdots + \beta(\|v\|_q^2 - \theta \cdot w), \quad (4)$$

where $y^{pos} = +1$ and $y^{neg} = -1$. Then the partial derivatives of $L$ w.r.t. $w_i$ and $b$ are given respectively as

$$\frac{\partial L}{\partial w_i} = f(w)_i - \sum_{\ell \in \{pos, neg\}} \alpha_\ell y^\ell x_i^\ell - \beta\theta_i, \quad (5)$$
$$\frac{\partial L}{\partial b} = -(\alpha_{pos} - \alpha_{neg}). \quad (6)$$

Since the solution $(w^*, b^*)$ must make the partial derivatives (5) and (6) zero, the vector $w^*$ is specified as $w^* = f^{-1}(\alpha z + \beta\theta)$, where $\alpha = \alpha_{pos} = \alpha_{neg}$, $z = x^{pos} - x^{neg}$, and

$$f^{-1}(\theta)_i = \frac{\mathrm{sign}(\theta_i)|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}.$$

PUMMA$_p(\delta)$
begin
1. (Initialization) Get examples $(x_1^{pos}, +1)$ and $(x_1^{neg}, -1)$. Let $w_0 = (0, \ldots, 0) \in \mathbb{R}^n$.
2. For $t = 1$ to $T$,
 (a) Receive an instance $x_t$.
 (b) Let $(w_t, b_t) = \arg\min_{w, b} \frac{1}{2}\|w\|_q^2$, subject to: $w \cdot x_t^{pos} + b \ge 1$, $w \cdot x_t^{neg} + b \le -1$, and $w \cdot f(w_{t-1}) \ge \|w_{t-1}\|_q^2$.
 (c) Predict $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t + b_t)$.
 (d) Receive the label $y_t$. If $y_t(w_t \cdot x_t + b_t) < 1 - \delta$, update
  $(x_{t+1}^{pos}, x_{t+1}^{neg}) = (x_t, x_t^{neg})$ if $y_t = +1$, and $(x_{t+1}^{pos}, x_{t+1}^{neg}) = (x_t^{pos}, x_t)$ if $y_t = -1$.
  Otherwise, let $(x_{t+1}^{pos}, x_{t+1}^{neg}) = (x_t^{pos}, x_t^{neg})$.
end.

Further, $\alpha$ and $\beta$ satisfy the following equations:

$$f^{-1}(\alpha z + \beta\theta) \cdot z = 2, \qquad f^{-1}(\alpha z + \beta\theta) \cdot \theta = \|v\|_q^2. \quad (17)$$

That is, the optimal solution $w^*$ satisfies the constraints of the problem (3) with equality. In this case, the solution can be obtained by maximizing the Lagrange dual $L^*$, which is defined as $L^*(\alpha, \beta) = \min_{w, b} L(w, b, \alpha, \beta)$. Further, with some calculations, $L^*$ is computed as

$$L^*(\alpha, \beta) = -\frac{1}{2}\|\alpha z + \beta\theta\|_p^2 + 2\alpha + \beta\|v\|_q^2.$$

Note that the partial derivatives of $L^*$ are

$$\frac{\partial L^*}{\partial \alpha} = -f^{-1}(\alpha z + \beta\theta) \cdot z + 2, \qquad \frac{\partial L^*}{\partial \beta} = -f^{-1}(\alpha z + \beta\theta) \cdot \theta + \|v\|_q^2.$$

Since $L^*$ is concave, the equations (17) are satisfied if and only if $L^*$ is maximized.
So, given an initial assignment $(\alpha_0, \beta_0)$, we can approximate $(\alpha, \beta)$ by repeating the Newton update

$$\begin{pmatrix} \alpha_{k+1} \\ \beta_{k+1} \end{pmatrix} = \begin{pmatrix} \alpha_k \\ \beta_k \end{pmatrix} - \nabla^2 L^*(\alpha_k, \beta_k)^{-1} \nabla L^*(\alpha_k, \beta_k) \quad (18)$$

for sufficiently many steps, where

$$\frac{\partial^2 L^*}{\partial \alpha^2} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i z_i^2, \quad \frac{\partial^2 L^*}{\partial \alpha \partial \beta} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i z_i \theta_i, \quad \frac{\partial^2 L^*}{\partial \beta^2} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i \theta_i^2,$$

and

$$f^{-1\prime}(\theta)_i = -(p-2)\frac{|\theta_i|^{2(p-1)}}{\|\theta\|_p^{2p-2}} + (p-1)\frac{|\theta_i|^{p-2}}{\|\theta\|_p^{p-2}}.$$

Figure 1: The description of PUMMA.

Further, by the KKT conditions, the parameters $\alpha$ and $\beta$ satisfy

$$\alpha(1 - w \cdot x^{pos} - b) = 0, \quad (7)$$
$$\alpha(1 + w \cdot x^{neg} + b) = 0, \quad (8)$$
$$1 - w \cdot x^{pos} - b \le 0, \quad (9)$$
$$1 + w \cdot x^{neg} + b \le 0, \quad (10)$$
$$\alpha \ge 0, \quad (11)$$
$$\beta(\|v\|_q^2 - w \cdot \theta) = 0, \quad (12)$$
$$\|v\|_q^2 - w \cdot \theta \le 0, \quad (13)$$
$$\beta \ge 0. \quad (14)$$

We show that $\alpha > 0$ by contradiction. Assuming that $\alpha = 0$, we have $w = f^{-1}(\beta\theta) = \beta v$. Then the conditions (12), (13) and (14) imply $\beta = 1$ and thus $w = v$. However, the conditions (9) or (10) cannot be satisfied for $w = v$, which is a contradiction. Now we consider two cases. (i) Suppose that $\beta = 0$. Then, since $\alpha > 0$ and the conditions (7) and (8) hold, the vector $w^*$ is given as

$$w^* = f^{-1}(\alpha z), \quad (15)$$

where $\alpha = 2/\|z\|_p^2$. (ii) Otherwise, i.e., if $\beta > 0$, by the conditions (7), (8), and (12),

$$w^* = f^{-1}(\alpha z + \beta\theta), \quad (16)$$

where $\alpha$ and $\beta$ satisfy the equations (17). In our implementation, we set the initial values as $\alpha_0 = 0$ and $\beta_0 = 1$. In particular, for $p = 2$, it holds that $f(x) = f^{-1}(x) = x$. So we have the following analytical solution for the equations (17):

$$\alpha = \frac{\|v\|^2(2 - v \cdot z)}{\|v\|^2\|z\|^2 - (v \cdot z)^2}, \qquad \beta = \frac{\|v\|^2\|z\|^2 - 2(v \cdot z)}{\|v\|^2\|z\|^2 - (v \cdot z)^2}. \quad (19)$$

Figure 2: Illustration of the implicit reduction which preserves the margin. Each pair of positive and negative examples in the original space (left) corresponds to a positive example in the new space (right).

p-norm margin hyperplane over $Z$. Then, we have

$$u' \cdot x^{pos} + \tilde{b} = u' \cdot \tilde{x}^{pos} + \tilde{b} + u' \cdot (x^{pos} - \tilde{x}^{pos}) = \frac{u' \cdot (\tilde{x}^{pos} - \tilde{x}^{neg})}{2} + u' \cdot (x^{pos} - \tilde{x}^{pos}) = 1 + u' \cdot (x^{pos} - \tilde{x}^{neg}) - u' \cdot (\tilde{x}^{pos} - \tilde{x}^{neg}) \ge 1 + 2 - 2 = 1.$$

Similarly, it holds for any $x^{neg} \in X^{neg}$ that $u' \cdot x^{neg} + \tilde{b} \le -1$. So we get $\|u\|_q \le \|u'\|_q$.
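For $p = 2$ the case analysis together with the analytical solution (19) gives a complete closed-form update. A minimal sketch of one such update (our own code; case (ii) assumes $v$ and $z$ are linearly independent):

```python
import numpy as np

def pumma2_update(v, x_pos, x_neg):
    """One PUMMA update for p = 2 (f is the identity), following cases (i)/(ii).

    v is the previous weight vector w_{t-1}; returns the new (w, b)."""
    z = x_pos - x_neg
    # Case (i): assume beta = 0, so w = alpha * z with alpha = 2 / ||z||^2.
    alpha = 2.0 / (z @ z)
    w = alpha * z
    # If case (i) violates the third constraint w.v >= ||v||^2,
    # fall back to case (ii) and solve the 2x2 linear system (17).
    if w @ v < v @ v - 1e-12:
        vv, zz, vz = v @ v, z @ z, v @ z
        det = vv * zz - vz ** 2            # Gram determinant of (v, z)
        alpha = vv * (2.0 - vz) / det
        beta = (vv * zz - 2.0 * vz) / det
        w = alpha * z + beta * v
    b = -(w @ x_pos + w @ x_neg) / 2.0     # bias from the two active constraints
    return w, b

# hypothetical one-step example
v = np.array([1.0, 0.0])
w, b = pumma2_update(v, np.array([1.0, 2.0]), np.array([1.0, -1.0]))
# both support constraints hold with equality: w.x_pos + b = 1, w.x_neg + b = -1
```

By Lemma 4, the two stored examples always end up exactly on the margin boundaries, which the sketch reproduces.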
Finally, since the function $\frac{1}{2}\|\cdot\|_q^2$ ($1 < q \le 2$) is strictly convex, the minimizer is unique. Therefore we obtain $u = u'$. This theorem ensures that finding the maximum margin hyperplane with bias can be reduced to finding one without bias over pairs of positive and negative instances. Observe that this reduction does not reduce the margin. PUMMA can be viewed as a "wrapper" algorithm of ROMMA equipped with this reduction. Given positive and negative instances $x^{pos}$ and $x^{neg}$, PUMMA constructs a positive instance $z = (x^{pos} - x^{neg})/2$ and trains ROMMA with $z$ for a trial. Then PUMMA receives a weight vector $w$ and sets the bias $b$ as $b = -(w \cdot (x^{pos} + x^{neg}))/2$. If PUMMA makes a mistake (or does not have enough margin) on a new instance, it updates $z$ and trains ROMMA again. It is possible to use any online learning algorithm that finds the maximum margin linear classifier without bias as a subroutine, provided it satisfies the following requirement: the algorithm must output a weight vector whose support vector is $z$. However, most known online algorithms that maximize the margin do not satisfy this requirement, and ROMMA seems to be the only one satisfying it so far.

3.3 Convergence proof

We prove an upper bound on the number of updates made by PUMMA. First of all, by the KKT conditions for equations (7) and (8), the following property holds:

Lemma 4 For $t \ge 1$, it holds that $w_t \cdot x_t^{pos} + b_t = 1$ and $w_t \cdot x_t^{neg} + b_t = -1$.

Then we prove that the optimal solution of the offline optimization problem (1) is a feasible solution of PUMMA's optimization problem (3).

Lemma 5 Let $(u, b) \in \mathbb{R}^n \times \mathbb{R}$ be a hyperplane such that $y_j(u \cdot x_j + b) \ge 1$ for $j = 1, \ldots, t$. Then it holds that $u \cdot f(w_t) \ge \|w_t\|_q^2$ and $\|u\|_q \ge \|w_t\|_q$.

Proof: For convenience of the proof, we denote $\theta_t = f(w_t)$. Without loss of generality, we can assume that an update is made at each trial $t \ge 1$. The proof of the first inequality is by induction on $t$.
For $t = 1$, the vector is written as $w_1 = f^{-1}(\theta_1)$, where $\theta_1 = \alpha_1(x_1^{pos} - x_1^{neg})$ for some $\alpha_1 \ge 0$. By the definition of $u$ and $b$, it holds that $u \cdot x_1^{pos} + b \ge 1$ and $u \cdot x_1^{neg} + b \le -1$, respectively. So we obtain

$$u \cdot \theta_1 = \alpha_1(u \cdot x_1^{pos} - u \cdot x_1^{neg}) \ge \alpha_1(1 - b + 1 + b) = 2\alpha_1.$$

As a summary, in order to obtain the solution $w^*$, we first assume the case (i) and check whether the condition $w^* \cdot \theta > \|v\|_q^2$ holds or not. If it does, the solution is given by (15). Otherwise, the case (ii) holds and the solution is (19) for $p = 2$, or we apply the Newton method for $p > 2$. In either case (i) or (ii), the bias $b^*$ is given as

$$b^* = -\frac{w^* \cdot x^{pos} + w^* \cdot x^{neg}}{2}. \quad (20)$$

3.2 Implicit reduction to learning classifiers without bias

We show an interpretation of PUMMA from the viewpoint of reduction. Let us fix $p = 2$. Then it is easily verified that the update of PUMMA is identical to that of ROMMA for the instance $z = (x_t^{pos} - x_t^{neg})/2$ whose label is positive. This observation implies a reduction from learning linear classifiers with bias to learning of those without bias. Let $X = X^{pos} \cup X^{neg}$ be a subset of $\mathbb{R}^n$, where $X^{pos}$ and $X^{neg}$ are sets of positive and negative instances and $X^{pos} \cap X^{neg} = \emptyset$. Assume that there exists $(u, b)$ such that $u \cdot x^{pos} + b \ge 1$ for each $x^{pos} \in X^{pos}$, and $u \cdot x^{neg} + b \le -1$ for each $x^{neg} \in X^{neg}$. Then we consider the set

$$Z = \left\{ \frac{x^{pos} - x^{neg}}{2} \;\middle|\; x^{pos} \in X^{pos},\; x^{neg} \in X^{neg} \right\}.$$

That is, from a set of positive and negative instances, we define a set of positive instances. Then the following property holds for $Z$.

Theorem 2 Fix any $p$ satisfying $2 \le p < \infty$. Let $(u, b)$ be the maximum p-norm margin hyperplane over $X$. Then $u$ is the maximum p-norm margin hyperplane over $Z$ as well. Also, the opposite holds for some $b$.

Proof: Let $u'$ be the maximum p-norm margin hyperplane over $Z$. Note that $u' \cdot z \ge 1$ for each $z \in Z$ (see Figure 2). So we have $\|u'\|_q^2 \le \|u\|_q^2$ for $q$ s.t. $1/p + 1/q = 1$. Now let $\tilde{b} = -u' \cdot (\tilde{x}^{pos} + \tilde{x}^{neg})/2$, where $\tilde{x}^{pos} \in X^{pos}$ and $\tilde{x}^{neg} \in X^{neg}$ satisfy $u' \cdot (\tilde{x}^{pos} - \tilde{x}^{neg}) = 2$.
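The construction of $Z$ can be illustrated concretely (a toy sketch of ours): a separator with bias for $X$ becomes a homogeneous separator for $Z$, with no bias term needed.

```python
import numpy as np
from itertools import product

def reduce_to_unbiased(X_pos, X_neg):
    """Margin-preserving reduction: one positive instance per pos/neg pair."""
    return [(xp - xn) / 2.0 for xp, xn in product(X_pos, X_neg)]

X_pos = [np.array([1.0, 1.5]), np.array([2.0, 1.0])]
X_neg = [np.array([-1.0, -1.0]), np.array([-0.5, -2.0])]
Z = reduce_to_unbiased(X_pos, X_neg)

# a separator with bias for X: u.x + b >= 1 on X_pos, <= -1 on X_neg
u, b = np.array([1.0, 1.0]), -0.5
assert all(u @ xp + b >= 1 for xp in X_pos)
assert all(u @ xn + b <= -1 for xn in X_neg)
# the same u separates Z homogeneously with margin >= 1 (no bias needed)
assert all(u @ z >= 1 for z in Z)
# a bias can then be recovered from a support pair, as in (20)
```

Note that $|Z| = |X^{pos}| \cdot |X^{neg}|$; PUMMA never materializes all pairs, it only keeps the single most recent pair that incurred an update.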
Note that such ~ ~ a pair (xpos , xneg ) always exists since u is the maximum On the other hand, by Lemma 4, we have w1 q = w1 · 1 = w 1 · 2 (xpos 1 - xneg ) 1 = 2, which shows u · 1 w1 2 . q Suppose that for t < t , the statement is true. Then, there are two cases: (i) w t · t -1 = wt -1 2 , and w t = q f -1 ( t ), where t = (xpos - xneg ) + t -1 for some t t and , or (ii)w t · t -1 > wt -1 2 , and w t = f -1 (t ), q where t = (xpos - xneg ). For the case (ii), the proof t t follows the same argument for t = 1, so we only consider the case (i). By the inductive assumption, we have u · t By Lemma 4, wt 2= = Theorem 3 Suppose that for a sequence S = ((x 1 , y1 ), . . . , (w T , yT )), there exists a hyperplane (u, b) R n × R such that yt (u · xt + b) 1 for t = 1, . . . , T and the hyperplane (u, b) has p-norm margin over S . Further, let R = maxt=1,...,T xt p. (i) Then the number of updates made by PUMMAp ( ) is at most . ( p - 1)R2 u 2 q O 2 (ii) PUMMAp ( ) outputs a hypothesis with p-norm margin at least (1 - ) after at most the updates above. Proof: As in Lemma 5, without loss of generality, we assume that PUMMA updates for t = 1, . . . , M (M T ). By Lemma 5, we have w t q u q for t 1. Further, by Lemma 6, it holds that after M updates u 2 q (u · xpos - u · xneg ) + u · t -1 t t 2 + w t -1 2 q = wt wt · · q t (xpos t - xneg ) t + wt t -1 wT 2 q 2u 2 2M , 2(p - 1)R2 2u 2 = 2 + w t -1 2 . q So, we get u · t wt 2 and thus we prove the first q inequality. The second inequality holds immediately since both (u, b) and (w t , bt ) satisfy the same constraints in (3) and (w t , bt ) minimizes the norm by definition. Next, prove the following lemma: Lemma 6 For each trial t 1 in which an update is incurred, 2 2 2 , wt+1 q - wt q 2(p - 1)R2 where R = maxj =1,...,t xj p. Proof: By the weak duality theorem (see, e.g.,[6]), the optimum of the problem (3) is bounded below by the Lagrangian dual L (, ) in (18) for any 0 and 0. 
Therefore, using the notations in the derivation of the update,

$$\frac{1}{2}\|w\|_q^2 - \frac{1}{2}\|v\|_q^2 \ge L^*(\alpha, \beta) - \frac{1}{2}\|v\|_q^2.$$

So, by using Proposition 1 and letting $\beta = 1$, we have

$$L^*(\alpha, 1) - \frac{1}{2}\|v\|_q^2 = -G(\theta + \alpha z) + G(\theta) + 2\alpha \ge -g(\theta) \cdot \alpha z - \frac{(p-1)\alpha^2\|z\|_p^2}{2} + 2\alpha = -\alpha\, v \cdot z - \frac{(p-1)\alpha^2\|z\|_p^2}{2} + 2\alpha.$$

The right-hand side of the inequality above is maximized if

$$\alpha = \frac{2 - v \cdot z}{(p-1)\|z\|_p^2}. \quad (21)$$

Note that $\alpha$ is positive, since $v \cdot z \le 2 - \delta$. Substituting (21),

$$L^*(\alpha, 1) - \frac{1}{2}\|v\|_q^2 \ge \frac{(2 - v \cdot z)^2}{2(p-1)\|z\|_p^2} \ge \frac{\delta^2}{2(p-1)R^2}.$$

Now we are ready to complete the proof of the main result: the bound above implies

$$M \le \frac{2(p-1)R^2\|u\|_q^2}{\delta^2}.$$

Further, after at most this many updates, we have $y_t(w_t \cdot x_t + b_t) \ge 1 - \delta$ for every $t \le T$. Then the achieved margin is at least

$$\frac{1-\delta}{\|w_T\|_q} \ge \frac{1-\delta}{\|u\|_q} = (1-\delta)\gamma.$$

Since it holds that $\|x\|_p \le \|x\|_1$ for $p \ge 1$ and $\|x\|_\infty \le \|x\|_p \le n^{1/p}\|x\|_\infty$, we obtain the following corollary (a similar result was shown in [13]).

Corollary 4 Assume that for a sequence $S = ((x_1, y_1), \ldots, (x_T, y_T))$, there exists a hyperplane $(u, b) \in \mathbb{R}^n \times \mathbb{R}$ such that $y_t(u \cdot x_t + b) \ge 1$ for $t = 1, \ldots, T$, and the hyperplane $(u, b)$ has $\infty$-norm margin $\gamma$ over $S$. Further, let $R = \max_{t=1,\ldots,T} \|x_t\|_\infty$. Then, by setting $p = c \ln n$ ($c > 0$), (i) the number of updates made by PUMMA$_p(\delta)$ is at most

$$O\left(\frac{R^2\|u\|_1^2 \ln n}{\delta^2}\right);$$

(ii) PUMMA$_p(\delta)$ outputs a hypothesis with $\infty$-norm margin at least $\frac{(1-\delta)\gamma}{e^{1/c}}$ after at most the number of updates above.

4 Kernel and Soft Margin Extensions

4.1 Kernel Extension

As with the SVM, ROMMA and other Perceptron-like online algorithms, PUMMA can use kernel functions for $p = 2$. Note that, at trial $t$, the weight vector $w_t$ is written as

$$w_t = \sum_{j=1}^{t} \alpha_j \left(\prod_{\ell=j+1}^{t} \beta_\ell\right) z_j,$$

thus an inner product $w_t \cdot x_t$ is given as a weighted sum of inner products $x_j \cdot x_{j'}$ between instances, since $z_j = x_j^{pos} - x_j^{neg}$. Therefore, we can apply kernel methods by replacing each inner product $x_j \cdot x_{j'}$ with $K(x_j, x_{j'})$ for some kernel $K$. More practically, we can compute the inner products between $w_t$ and a mapped instance using the recurrence $w_t = \alpha_t(x_t^{pos} - x_t^{neg}) + \beta_t w_{t-1}$.
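The recurrence means $w_t$ never needs an explicit representation: it suffices to keep dual coefficients and evaluate kernels. A minimal sketch (class and method names are ours, not the paper's):

```python
import numpy as np

def rbf(x1, x2, gamma=1.0):
    """Any kernel works here; an RBF kernel as a stand-in."""
    d = x1 - x2
    return np.exp(-gamma * (d @ d))

class KernelWeight:
    """Implicit w_t maintained via w_t = alpha_t * (x_pos - x_neg) + beta_t * w_{t-1}."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.coef = []      # one (c_pos, c_neg, x_pos, x_neg) entry per update

    def update(self, alpha, beta, x_pos, x_neg):
        # scale all past coefficients by beta, then append the new pair
        self.coef = [(beta * cp, beta * cn, xp, xn) for cp, cn, xp, xn in self.coef]
        self.coef.append((alpha, -alpha, x_pos, x_neg))

    def dot(self, x):
        """w_t . phi(x) as a weighted sum of kernel evaluations."""
        return sum(cp * self.kernel(xp, x) + cn * self.kernel(xn, x)
                   for cp, cn, xp, xn in self.coef)

# sanity check with the linear kernel: implicit w matches the explicit one
w = KernelWeight(lambda a, b: a @ b)
explicit = np.zeros(2)
for alpha, beta, xp, xn in [(0.5, 1.0, np.array([1.0, 0.0]), np.array([-1.0, 0.0])),
                            (0.2, 0.8, np.array([0.0, 1.0]), np.array([0.0, -1.0]))]:
    w.update(alpha, beta, xp, xn)
    explicit = alpha * (xp - xn) + beta * explicit
x = np.array([0.3, -0.7])
assert abs(w.dot(x) - explicit @ x) < 1e-12
```

Swapping the lambda for `rbf` gives the kernelized algorithm with no other change, since `dot` only ever calls the kernel.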
4.2 2-norm Soft Margin Extension

In order to apply PUMMA to linearly inseparable data, as in [21, 22], we employ the 2-norm soft margin minimization [9, 10], which is formulated as follows: given a sequence $S = ((x_1, y_1), \ldots, (x_T, y_T))$, and letting $S$ also denote the set of examples in $S$,

$$\min_{w, b, \xi} \frac{1}{2}\|w\|_q^2 + \frac{C}{2}\sum_{(x,y) \in S} \xi_x^2, \quad (22)$$
$$\text{subject to: } y(w \cdot x + b) \ge 1 - \xi_x \quad ((x, y) \in S),$$

where the constant $C > 0$ is given as a parameter. Here, we implicitly assume that labels are consistent, i.e., if $x_t = x_{t'}$ then $y_t = y_{t'}$, so we drop $y$ from the subscript of $\xi$. For $p = 2$, it is well known that this formulation is equivalent to the 2-norm minimization problem over linearly separable examples in an augmented space:

$$\min_{\tilde{w}, \tilde{b}} \frac{1}{2}\|\tilde{w}\|_2^2, \quad \text{subject to: } y(\tilde{w} \cdot \tilde{x} + \tilde{b}) \ge 1 \quad ((x, y) \in S),$$

where $\tilde{w} = (w, \sqrt{C}\xi)$, $\tilde{x} = (x, \frac{y}{\sqrt{C}}e_x)$ for each $(x, y) \in S$, and each $e_x$ is a unit vector in $\mathbb{R}^{|S|}$ whose element corresponding to $x$ is 1 and whose other elements are 0. To use a kernel function $K$ with this soft margin formulation, we just modify $K$ as follows:

$$\tilde{K}(x_i, x_j) = K(x_i, x_j) + \frac{\delta_{ij}}{C}. \quad (23)$$

Solution. The Lagrangian function is given as

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|_q^2 + \frac{C}{2}\sum_{(x,y) \in M_t} \xi_x^2 + \sum_{\ell \in \{pos, neg\}} \alpha_\ell\{1 - \xi_\ell - y^\ell(w \cdot x^\ell + b)\} + \beta\Big(\|w_{t-1}\|_q^2 + C\sum_{x \in M_{t-1}} \xi_{t-1,x}^2 - w \cdot f(w_{t-1}) - C\sum_{x \in M_{t-1}} \xi_x \xi_{t-1,x}\Big).$$

To simplify the description, we assume without loss of generality that $x_t$ is a positive instance; the solution is the same when $x_t$ is a negative one. As in the separable case, by using the KKT conditions, we consider the following two cases. (i) Suppose that $\beta = 0$ and $\alpha > 0$. Then the optimal solution $(w^*, b^*, \xi^*)$ is given as

$$w^* = f^{-1}(\alpha z), \quad \alpha = \frac{2}{\frac{2}{C} + \|z\|_p^2}, \quad \xi_t^{pos} = \xi_t^{neg} = \frac{\alpha}{C}, \quad \xi_x = 0 \ (x \in M_t \setminus \{x_t^{pos}, x_t^{neg}\}),$$

where $z = x_t^{pos} - x_t^{neg}$. (ii) Otherwise, $\alpha > 0$ and $\beta > 0$. Let $\theta = f(w_{t-1})$.
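The kernel modification (23) is a one-line change on the Gram matrix; a minimal sketch (the helper name is ours):

```python
import numpy as np

def soft_margin_gram(K, C):
    """2-norm soft margin as a kernel change: K~(x_i, x_j) = K(x_i, x_j) + delta_ij / C.

    K is the Gram matrix of the original kernel on the training set."""
    return K + np.eye(K.shape[0]) / C

# toy Gram matrix for a linear kernel on three points
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = X @ X.T
K_soft = soft_margin_gram(K, C=10.0)
# only the diagonal changes: each example gets its own slack dimension
assert np.allclose(K_soft - K, np.eye(3) / 10.0)
```

Smaller `C` adds a larger diagonal, i.e., more slack per example; as `C` grows, the hard-margin kernel is recovered.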
Then we have $w^* = f^{-1}(\alpha z + \beta\theta)$,

$$\xi_t^{pos} = \begin{cases} \frac{\alpha}{C}, & \text{if } (x_t, y_t) \notin M_t, \\ \frac{\alpha}{C} + \beta\xi_{t-1, x_t}, & \text{if } (x_t, y_t) \in M_t, \end{cases} \qquad \xi_t^{neg} = \frac{\alpha}{C} + \beta\xi_{t-1}^{neg}, \qquad \xi_x = \beta\xi_{t-1,x} \ (x \in M_t \setminus \{x_t^{pos}, x_t^{neg}\}),$$

where $\alpha$ and $\beta$ are the maximizers of the Lagrange dual

$$L^*(\alpha, \beta) = \min_{w, b, \xi} L(w, b, \xi, \alpha, \beta) = -\frac{1}{2}\|\alpha z + \beta\theta\|_p^2 + 2\alpha + \beta\Big(\|w_{t-1}\|_q^2 + C\sum_{x \in M_{t-1}}\xi_{t-1,x}^2\Big) - \frac{\alpha^2}{2C} - \frac{C}{2}\Big(\frac{\alpha}{C} + \beta\xi_{t-1}^{neg}\Big)^2 - \frac{C\beta^2}{2}\sum_{x \in M_{t-1} \setminus \{x_t^{neg}\}}\xi_{t-1,x}^2,$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise (cf. (23)). For $p > 2$, we modify PUMMA so that, given $S = ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}))$ and an instance $x_t$, it predicts $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t + b_t)$, where $(w_t, b_t, \xi_t)$ is specified as follows:

$$(w_t, b_t, \xi_t) = \arg\min_{w, b, \xi} \frac{1}{2}\|w\|_q^2 + \frac{C}{2}\sum_{(x,y) \in M_t} \xi_x^2, \quad (24)$$
$$\text{subject to: } w \cdot x_t^{pos} + b \ge 1 - \xi_t^{pos}, \quad (25)$$
$$w \cdot x_t^{neg} + b \le -1 + \xi_t^{neg},$$
$$w \cdot f(w_{t-1}) + C\sum_{(x,y) \in M_{t-1}} \xi_x \xi_{t-1,x} \ge \|w_{t-1}\|_q^2 + C\sum_{(x,y) \in M_{t-1}} \xi_{t-1,x}^2,$$

where $M_t$ denotes the set of examples in $S$ which have incurred updates of PUMMA in the first $t - 1$ trials, $\xi_t^{pos} = \xi_{x_t^{pos}}$ and $\xi_t^{neg} = \xi_{x_t^{neg}}$. Then the modified PUMMA updates $x_{t+1}^{pos}$ or $x_{t+1}^{neg}$ if $y_{t+1}(w_{t+1} \cdot x_{t+1} + b_{t+1}) < 1 - \delta - \xi_{x_{t+1}}$, where $\xi_{x_{t+1}} = \xi_{x_t}$ if $x_{t+1} = x_t$ for some $(x_t, y_t) \in M_t$, and $\xi_{x_{t+1}} = 0$ otherwise. Again, we can approximate $(\alpha, \beta)$ by repeating the Newton update

$$\begin{pmatrix} \alpha_{k+1} \\ \beta_{k+1} \end{pmatrix} = \begin{pmatrix} \alpha_k \\ \beta_k \end{pmatrix} - \nabla^2 L^*(\alpha_k, \beta_k)^{-1} \nabla L^*(\alpha_k, \beta_k)$$

for sufficiently many steps, where

$$\frac{\partial^2 L^*}{\partial \alpha^2} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i z_i^2 - \frac{2}{C}, \qquad \frac{\partial^2 L^*}{\partial \alpha \partial \beta} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i z_i \theta_i - \xi_{t-1}^{neg}, \qquad \frac{\partial^2 L^*}{\partial \beta^2} = -\sum_i f^{-1\prime}(\alpha z + \beta\theta)_i \theta_i^2 - C\sum_{x \in M_{t-1}} \xi_{t-1,x}^2.$$

As in the case without soft margin, in order to obtain the solution $w^*$ and $\xi^*$, we first assume the case (i) and check whether the third constraint of the problem (24) holds with strict inequality or not. If it does, then the case (i) is true. Otherwise, the case (ii) holds. Finally, the bias $b^*$ is given as

$$b^* = -\frac{w^* \cdot x_t^{pos} + w^* \cdot x_t^{neg} + (\xi_t^{pos} - \xi_t^{neg})}{2}.$$
By the same argument as in Section 3, we obtain the following:

Theorem 5 For a sequence $S = ((x_1, y_1), \ldots, (x_T, y_T))$, let $(u, b, \xi) \in \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^{|S|}$ be the optimal solution of the problem (22). Further, let $R = \max_{t=1,\ldots,T} \|x_t\|_p$. (i) Then the number of updates made by PUMMA$_p(\delta)$ is at most

$$O\left(\frac{\left((p-1)R^2 + \frac{2}{C}\right)\left(\|u\|_q^2 + C\sum_{(x,y) \in S} \xi_x^2\right)}{\delta^2}\right).$$

(ii) PUMMA$_p(\delta)$ outputs a hypothesis whose objective value for the problem (22) is at most $\frac{1}{(1-\delta)^2}$ times the optimum, after at most the number of updates above.

Figure 3: Number of updates and margin over the artificial data set in the case $p = 2$ (upper) and $p = 2\ln n$ (lower). The x-axes are log scale, since the numbers of updates of ALMA are much larger than PUMMA's. The result for the case $p = 2$ and $b = 9$ is omitted to keep the figure easy to view. The parenthesized digits denote the value of the bias.

5 Experiments

5.1 Experiments over artificial datasets

We examine PUMMA, ALMA and ROMMA over artificial datasets generated by sparse linear classifiers. Each artificial dataset consists of $n$-dimensional $\{-1, +1\}$-valued vectors with $n = 100$. Each vector is labeled with an r-of-k threshold function $f$, which is represented as $f(x) = \mathrm{sign}(x_{i_1} + \cdots + x_{i_k} + k - 2r + 1)$ for some $i_1, \ldots, i_k$ s.t. $1 \le i_1 < i_2 < \cdots < i_k \le n$; it outputs $+1$ if at least $r$ of the $k$ relevant features have value $+1$, and outputs $-1$ otherwise. For $k = 16$ and $r \in \{1, 4, 8\}$ (equivalently, bias $b \in \{15, 9, 1\}$, respectively), we generate 1000 random examples labeled by the r-of-k threshold function, so that positive and negative examples are equally likely. For ALMA and ROMMA, we add an extra dimension with value $-R$ to each vector to learn linear classifiers with bias, where $R = \max_t \|x_t\|_p$. Note that one can choose values other than $-R$, say, 1.
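The artificial data generation can be sketched as follows (our own code; the paper additionally balances positive and negative examples, which this sketch omits):

```python
import numpy as np

def r_of_k_label(x, relevant, r):
    """+1 iff at least r of the k relevant features are +1 (r-of-k threshold)."""
    k = len(relevant)
    return 1 if np.sign(x[relevant].sum() + k - 2 * r + 1) > 0 else -1

def make_dataset(n=100, k=16, r=4, m=1000, seed=0):
    """Random {-1,+1}^n vectors labeled by a random r-of-k threshold function."""
    rng = np.random.default_rng(seed)
    relevant = rng.choice(n, size=k, replace=False)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.array([r_of_k_label(x, relevant, r) for x in X])
    return X, y, relevant

X, y, rel = make_dataset()
# the data are separable by a sparse hyperplane with bias b = k - 2r + 1
k, r = 16, 4
u = np.zeros(100); u[rel] = 1.0
b = k - 2 * r + 1
assert all(yt * (u @ xt + b) >= 1 for xt, yt in zip(X, y))
```

The final assertion shows why these targets are good stress tests for bias handling: for small `r` the separating hyperplane's bias dominates the weight vector.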
However, as remarked in [10], such a choice for the value in the extra dimension increases the number of iterations by a factor of $O(R^2)$ when the underlying hyperplane has large bias, so our choice seems fair. We set parameters so that each algorithm is guaranteed to achieve at least 0.9 times the maximum p-norm margin. That is, we set $\alpha = 0.1$ for ALMA (note that the parameter $\alpha$ is defined differently in [13]) and $\delta = 0.1$ for ROMMA and PUMMA. We examine $p \in \{2, 2\ln n\}$. We train each algorithm until its hypothesis converges by running it in epochs, where in one epoch we make each algorithm go through the whole training data once. At the end of each epoch, for each algorithm, we record the number of updates and the margin incurred during training, as well as the real computation time. Note that we measure the margin of each hypothesis over the original space. We execute these operations 10 times, changing the randomly generated data, and average the results over the 10 executions. The experiments are conducted on a 3.8 GHz Intel Xeon processor with 8 GB RAM running Linux. We use MATLAB for the experiments. The results are presented in Figures 3 and 4. We observe that PUMMA converges faster. PUMMA's computation time is much shorter than that of ALMA, although it uses the Newton method in each update. Note that we omit the result of ALMA in the case $p = 2$, since it is worse than the others. For $p = 2$, we do not use the Newton method in the execution of PUMMA, because we have the analytical solution for the optimal values of $\alpha$ and $\beta$ by solving the optimization problem directly.

Table 1: Computation time (sec.) and obtained margin (margin values multiplied by $10^2$) on some UCI datasets.

dataset      MICRA sec. / margin    SVMlight sec. / margin
ionosphere      0.48 / 10.04           0.06 / 10.55
house-votes     0.09 / 16.51           0.03 / 17.42
adult-1k        2.34 /  4.03           0.47 /  4.95
adult-2k        5.61 /  2.81           2.13 /  3.40
adult-4k       55.91 /  2.00           9.33 /  2.40
adult-8k      189.13 /  1.46         232.42 /  1.69
adult-16k    2050.84 /  1.13        1271.06 /  1.20
adult-full  12394.86 /  0.79        5893.20 /  0.83
102 0.54 10.49 0.26 17.31 5.40 4.50 25.38 3.37 159.54 2.38 807.46 1.67 3365.47 1.18 44480.59 0.82 ROMMA s ec. 102 3.12 10.50 0.62 17.36 15.83 4.91 82.70 3.38 496.52 2.38 2167.40 1.67 12503.62 1.18 71296.34 0.82 5.2 Experiments over some UCI datasets We compare PUMMA with some other learning algorithms over the real datasets. The algorithms are SVM light [16], MICRA [35], and ROMMA [22]. We used the following datasets of UCI Machine Learning Repository [2]. (i) The ionosphere dataset consists of 351 instances which have 34 continuous attributes. (ii) The house-vote dataset consists of 435 instances which have 16 discrete attributes {y , n, ?}. We change these attributes to {1, -1, 0}. (iii) The adult dataset consists of 32561 instances which have 14 attributes. Among the attributes, 6 of them are discrete and the others are continuous. We change this 14 attributes to 123 binary attributes as Platt did in [29]. The name of dataset 'adult-mk' in Table 1 denotes a subset of the adult dataset which contains 1000 × m instances. Note that all the datasets have binary class and we change the range of labels with {1, -1}. To optimize the 2-norm soft margin for this linearly inseparable dataset, we use the following modified inner product ij I P (xi , xj ) = xi · xj + . C We added a dimension which denotes the bias as in Section 1 when we run MICRA and ROMMA which can't deal with bias directly. We modify SVMlight so as not to optimize 1-norm soft margin, and we change the inner product so that it optimizes 2-norm soft margin. We set = 0.01 for PUMMA and ROMMA to achieve 99% of the maximum margin. The parameters of MICRA are changed for each dataset as in [35]. But, parameters might not be completely the same as them because some datasets are different from those they used. Finally we set 2-norm soft margin parameter C = 1 for all algorithms. In order to converge faster, we use the following heuristics for each online algorithm. 
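The 2-norm soft margin modification above amounts to adding 1/C to the diagonal of the Gram matrix. A minimal sketch (the helper name and NumPy usage are ours, not the paper's implementation):

```python
import numpy as np

def soft_margin_gram(X, C):
    """Gram matrix under the 2-norm soft margin:
    IP(x_i, x_j) = x_i . x_j + delta_ij / C."""
    X = np.asarray(X, dtype=float)
    K = X @ X.T                            # ordinary inner products
    K[np.diag_indices_from(K)] += 1.0 / C  # delta_ij / C on the diagonal
    return K
```

Training a hard-margin classifier with this modified inner product then optimizes the 2-norm soft margin objective on the original data.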
Active Set  We improve the order in which examples are fed to each online algorithm. First, we give all the examples to the algorithm once. Then we build a new dataset, called the "active set", containing the examples that caused updates. After that, we feed each example in the active set to the algorithm; if an example does not cause an update, we remove it from the active set. We repeat this procedure until the active set becomes empty. Finally, we give all the examples again and check whether the algorithm makes any update; if some updates occur, we construct an active set again and repeat the whole procedure.

We run each algorithm and measure its real computation time as well as its obtained margin. The experiments on real datasets are conducted on a 3.0 GHz Intel Xeon processor with 16 GB RAM running Linux. We implemented each algorithm in C. Table 1 shows the real computation time and obtained margin. As can be seen, PUMMA converges considerably faster than ROMMA. On the other hand, PUMMA converges slower than MICRA; however, the parameters of MICRA are quite sensitive to the datasets and nontrivial to tune appropriately. On all the real datasets, SVMlight is the fastest, whereas MICRA is reported to be faster than SVMlight over some datasets with tuned parameters [35]. Note that this difference might be due to our selection of active sets, which is different from theirs.

5.3 Experiments over MNIST dataset

Next, we compare these algorithms over the MNIST dataset. Since the dataset is not linearly separable, we use a polynomial kernel with 2-norm soft margin as follows:

K(x_i, x_j) = (1 + x_i · x_j / s)^d + δ_{ij}/C.

Since computing kernels is time-consuming, we use some extra heuristics in addition to our active set selection.

Kernel Cache  Since we have to compute kernel values of the same examples repeatedly, we memorize them in a cache matrix.
In the cache matrix, each row memorizes the kernel values between one support vector and all the examples, where a support vector is an instance that has caused an update. The length of each row equals the number of training instances, and the number of rows depends on the memory size. When a new kernel value between a support vector and an instance is required, we look it up in the cache matrix: we search for the row of the corresponding support vector and read the cached value if it is present; otherwise we compute the value and store it, creating a new row if necessary. If the matrix is full, we replace the least referenced row with the new row.

Table 2: Computation time (sec.) and obtained margin (denoted γ) on the MNIST datasets.

class   ROMMA sec. / γ    SVMlight sec. / γ   SVMlight w/o bias sec. / γ   PUMMA sec. / γ
0       218.97 / 1.150    256.51 / 1.339      164.83 / 1.155               373.48 / 1.330
1       231.82 / 0.708    152.10 / 0.712      119.62 / 0.712               291.54 / 0.706
2       870.58 / 0.761    413.43 / 0.810      309.77 / 0.765               1674.08 / 0.804
3       2296.34 / 0.719   566.84 / 0.763      384.17 / 0.722               2654.19 / 0.757
4       505.11 / 0.626    333.04 / 0.650      267.65 / 0.629               905.16 / 0.645
5       1007.91 / 0.661   428.36 / 0.672      301.99 / 0.664               1480.44 / 0.667
6       308.18 / 0.876    246.47 / 0.941      184.39 / 0.880               534.80 / 0.934
7       584.36 / 0.608    322.90 / 0.621      304.54 / 0.611               860.89 / 0.616
8       5074.51 / 0.723   694.17 / 0.810      437.48 / 0.727               5648.12 / 0.804
9       6057.92 / 0.538   599.78 / 0.558      399.08 / 0.541               5290.33 / 0.554
avg.    1715.57 / 0.737   401.36 / 0.788      287.35 / 0.741               1971.30 / 0.782

Inner Product Cache  In our experiments, we keep giving examples to each online learning algorithm until it makes no update on any example. Suppose that at trials t = t_1 and t = t_2 (t_1 < t_2), the weight vector w_t is updated by the same example x_{t_1}. Writing c_k and η_k for the scaling and update coefficients of the k-th update and z_j for the example used at the j-th update, the weight vector w_{t_2} can be written as

w_{t_2} = Σ_{j=1}^{t_2−1} ( Π_{k=j+1}^{t_2−1} c_k ) η_j z_j = ( Π_{k=t_1}^{t_2−1} c_k ) w_{t_1} + Σ_{j=t_1}^{t_2−1} ( Π_{k=j+1}^{t_2−1} c_k ) η_j z_j.

So, if we memorize the inner product w_{t_1} · x_{t_1}, we can compute w_{t_2} · x_{t_1} more cheaply. This technique is efficient when we use kernels.
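The inner product cache can be sketched as follows, assuming updates of the generic form w ← c·w + η·z (the actual coefficients in PUMMA come from its optimization step, and the class name is ours):

```python
import numpy as np

class InnerProductCache:
    """Maintain w . x for a fixed example x across updates of the
    assumed form w <- c * w + eta * z, without recomputing w . x
    from scratch (which, with kernels, would need one kernel
    evaluation per support vector)."""

    def __init__(self, x):
        self.x = np.asarray(x, dtype=float)
        self.wx = 0.0  # cached value of w . x, starting from w = 0

    def update(self, c, eta, z):
        # After w <- c*w + eta*z, linearity gives
        # w . x <- c*(w . x) + eta*(z . x):
        # only one fresh inner product (z . x) is needed.
        self.wx = c * self.wx + eta * float(np.dot(z, self.x))
        return self.wx
```

With kernels, `np.dot(z, self.x)` is replaced by a single kernel evaluation K(z, x), so each update touches one kernel value instead of the whole expansion of w.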
Halving  It is reported that ROMMA converges faster when δ is decreased dynamically [22]. Similarly to their approach, we shrink the parameter δ by halving it repeatedly. More precisely, we set δ = 1 at first, and halve δ whenever the algorithm makes no update over all the examples. We repeat this procedure until δ is as small as we require; if δ becomes smaller than the required value δ_target, we set δ = δ_target. When we use kernels, this halving heuristic reduces the number of support vectors in the early stage of learning, which contributes to faster convergence.

The MNIST dataset contains 60,000 images with labels. Each image is a 28 × 28 matrix representing a handwritten digit; the value of each element lies in {0, ..., 255} and denotes the density. Each label takes a value in {0, ..., 9}, so the MNIST dataset has 10 classes. Since each algorithm can deal only with binary classes, we relabel the data so that one class is positive and the others are negative, which yields 10 binary-labeled datasets. We run the three learning algorithms, SVMlight, ROMMA and PUMMA, on these datasets until they converge. We omit the evaluation of MICRA since it needs careful tuning of parameters to converge fast. We record the real computation time and margin. Note that we use our heuristics for ROMMA and PUMMA, and we set the kernel parameters s = 1100², d = 5 and C = 1/30 as in [22]. We set δ_target = 0.01 and use a 1 GB kernel cache. We also run SVMlight with the same amount of cache memory, but its caching strategy is different from ours. The experiments on the MNIST dataset are conducted on the same machine as the experiments on the UCI datasets. The results are shown in Table 2. PUMMA attains a higher margin than ROMMA over almost all of the datasets. On the other hand, PUMMA requires more computation time. This seems to be due to the fact that ROMMA solves a different optimization problem, i.e., maximization of the margin without bias. We observe the same tendency between SVMlight with and without bias.
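The halving heuristic described earlier can be sketched as a driver loop (hypothetical: `run_epoch` stands in for one full pass of the online algorithm over the data, and is assumed to return the number of updates it made):

```python
def train_with_halving(run_epoch, delta_target):
    """Start with a loose accuracy parameter delta = 1 and halve it
    each time a full pass makes no update, stopping once delta has
    reached delta_target and a pass makes no update."""
    delta = 1.0
    while True:
        updates = run_epoch(delta)
        if updates == 0:
            if delta <= delta_target:
                return delta
            # tighten the accuracy requirement, but never below target
            delta = max(delta / 2.0, delta_target)
```

Running the early passes with a loose δ keeps the hypothesis sparse (few support vectors), so kernel evaluations stay cheap until the final, tight phase.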
Further, the computation times of PUMMA are worse than those of SVMlight. However, PUMMA and ROMMA might be improved if we employ a different strategy for active set selection.

[Figure 4: Computation time over the artificial datasets in the case p = 2 (upper) and p = 2 ln n (lower), for bias values 15, 9 and 1.]

6 Conclusion and Future work

In this paper, we propose PUMMA, which approximately obtains the maximum p-norm margin classifier with bias. Our algorithm often runs faster than previous online learning algorithms when the underlying linear classifier has large bias, by taking advantage of finding the bias directly. Although the worst-case upper bound on the number of iterations of our algorithm is the same as those of previous algorithms, our experiments over artificial datasets suggest that our iteration bound might be better. For example, when the target function is an r-of-k threshold function, the iteration bound of PUMMA is O(k² ln n) with p = O(ln n). However, in our experiments, PUMMA seems to converge in O(rk ln n) iterations, which is the best upper bound obtained by Winnow when k and r are known a priori. Unfortunately, we have not yet succeeded in proving better iteration bounds. It is still open whether there exists an online learning algorithm that learns r-of-k threshold functions in O(rk ln n) updates without knowing k and r [23]. So far, PUMMA and ALMA approximate the ∞-norm margin only indirectly, by setting p = O(ln n); developing an adaptive online algorithm that directly maximizes the ∞-norm margin is also an open problem. One direction for future work is to extend our algorithm to handle the 1-norm soft margin, which is commonly used in SVM. Further, we would like to apply PUMMA to learning sparse classifiers in practical applications.

Acknowledgments

We thank anonymous referees for helpful comments.

References

[1] J. K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhysics Letters, 10:687-692, 1989.

[2] A.
Asuncion and D. J. Newman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, http://mlearn.ics.uci.edu/MLRepository.html, 2007.

[3] H. H. Bauschke and P. L. Combettes. A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert spaces. Mathematics of Operations Research, 26(2):248-264, 2001.

[4] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579-1619, 2005.

[5] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.

[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[8] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

[10] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[11] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-299, 1999.

[12] T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proceedings of the 15th International Conference on Machine Learning, 1998.

[13] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.

[14] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265-299, 2003.

[15] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates.
In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 171-183, 1997.

[16] T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184. MIT Press, 1999.

[17] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[18] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1):124-136, 2000.

[19] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165-2176, 2004.

[20] J. Kivinen, M. K. Warmuth, and P. Auer. The perceptron algorithm versus winnow: linear versus logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97(1-2):325-343, 1997.

[21] A. Kowalczyk. Maximum margin perceptron. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 75-114. MIT Press, 2000.

[22] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361-387, 2002.

[23] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.

[24] P. M. Long and X. Wu. Mistake bounds for maximum entropy discrimination. In Advances in Neural Information Processing Systems 17, pages 833-840, 2004.

[25] O. L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15-23, 1999.

[26] M. L. Minsky and S. A. Papert. Perceptrons. MIT Press, 1969.

[27] A. B. Novikoff. On convergence proofs on perceptrons. In Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622. Polytechnic Institute of Brooklyn, 1962.

[28] E. Osuna, R.
Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proceedings of IEEE NNSP'97, 1997.

[29] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1999.

[30] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[31] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.

[32] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.

[33] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 2007.

[34] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

[35] P. Tsampouka and J. Shawe-Taylor. Approximate maximum margin algorithms with rules controlled by the number of mistakes. In Proceedings of the 24th International Conference on Machine Learning, 2007.