A simpler unified analysis of Budget Perceptrons

Ilya Sutskever  ILYA@CS.UTORONTO.CA
University of Toronto, 6 King's College Rd., Toronto, Ontario, M5S 3G4, Canada

Abstract

The kernel Perceptron is an appealing online learning algorithm that has a drawback: whenever it makes an error it must increase its support set, which slows training and testing if the number of errors is large. The Forgetron and the Randomized Budget Perceptron algorithms overcome this problem by restricting the number of support vectors the Perceptron is allowed to have. These algorithms have regret bounds whose proofs are dissimilar. In this paper we propose a unified analysis of both algorithms by observing that the way in which they remove support vectors can be seen as a form of $L_2$-regularization. By casting these algorithms as instances of online convex optimization and applying a variant of Zinkevich's theorem that tolerates noisy and incorrect gradients, we can bound their regret more easily than before. Our bounds are similar to the existing ones, but the proofs are less technical.

1. Introduction

Traditional batch learning algorithms update their parameters only after processing the entire training set. In contrast, online learning algorithms update their parameters after each training example or small batch of examples. When the training set is large enough, batch algorithms learn more slowly than online algorithms (e.g., (Bottou & Bousquet, 2008; Shalev-Shwartz & Srebro, 2008)). For example, if the training set is redundant, an online learning algorithm can finish learning before completing its first pass over the training set, which is impossible for a batch algorithm.

1.1. Perceptrons

The Perceptron (Rosenblatt, 1958) is the oldest and one of the most widespread online learning algorithms. It classifies $n$-dimensional real vectors into one of two possible classes ($1$ and $-1$), and stores its knowledge in the form of an $n$-dimensional weight vector $w$. It predicts the class $p = \operatorname{sign}(w^\top x)$ for the input vector $x$, and updates its parameters by the equation

$$w_{i+1} = w_i + x_i \cdot (y_i - p_i)/2 \qquad (1)$$

where $y_i \in \{1, -1\}$ is the label of the example $x_i$, and $p_i = \operatorname{sign}(w_i^\top x_i)$ is the Perceptron's prediction on the example $x_i$. This update is nonzero only when the Perceptron makes a mistake.

There are several variants of the Perceptron learning algorithm (e.g., (Shalev-Shwartz & Singer, 2005; Shalev-Shwartz et al., 2007; Crammer et al., 2006; Littlestone & Warmuth, 1989); see references in (Cesa-Bianchi & Lugosi, 2006)). They are similar as algorithms and have formal performance guarantees. The oldest guarantee of this kind is Novikoff's theorem (Novikoff, 1963), which states that the number of errors the Perceptron makes is at most $1/\gamma^2$, where $\gamma$ is the margin of a set of weights $w^\ast$ that makes no errors on the data (in this setting, the margin is the distance between the hyperplane defined by $w^\ast$ and the point closest to it; the theorem assumes that $\|x_i\| \le 1$). In this paper, all norms are $L_2$.
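Concretely, the update (1) amounts to a few lines of code. The following minimal Python sketch (with illustrative names; it is not code from the paper) makes one pass over the data and counts the mistakes, the quantity that Novikoff's theorem bounds.

```python
import numpy as np

# A minimal sketch of the Perceptron with update (1); names are illustrative.
def perceptron(X, y):
    """X: (T, n) array of examples; y: array of labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    errors = 0
    for x_i, y_i in zip(X, y):
        p_i = 1.0 if w @ x_i >= 0 else -1.0   # prediction sign(w^T x_i)
        w += x_i * (y_i - p_i) / 2            # nonzero only on a mistake
        errors += int(p_i != y_i)
    return w, errors
```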
1.2. Kernels

Despite its simplicity and appeal, the Perceptron can represent only a fairly limited set of hypotheses, and there are many practical problems for which the Perceptron simply cannot obtain low training error. This difficulty was first overcome by the Support Vector Machine (SVM) (Cortes & Vapnik, 1995), which is a nonlinear Perceptron that works very well in practice.

The strength of the SVM comes from two main sources. First, it uses the kernel trick (Scholkopf & Smola, 2002), which dramatically enlarges the set of hypotheses representable by the Perceptron. Consider mapping a training example $x$ to an expanded set of features $\phi(x)$; for example, $\phi((x_1, \dots, x_n)) = \bigl(\prod_{j \in S} x_j\bigr)_{S \subseteq \{1, \dots, n\}}$ (a $2^n$-dimensional vector). If, instead of classifying $x$, we classify $\phi(x)$ with a Perceptron, we can represent a much richer set of hypotheses (simply because Perceptrons that classify $\phi(x)$ have $2^n$ parameters, while Perceptrons that classify $x$ have only $n$). Using $\phi(x)$ directly is obviously infeasible due to its size; however, the inner product $\langle \phi(x), \phi(y) \rangle = \prod_{i=1}^n (1 + x_i \cdot y_i)$ is efficiently computable, and the kernel trick makes essential use of this efficiency: if we run the Perceptron on examples of the form $\phi(x)$, then its weight vector has the form

$$w = \sum_{i=1}^M \phi(x_i)\,\sigma_i \qquad (2)$$

where $\sigma_i \in \{\pm 1\}$, $M$ is the number of errors, and the $x_i$ are the examples on which the errors were made. When $w$ has this form, we can compute the Perceptron's predictions by the equation

$$\operatorname{sign}\,\langle w, \phi(x) \rangle = \operatorname{sign} \sum_{i=1}^M \sigma_i\,\langle \phi(x_i), \phi(x) \rangle \qquad (3)$$

which is dramatically more efficient than working with the $\phi(x)$'s directly. The inner product $\langle \phi(x_i), \phi(x) \rangle$ is represented by the kernel function $k(x_i, x)$. Following (Dekel et al., 2008), we call $\{x_1, \dots, x_M\}$ the active set of $w$.

In addition to the kernel trick, the SVM uses regularization and minimizes the hinge loss, which prevents the SVM from overfitting and causes it to find sparse solutions: vectors $w$ that can be expressed as a linear combination of a small number of $\phi(x_i)$'s. The hinge loss function $L$ is defined by the equation

$$L(w; (x, y)) = \max(0,\, 1 - \langle w, x \rangle y) \qquad (4)$$

and given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$, the SVM minimizes the cost function

$$\sum_{i=1}^n L(w; (x_i, y_i)) + \lambda \cdot \|w\|^2/2 \qquad (5)$$

An important property of the hinge loss is that if $w$ does not classify $x$ correctly (i.e., $\operatorname{sign}\langle w, x \rangle \ne y$), then $L(w; (x, y)) \ge 1$, implying that the total hinge loss on the training set upper-bounds the number of errors on the training set (e.g., (Shalev-Shwartz, 2007)).
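The following Python sketch (our own illustration, not code from the paper) implements prediction via eq. (3) for the product kernel above; it makes the cost structure of the next subsection concrete, since each prediction requires one kernel evaluation per support vector.

```python
import numpy as np

# A sketch of kernel Perceptron prediction, eqs. (2)-(3), with the kernel
# k(x, y) = prod_i (1 + x_i * y_i) from the text. `active_set` holds the
# mistaken examples x_j and `sigmas` their signs sigma_j in {+1, -1}.
def kernel(x, y):
    return np.prod(1.0 + x * y)

def predict(active_set, sigmas, x):
    # <w, phi(x)> = sum_j sigma_j * k(x_j, x): one kernel evaluation per
    # support vector, so prediction time grows with the active-set size.
    score = sum(s * kernel(x_j, x) for x_j, s in zip(active_set, sigmas))
    return 1.0 if score >= 0 else -1.0
```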
1.3. Budget

SVMs are usually trained with batch algorithms, but it is tempting to apply the plain Perceptron to the vectors $\phi(x)$, as described in the previous sections, in order to obtain an online learning algorithm: the kernel Perceptron. The main computational challenge in doing so is computing the inner products $\langle w, \phi(x) \rangle$. To do so, we must store the active set $\{x_1, \dots, x_M\}$ and evaluate the kernel function $M$ times, so the time to compute $\langle w, \phi(x) \rangle$ grows linearly with the size of the active set. If the Perceptron makes a large number of errors during training, the size of its active set will become large (see eq. 2), which will significantly slow both training and testing.

There have been a number of attempts to fix the problem of large active sets in the online kernel Perceptron (Crammer et al., 2004; Weston et al., 2005; Dekel et al., 2008; Cavallanti et al., 2007), the last two of which have formal performance guarantees. These algorithms enforce a strict upper bound $B$ (the budget) on the size of the active set, making sure that these Perceptrons are efficient. They do so by removing a vector from the active set whenever the active set becomes too large (a minimal version of this loop is sketched at the end of this subsection). We now describe the two algorithms relevant to this work.

The Forgetron (Dekel et al., 2008) is the first budget Perceptron with a formal performance guarantee. It enforces a strict bound on the active set by removing vectors as follows: first, it reduces the weight $\sigma_i$ of every vector in its active set; then, it discards the oldest vector in the active set, which has the smallest weight. This is done only when the size of the active set exceeds the budget $B$, so applying this removal procedure on every error ensures that the size of the active set never exceeds the budget. Due to the repeated weight reductions, the oldest vector has small weight, so removing it does not change the Perceptron significantly. The factor by which the Forgetron reduces the weights is not constant and differs from error to error.

The Randomized Budget Perceptron (RBP) of (Cavallanti et al., 2007) is considerably simpler than the Forgetron: whenever the Perceptron makes an error and the size of the active set exceeds the budget, the RBP discards a random vector from its active set.

Both algorithms have regret bounds whose proofs are different and somewhat nontrivial. In this paper, we present a view that lets us reprove similar regret bounds for the RBP and for a simplified Forgetron (one that reduces the weights by the same factor on each error) with less effort, using the same idea. The idea is that both algorithms can be seen as doing online gradient descent on the hinge loss with $L_2$-regularization on their errors, where the RBP adds a special kind of zero-mean noise to the gradient and the Forgetron has a small deterministic error in its gradient. Thus, we view these algorithms as performing online convex programming on the points where they err, and apply a version of Zinkevich's theorem (Zinkevich, 2003) that accounts for a changing target and for errors in the gradients, which we prove in the next section. This lets us bound the total number of errors by the hinge loss of the best slowly-changing sequence of hypotheses. As a result, we obtain similar proofs of a regret bound for the RBP and for the simplified Forgetron.

The regret bound that we prove for the RBP has a better dependence on the budget size than that of (Cavallanti et al., 2007), but it is competitive only against hypotheses of a slightly smaller norm. An unusual aspect of the regret bound for the RBP is that it is competitive not against a fixed Perceptron, but against a sequence of slowly-changing Perceptrons. Our bounds for both the Forgetron and the RBP cope with a sequence of slowly-changing Perceptrons, thanks to Lemma 1 below.
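The following sketch (with hypothetical names, reusing `predict` from the previous sketch) shows the generic budget-enforcing loop in its simplest, RBP-style form: on a mistake with a full active set, a random support vector is discarded before the new one is added.

```python
import random

# An illustrative budget-enforcing online loop, RBP-style (random removal).
def budget_kernel_perceptron(stream, B):
    active_set, sigmas = [], []
    for x, y in stream:
        p = predict(active_set, sigmas, x)
        if p != y:                                  # update only on a mistake
            if len(active_set) >= B:                # budget reached
                j = random.randrange(len(active_set))
                del active_set[j], sigmas[j]        # discard a random support vector
            active_set.append(x)
            sigmas.append((y - p) / 2)              # sigma_j in {+1, -1}
    return active_set, sigmas
```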
2. Online convex programming with errors in the gradients

Online convex programming (OCP) is the setting in which, in each round, a convex function $f$ is chosen; we choose a point $x$ without knowing $f$ and suffer the loss $f(x)$. Our goal is to have a cumulative loss that is not much greater than the cumulative loss of any fixed point $x^\ast$. Zinkevich's theorem states that this goal is attained by simple gradient descent. We will apply Zinkevich's theorem where the function $f$ is an $L_2$-regularized hinge loss of an unknown datapoint.

Our analysis requires a variant of Zinkevich's theorem that accounts both for tracking--i.e., being competitive against a slowly-changing sequence of hypotheses--and for errors in the gradient. A tracking version of Zinkevich's theorem is already known (Zinkevich, 2003); however, the existing variants do not account for errors in the gradient. We also need the result to hold for noisy gradients, which is done by applying the idea of lemma 2 of (Flaxman et al., 2005) (see subsection 3.1).

Lemma 1: Let $C$ be a closed convex set of diameter $U$. Let $f_1, \dots, f_T$ be a sequence of arbitrary convex functions (in the analysis below, each $f_i$ will be an $L_2$-regularized hinge loss). Let $E_1, \dots, E_T$ be an arbitrary sequence of vectors (the gradient errors), and let $E = \sum_{i=1}^T \|E_i\|$. Let $\eta > 0$ be the learning rate, and let $w_1 \in C$ be an arbitrary initial point. Define $w_{i+1} = \Pi_C(w_i - \eta \cdot \hat\nabla_i)$, where $\hat\nabla_i = \nabla f_i(w_i) + E_i$ and $\nabla f_i(w_i)$ is any subgradient of $f_i$ at $w_i$. The vector $\hat\nabla_i$ is a gradient with error, and $\Pi_C(x)$ is the closest point of $C$ to $x$. Let $G$ be such that $\|\hat\nabla_i\| \le G$ holds for all $i$. Let $w_0^\ast, \dots, w_T^\ast \in C$ be an arbitrary sequence of comparison vectors, and let $S = \sum_{i=1}^T \|w_{i-1}^\ast - w_i^\ast\|$ be their total shift. Then

$$\sum_{i=1}^T \bigl( f_i(w_i) - f_i(w_i^\ast) \bigr) \le \frac{U^2}{2\eta} + \frac{3 \cdot U S}{2\eta} + \frac{T \eta \cdot G^2}{2} + U \cdot E \qquad (6)$$

Note that if there is no shift ($S = 0$) and there are no errors in the gradient ($E = 0$), then we recover the original form of Zinkevich's theorem. The variable $w_0^\ast$ is introduced only to make the proof shorter, and can be eliminated by setting $w_0^\ast = w_1^\ast$, in which case the shift is $S = \sum_{i=1}^{T-1} \|w_i^\ast - w_{i+1}^\ast\|$. Note, also, that the ability to cope with a changing optimal solution is inversely proportional to the learning rate: the smaller the learning rate, the harder it is to keep track of change, which makes intuitive sense.

Proof: Our proof is standard and is very similar to (Zinkevich, 2003). It measures the speed with which we approach the competing points. For $i = 1, \dots, T$, let $D_i = \|w_i - w_{i-1}^\ast\|^2 - \|w_{i+1} - w_i^\ast\|^2$. Then

$$\begin{aligned}
D_i &= \|w_i - w_{i-1}^\ast\|^2 - \|w_{i+1} - w_i^\ast\|^2 \\
    &= \|w_i - w_{i-1}^\ast\|^2 - \|\Pi_C(w_i - \eta \cdot \hat\nabla_i) - w_i^\ast\|^2 \\
    &\overset{(1)}{\ge} \|w_i - w_{i-1}^\ast\|^2 - \|w_i - \eta \cdot \hat\nabla_i - w_i^\ast\|^2 \\
    &= \|w_i - w_{i-1}^\ast\|^2 - \|(w_i - w_{i-1}^\ast) + (w_{i-1}^\ast - w_i^\ast) - \eta \cdot \hat\nabla_i\|^2 \\
    &= -\|w_{i-1}^\ast - w_i^\ast\|^2 - \eta^2 \|\hat\nabla_i\|^2 + 2\eta\,(\nabla f_i(w_i) + E_i)^\top (w_i - w_i^\ast) - 2\,(w_i - w_{i-1}^\ast)^\top (w_{i-1}^\ast - w_i^\ast) \\
    &\overset{(2)}{\ge} -\|w_{i-1}^\ast - w_i^\ast\|^2 - \eta^2 G^2 + 2\eta\,(f_i(w_i) - f_i(w_i^\ast)) - 2\eta \|E_i\| \cdot \|w_i - w_i^\ast\| - 2\,\|w_i - w_{i-1}^\ast\| \cdot \|w_{i-1}^\ast - w_i^\ast\| \\
    &\overset{(3)}{\ge} -3U \cdot \|w_{i-1}^\ast - w_i^\ast\| - \eta^2 G^2 + 2\eta\,(f_i(w_i) - f_i(w_i^\ast)) - 2\eta \|E_i\| \cdot U
\end{aligned}$$

The three inequalities are justified as follows:

- In ineq. (1) we used $\|\Pi_C(a) - b\| \le \|a - b\|$ for all $b \in C$ and all $a$, since $C$ is convex.
- In ineq. (2) we used $\nabla f_i(w_i)^\top (w_i - w_i^\ast) \ge f_i(w_i) - f_i(w_i^\ast)$, since $f_i$ is convex and $\nabla f_i(w_i)$ is a subgradient; we also applied the Cauchy-Schwarz inequality twice.
- In ineq. (3) we used the fact that all the points are in $C$ and that its diameter is $U$.

We can finish the proof by noticing that $\sum_{i=1}^T D_i \le U^2$, since the sum telescopes to $\|w_1 - w_0^\ast\|^2 - \|w_{T+1} - w_T^\ast\|^2$; rearranging and dividing by $2\eta$, we get the stated bound.
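Before applying Lemma 1, it may help to see its update rule in code. This is a sketch under our own naming, with $C$ taken to be the norm ball of radius $U/2$ used in subsection 3.1, so the projection is plain norm clipping.

```python
import numpy as np

# A sketch of the update analysed in Lemma 1: projected (sub)gradient
# descent with a corrupted gradient g_hat = grad f_i(w_i) + E_i.
def project(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def lemma1_step(w_i, subgrad_f, E_i, eta, radius):
    g_hat = subgrad_f(w_i) + E_i                  # corrupted subgradient
    return project(w_i - eta * g_hat, radius)     # w_{i+1} = Pi_C(w_i - eta * g_hat)
```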
3. Bounding the regret

In this section we explain the main idea of the unified analysis. Consider both the Forgetron and the RBP when the active set of $w$ already contains $B$ vectors, and suppose that they make an error on a vector $x$. Then both algorithms remove a vector from the active set and add $x$ to the active set with an appropriate value for its weight $\sigma$. We will show that both algorithms follow the gradient of an $L_2$-regularized hinge loss with corrupted gradients.

Adding a vector to the active set according to the Perceptron's learning rule is equivalent to taking a gradient step on the hinge loss at an input $x$ on which the Perceptron errs: if $(x, y)$ is a datapoint and $p = \operatorname{sign}\langle w, x \rangle$ is an incorrect prediction ($p \ne y$), then $\nabla L(w; (x, y)) = -(y - p)/2 \cdot x$. Removing a vector from the active set can be seen as a form of $L_2$-regularization for both algorithms, where the gradients are corrupted in different ways.

Consider first the Forgetron. Recall that it scales down all the weights $\sigma_i$ by a constant on each error, which is $L_2$-regularization with some weight and some learning rate. However, in addition to the scaling, the Forgetron also removes the vector with the smallest weight. If the weight of the removed vector is small, then the Forgetron can be seen as doing $L_2$-regularization with a small amount of error in the gradient (the error being the removal of the oldest point), so we can apply Lemma 1.

Next, consider the RBP, which removes a random vector from its active set. Let $w = \sum_{i=1}^B \phi(x_i)\sigma_i$, let $I \sim \mathrm{Unif}(1, \dots, B)$, and let

$$s(w) = \phi(x_I)\,\sigma_I$$

(when $w$ is represented as the sum of fewer than $B$ terms, we pad the sum with zero terms to make sure that $w$ is represented as the sum of $B$ terms; for example, if $w = \eta\,\phi(x_1) + \eta\,\phi(x_2)$, then $s(w) = 0$ with probability $1 - 2/B$). Removing a random vector from the active set is equivalent to replacing $w$ with $w - s(w)$. However, notice that $\mathrm{E}[s(w)] = w/B$; at the same time, replacing $w$ with $w - w/B$ is what we get from following the gradient of an $L_2$-regularizer. Hence, replacing $w$ with $w - s(w)$ is no different from performing $L_2$-regularization where the gradient is corrupted with a certain type of zero-mean noise that depends on $w$.

Algorithm 1 The modified Randomized Budget Perceptron.
  Input: data $\{(x_1, y_1), \dots, (x_T, y_T)\}$, budget $B$, the convex domain $C$, the learning rate $\eta$.
  Initialize $w_1 \leftarrow 0$.
  for $i = 1$ to $T$ do
    Set $p_i \leftarrow \operatorname{sign}(\langle w_i, \phi(x_i) \rangle)$ (the prediction)
    if $p_i = y_i$, i.e., if $w_i$ predicts $x_i$ correctly, then
      Set $w_{i+1} \leftarrow w_i$
    else
      Set $w_{i+1} \leftarrow \Pi_C(w_i + \eta \cdot (y_i - p_i)/2 \cdot \phi(x_i) - s(w_i))$
    end if
  end for

In what follows, we cast the Forgetron and the RBP as algorithms that perform online convex optimization with an $L_2$-regularized hinge loss, where the Forgetron has errors in its gradients and the RBP is given noisy gradients. We run the online convex optimization only on the datapoints on which the Perceptron makes a mistake, as in (Shalev-Shwartz, 2007, ch. 2), which is straightforward in the Forgetron's case. It is less straightforward in the case of the RBP, because the set of vectors on which the Perceptron makes a mistake is random; nonetheless, we will show in subsection 3.1 that the idea of lemma 2 of (Flaxman et al., 2005) still applies.

Both proofs have an analogous structure; they differ in their choice of the parameters $U$ (the diameter of the set of vectors we compete against), $\lambda$ (the weight decay), $\eta$ (the learning rate), and in the manner in which the gradient is corrupted. Finally, we will see that our algorithms use large learning rates; because of that, both Perceptrons are able to track a changing hypothesis (a direct consequence of Lemma 1), which was known for the RBP but not for the Forgetron.

We will assume that the vectors $\phi(x_i)$ are distinct and that $\|\phi(x)\| \le 1$ for all $x$, which implies the following fact.

Fact 1: If we run Algorithms 1 and 2, then $w_t = \sum_{j=1}^B \phi(x_j)\sigma_j$, where $|\sigma_j| \le \eta$ for all $j$. Thus the weight of every vector never exceeds the learning rate, which follows from the form of the gradient of the hinge loss and from the fact that all the vectors are distinct.
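The padding convention and the identity $\mathrm{E}[s(w)] = w/B$ are easy to check numerically. The sketch below (toy data and names of our own choosing) draws many samples of $s(w)$ and compares their average with $w/B$.

```python
import numpy as np

# A sketch of the RBP removal noise: s(w) = phi(x_I) * sigma_I with
# I ~ Unif(1, ..., B); zero terms pad the sum when the active set holds
# fewer than B vectors.
def sample_s(phis, sigmas, B, rng):
    I = rng.integers(B)                  # uniform over the B slots
    if I >= len(phis):                   # a padded zero term was drawn
        return np.zeros_like(phis[0])
    return phis[I] * sigmas[I]

rng = np.random.default_rng(0)
phis = [rng.normal(size=3) for _ in range(4)]    # 4 support vectors, B = 10
sigmas = [1.0, -1.0, 1.0, -1.0]
w = sum(p * s for p, s in zip(phis, sigmas))
avg = np.mean([sample_s(phis, sigmas, 10, rng) for _ in range(100000)], axis=0)
print(avg, w / 10)                               # the two should nearly agree
```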
3.1. The Randomized Budget Perceptron

We will now cast the RBP as an instance of online convex programming. The modified RBP (Algorithm 1) differs from the original RBP in two ways, in order to be more compatible with Lemma 1. First, it prevents the weight vector from becoming too large (using the projection $\Pi_C$); and second, it occasionally trims the active set even when its size is smaller than $B$, because $s(w)$ is subtracted even when the active set is smaller than $B$. However, when the active set is small, $s(w)$ is likely to be zero.

Here is the setup. We are given $B$, the budget; we will determine $\lambda$, $\eta$, and $U$ as the proof goes on. We are also given an arbitrary sequence of points $(x_1, y_1), \dots, (x_T, y_T)$ that satisfy $\|\phi(x_i)\| \le 1$ for every $i$. First, the convex domain $C$ is $\{w : \|w\| \le U/2\}$, so the diameter of $C$ is $U$. Second, the functions to be minimized are $f_i(w) = L(w; (\phi(x_i), y_i)) + \lambda \cdot \|w\|^2/2$. We will write $L_i(w)$ for $L(w; (\phi(x_i), y_i))$.

Consider Algorithm 1. Once we choose $\lambda$, the choice of the learning rate $\eta$ is constrained. Indeed, we wish to replace update steps of the form

$$\Delta_i = -\eta \cdot \nabla f_i(w_i) = -\eta \cdot \nabla L_i(w_i) - \eta\lambda \cdot w_i$$

with the random update

$$\hat\Delta_i = -\eta \cdot \nabla L_i(w_i) - s(w_i).$$

In order for $\hat\Delta_i$ to satisfy $\mathrm{E}[\hat\Delta_i] = \Delta_i$, we must have $\eta\lambda = 1/B$, since $\mathrm{E}[s(w)] = w/B$.

We will now show how to apply an idea of (Flaxman et al., 2005) to our setting in a way that considers only a subset of the training points. Consider a sequence of noise variables $n_1, \dots, n_T$, defined as the noise in our gradients. Specifically, as we run the modified RBP on the a-priori known sequence of training points, we have noise in our gradients whenever we make an error. If the algorithm makes a mistake on example $x_i$, then let $n_i = -\hat\Delta_i/\eta + \Delta_i/\eta$ be the noise; that is, $n_i$ is exactly the amount of noise we add to the gradient $\nabla f_i(w_i)$ in order to obtain the noisy gradient $-\hat\Delta_i/\eta$ (note that $-\Delta_i/\eta = \nabla f_i(w_i)$). If the algorithm makes no error, let $n_i = 0$. As a result, $n_i$ satisfies $\mathrm{E}[n_i \mid n_1, \dots, n_{i-1}] = 0$.

For each $i$, define the function $h_i(w) = f_i(w) + (w - w_i)^\top n_i$. Then $h_i(w_i) = f_i(w_i)$ and $\mathrm{E}[h_i(w)] = f_i(w)$ for all $w$; the functions $h_i$ are convex, and $\nabla h_i(w) = \nabla f_i(w) + n_i$. Running the modified RBP is equivalent to running online gradient descent with projections $\Pi_C$ on the functions $h_{m(j)}$ that correspond to the examples on which the RBP made an error (here $m(j)$ denotes the index of the example on which the $j$-th of the $M$ errors was made), so Lemma 1 applies to the functions $h_{m(j)}$ for $1 \le j \le M$ with the vectors $w_{m(j)}$.

Using the fact that the hinge loss is at least 1 whenever the algorithm makes an error, as well as applying Lemma 1, we get

$$M \le \sum_{j=1}^M f_{m(j)}(w_{m(j)}) = \sum_{j=1}^M h_{m(j)}(w_{m(j)}) \le \sum_{j=1}^M h_{m(j)}(w_{m(j)}^\ast) + \frac{U^2}{2\eta} + \frac{3 \cdot U S_M}{2\eta} + \frac{M \eta \cdot G^2}{2} + U \cdot E$$

where the second inequality is due to Lemma 1, $S_M$ is the shift of the vectors $\{w_{m(j)}^\ast\}_{j=1}^M$, and $G$ is greater than $\|\nabla h_{m(j)}(w_{m(j)})\|$ for all $j$. But we know that $h_i(w) = L_i(w) + \lambda \cdot \|w\|^2/2 + (w - w_i)^\top n_i$, that $\|w\|^2 \le U^2/4$ on $C$, and that $E = 0$ (the gradient errors vanish because the noise is already part of the $h_i$). Using this knowledge, we can upper bound the number of errors by