A simpler unified analysis of Budget Perceptrons

Ilya Sutskever  ILYA@CS.UTORONTO.CA
University of Toronto, 6 King's College Rd., Toronto, Ontario, M5S 3G4, Canada

Abstract

The kernel Perceptron is an appealing online learning algorithm that has a drawback: whenever it makes an error it must increase its support set, which slows training and testing if the number of errors is large. The Forgetron and the Randomized Budget Perceptron algorithms overcome this problem by restricting the number of support vectors the Perceptron is allowed to have. These algorithms have regret bounds whose proofs are dissimilar. In this paper we propose a unified analysis of both algorithms by observing that the way in which they remove support vectors can be seen as a form of $L_2$-regularization. By casting these algorithms as instances of online convex optimization and applying a variant of Zinkevich's theorem that tolerates noisy and incorrect gradients, we can bound their regret more easily than before. Our bounds are similar to the existing ones, but the proofs are less technical.

1. Introduction

Traditional batch learning algorithms update their parameters only after processing the entire training set. In contrast, online learning algorithms update their parameters after each training example or small batch of examples. When the training set is large enough, batch algorithms learn more slowly than online algorithms (e.g., (Bottou & Bousquet, 2008; Shalev-Shwartz & Srebro, 2008)). For example, if the training set is redundant, an online learning algorithm can finish learning before completing its first pass over the training set, which is impossible for a batch algorithm.

1.1. Perceptrons

The Perceptron (Rosenblatt, 1958) is the oldest and one of the most widespread online learning algorithms. It classifies $n$-dimensional real vectors into one of two possible classes ($1$ and $-1$), and stores its knowledge in the form of an $n$-dimensional weight vector $w$. It predicts the class $p = \operatorname{sign}(w^\top x)$ for the input vector $x$, and updates its parameters by the equation

$$w_{i+1} = w_i + x_i \cdot (y_i - p_i)/2 \qquad (1)$$

where $y_i \in \{1, -1\}$ is the label of the example $x_i$, and $p_i = \operatorname{sign}(w_i^\top x_i)$ is the Perceptron's prediction on the example $x_i$. This update is nonzero only when the Perceptron makes a mistake.

There are several variants of the Perceptron learning algorithm (e.g., (Shalev-Shwartz & Singer, 2005; Shalev-Shwartz et al., 2007; Crammer et al., 2006; Littlestone & Warmuth, 1989); see references in (Cesa-Bianchi & Lugosi, 2006)). They are similar as algorithms and have formal performance guarantees. The oldest guarantee of this kind is Novikoff's theorem (Novikoff, 1963), which states that the number of errors the Perceptron makes is at most $1/\gamma^2$, where $\gamma$ is the margin of a set of weights $w^\ast$ that makes no errors on the data (in this setting, the margin is the distance between the hyperplane defined by $w^\ast$ and the point closest to it; the theorem assumes that $\|x_i\| \le 1$). In this paper, all norms are $L_2$.
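Concretely, the update (1) amounts to a few lines of code. The following minimal Python sketch (with illustrative names; it is not code from the paper) makes one pass over the data and counts the mistakes, the quantity that Novikoff's theorem bounds.

```python
import numpy as np

# A minimal sketch of the Perceptron with update (1); names are illustrative.
def perceptron(X, y):
    """X: (T, n) array of examples; y: array of labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    errors = 0
    for x_i, y_i in zip(X, y):
        p_i = 1.0 if w @ x_i >= 0 else -1.0   # prediction sign(w^T x_i)
        w += x_i * (y_i - p_i) / 2            # nonzero only on a mistake
        errors += int(p_i != y_i)
    return w, errors
```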
1.2. Kernels

Despite its simplicity and appeal, the Perceptron can represent only a fairly limited set of hypotheses, and there are many practical problems for which the Perceptron simply cannot obtain low training error. This difficulty was first overcome by the Support Vector Machine (SVM) (Cortes & Vapnik, 1995), which is a nonlinear Perceptron that works very well in practice.

The strength of the SVM comes from two main sources. First, it uses the kernel trick (Scholkopf & Smola, 2002), which dramatically enlarges the set of hypotheses representable by the Perceptron. Consider mapping a training example $x$ to an expanded set of features $\phi(x)$; for example, $\phi((x_1, \dots, x_n)) = \bigl(\prod_{j \in S} x_j\bigr)_{S \subseteq \{1, \dots, n\}}$ (a $2^n$-dimensional vector). If, instead of classifying $x$, we classify $\phi(x)$ with a Perceptron, we can represent a much richer set of hypotheses (simply because Perceptrons that classify $\phi(x)$ have $2^n$ parameters, while Perceptrons that classify $x$ have only $n$). Using $\phi(x)$ directly is obviously infeasible due to its size; however, the inner product $\langle \phi(x), \phi(y) \rangle = \prod_{i=1}^n (1 + x_i \cdot y_i)$ is efficiently computable, and the kernel trick makes essential use of this efficiency: if we run the Perceptron on examples of the form $\phi(x)$, then its weight vector has the form

$$w = \sum_{i=1}^M \phi(x_i)\,\sigma_i \qquad (2)$$

where $\sigma_i \in \{\pm 1\}$, $M$ is the number of errors, and the $x_i$ are the examples on which the errors were made. When $w$ has this form, we can compute the Perceptron's predictions by the equation

$$\operatorname{sign}\,\langle w, \phi(x) \rangle = \operatorname{sign} \sum_{i=1}^M \sigma_i\,\langle \phi(x_i), \phi(x) \rangle \qquad (3)$$

which is dramatically more efficient than working with the $\phi(x)$'s directly. The inner product $\langle \phi(x_i), \phi(x) \rangle$ is represented by the kernel function $k(x_i, x)$. Following (Dekel et al., 2008), we call $\{x_1, \dots, x_M\}$ the active set of $w$.

In addition to the kernel trick, the SVM uses regularization and minimizes the hinge loss, which prevents the SVM from overfitting and causes it to find sparse solutions: vectors $w$ that can be expressed as a linear combination of a small number of $\phi(x_i)$'s. The hinge loss function $L$ is defined by the equation

$$L(w; (x, y)) = \max(0,\, 1 - \langle w, x \rangle y) \qquad (4)$$

and given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$, the SVM minimizes the cost function

$$\sum_{i=1}^n L(w; (x_i, y_i)) + \lambda \cdot \|w\|^2/2 \qquad (5)$$

An important property of the hinge loss is that if $w$ does not classify $x$ correctly (i.e., $\operatorname{sign}\langle w, x \rangle \ne y$), then $L(w; (x, y)) \ge 1$, implying that the total hinge loss on the training set upper-bounds the number of errors on the training set (e.g., (Shalev-Shwartz, 2007)).
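The following Python sketch (our own illustration, not code from the paper) implements prediction via eq. (3) for the product kernel above; it makes the cost structure of the next subsection concrete, since each prediction requires one kernel evaluation per support vector.

```python
import numpy as np

# A sketch of kernel Perceptron prediction, eqs. (2)-(3), with the kernel
# k(x, y) = prod_i (1 + x_i * y_i) from the text. `active_set` holds the
# mistaken examples x_j and `sigmas` their signs sigma_j in {+1, -1}.
def kernel(x, y):
    return np.prod(1.0 + x * y)

def predict(active_set, sigmas, x):
    # <w, phi(x)> = sum_j sigma_j * k(x_j, x): one kernel evaluation per
    # support vector, so prediction time grows with the active-set size.
    score = sum(s * kernel(x_j, x) for x_j, s in zip(active_set, sigmas))
    return 1.0 if score >= 0 else -1.0
```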
1.3. Budget

SVMs are usually trained with batch algorithms, but it is tempting to apply the plain Perceptron to the vectors $\phi(x)$, as described in the previous sections, in order to obtain an online learning algorithm: the kernel Perceptron. The main computational challenge in doing so is computing the inner products $\langle w, \phi(x) \rangle$. To do so, we must store the active set $\{x_1, \dots, x_M\}$ and evaluate the kernel function $M$ times, so the time to compute $\langle w, \phi(x) \rangle$ grows linearly with the size of the active set. If the Perceptron makes a large number of errors during training, the size of its active set will become large (see eq. 2), which will significantly slow both training and testing.

There have been a number of attempts to fix the problem of large active sets in the online kernel Perceptron (Crammer et al., 2004; Weston et al., 2005; Dekel et al., 2008; Cavallanti et al., 2007), the last two of which have formal performance guarantees. These algorithms enforce a strict upper bound $B$ (the budget) on the size of the active set, making sure that these Perceptrons are efficient. They do so by removing a vector from the active set whenever the active set becomes too large (a minimal version of this loop is sketched at the end of this subsection). We now describe the two algorithms relevant to this work.

The Forgetron (Dekel et al., 2008) is the first budget Perceptron with a formal performance guarantee. It enforces a strict bound on the active set by removing vectors as follows: first, it reduces the weight $\sigma_i$ of every vector in its active set; then, it discards the oldest vector in the active set, which has the smallest weight. This is done only when the size of the active set exceeds the budget $B$, so applying this removal procedure on every error ensures that the size of the active set never exceeds the budget. Due to the repeated weight reductions, the oldest vector has small weight, so removing it does not change the Perceptron significantly. The factor by which the Forgetron reduces the weights is not constant and differs from error to error.

The Randomized Budget Perceptron (RBP) of (Cavallanti et al., 2007) is considerably simpler than the Forgetron: whenever the Perceptron makes an error and the size of the active set exceeds the budget, the RBP discards a random vector from its active set.

Both algorithms have regret bounds whose proofs are different and somewhat nontrivial. In this paper, we present a view that lets us reprove similar regret bounds for the RBP and for a simplified Forgetron (one that reduces the weights by the same factor on each error) with less effort, using the same idea. The idea is that both algorithms can be seen as doing online gradient descent on the hinge loss with $L_2$-regularization on their errors, where the RBP adds a special kind of zero-mean noise to the gradient and the Forgetron has a small deterministic error in its gradient. Thus, we view these algorithms as performing online convex programming on the points where they err, and apply a version of Zinkevich's theorem (Zinkevich, 2003) that accounts for a changing target and for errors in the gradients, which we prove in the next section. This lets us bound the total number of errors by the hinge loss of the best slowly-changing sequence of hypotheses. As a result, we obtain similar proofs of a regret bound for the RBP and for the simplified Forgetron.

The regret bound that we prove for the RBP has a better dependence on the budget size than that of (Cavallanti et al., 2007), but it is competitive only against hypotheses of a slightly smaller norm. An unusual aspect of the regret bound for the RBP is that it is competitive not against a fixed Perceptron, but against a sequence of slowly-changing Perceptrons. Our bounds for both the Forgetron and the RBP cope with a sequence of slowly-changing Perceptrons, thanks to Lemma 1 below.
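The following sketch (with hypothetical names, reusing `predict` from the previous sketch) shows the generic budget-enforcing loop in its simplest, RBP-style form: on a mistake with a full active set, a random support vector is discarded before the new one is added.

```python
import random

# An illustrative budget-enforcing online loop, RBP-style (random removal).
def budget_kernel_perceptron(stream, B):
    active_set, sigmas = [], []
    for x, y in stream:
        p = predict(active_set, sigmas, x)
        if p != y:                                  # update only on a mistake
            if len(active_set) >= B:                # budget reached
                j = random.randrange(len(active_set))
                del active_set[j], sigmas[j]        # discard a random support vector
            active_set.append(x)
            sigmas.append((y - p) / 2)              # sigma_j in {+1, -1}
    return active_set, sigmas
```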
2. Online convex programming with errors in the gradients

Online convex programming (OCP) is the setting in which, in each round, a convex function $f$ is chosen; we choose a point $x$ without knowing $f$ and suffer the loss $f(x)$. Our goal is to have a cumulative loss that is not much greater than the cumulative loss of any fixed point $x^\ast$. Zinkevich's theorem states that this goal is attained by simple gradient descent. We will apply Zinkevich's theorem where the function $f$ is an $L_2$-regularized hinge loss of an unknown datapoint.

Our analysis requires a variant of Zinkevich's theorem that accounts both for tracking--i.e., being competitive against a slowly-changing sequence of hypotheses--and for errors in the gradient. A tracking version of Zinkevich's theorem is already known (Zinkevich, 2003); however, the existing variants do not account for errors in the gradient. We also need the result to hold for noisy gradients, which is done by applying the idea of lemma 2 of (Flaxman et al., 2005) (see subsection 3.1).

Lemma 1: Let $C$ be a closed convex set of diameter $U$. Let $f_1, \dots, f_T$ be a sequence of arbitrary convex functions (in the analysis below, each $f_i$ will be an $L_2$-regularized hinge loss). Let $E_1, \dots, E_T$ be an arbitrary sequence of vectors (the gradient errors), and let $E = \sum_{i=1}^T \|E_i\|$. Let $\eta > 0$ be the learning rate, and let $w_1 \in C$ be an arbitrary initial point. Define $w_{i+1} = \Pi_C(w_i - \eta \cdot \hat\nabla_i)$, where $\hat\nabla_i = \nabla f_i(w_i) + E_i$ and $\nabla f_i(w_i)$ is any subgradient of $f_i$ at $w_i$. The vector $\hat\nabla_i$ is a gradient with error, and $\Pi_C(x)$ is the closest point of $C$ to $x$. Let $G$ be such that $\|\hat\nabla_i\| \le G$ holds for all $i$. Let $w_0^\ast, \dots, w_T^\ast \in C$ be an arbitrary sequence of comparison vectors, and let $S = \sum_{i=1}^T \|w_{i-1}^\ast - w_i^\ast\|$ be their total shift. Then

$$\sum_{i=1}^T \bigl( f_i(w_i) - f_i(w_i^\ast) \bigr) \le \frac{U^2}{2\eta} + \frac{3 \cdot U S}{2\eta} + \frac{T \eta \cdot G^2}{2} + U \cdot E \qquad (6)$$

Note that if there is no shift ($S = 0$) and there are no errors in the gradient ($E = 0$), then we recover the original form of Zinkevich's theorem. The variable $w_0^\ast$ is introduced only to make the proof shorter, and can be eliminated by setting $w_0^\ast = w_1^\ast$, in which case the shift is $S = \sum_{i=1}^{T-1} \|w_i^\ast - w_{i+1}^\ast\|$. Note, also, that the ability to cope with a changing optimal solution is inversely proportional to the learning rate: the smaller the learning rate, the harder it is to keep track of change, which makes intuitive sense.

Proof: Our proof is standard and is very similar to (Zinkevich, 2003). It measures the speed with which we approach the competing points. For $i = 1, \dots, T$, let $D_i = \|w_i - w_{i-1}^\ast\|^2 - \|w_{i+1} - w_i^\ast\|^2$. Then

$$\begin{aligned}
D_i &= \|w_i - w_{i-1}^\ast\|^2 - \|w_{i+1} - w_i^\ast\|^2 \\
    &= \|w_i - w_{i-1}^\ast\|^2 - \|\Pi_C(w_i - \eta \cdot \hat\nabla_i) - w_i^\ast\|^2 \\
    &\overset{(1)}{\ge} \|w_i - w_{i-1}^\ast\|^2 - \|w_i - \eta \cdot \hat\nabla_i - w_i^\ast\|^2 \\
    &= \|w_i - w_{i-1}^\ast\|^2 - \|(w_i - w_{i-1}^\ast) + (w_{i-1}^\ast - w_i^\ast) - \eta \cdot \hat\nabla_i\|^2 \\
    &= -\|w_{i-1}^\ast - w_i^\ast\|^2 - \eta^2 \|\hat\nabla_i\|^2 + 2\eta\,(\nabla f_i(w_i) + E_i)^\top (w_i - w_i^\ast) - 2\,(w_i - w_{i-1}^\ast)^\top (w_{i-1}^\ast - w_i^\ast) \\
    &\overset{(2)}{\ge} -\|w_{i-1}^\ast - w_i^\ast\|^2 - \eta^2 G^2 + 2\eta\,(f_i(w_i) - f_i(w_i^\ast)) - 2\eta \|E_i\| \cdot \|w_i - w_i^\ast\| - 2\,\|w_i - w_{i-1}^\ast\| \cdot \|w_{i-1}^\ast - w_i^\ast\| \\
    &\overset{(3)}{\ge} -3U \cdot \|w_{i-1}^\ast - w_i^\ast\| - \eta^2 G^2 + 2\eta\,(f_i(w_i) - f_i(w_i^\ast)) - 2\eta \|E_i\| \cdot U
\end{aligned}$$

The three inequalities are justified as follows:

- In ineq. (1) we used $\|\Pi_C(a) - b\| \le \|a - b\|$ for all $b \in C$ and all $a$, since $C$ is convex.
- In ineq. (2) we used $\nabla f_i(w_i)^\top (w_i - w_i^\ast) \ge f_i(w_i) - f_i(w_i^\ast)$, since $f_i$ is convex and $\nabla f_i(w_i)$ is a subgradient; we also applied the Cauchy-Schwarz inequality twice.
- In ineq. (3) we used the fact that all the points are in $C$ and that its diameter is $U$.

We can finish the proof by noticing that $\sum_{i=1}^T D_i \le U^2$, since the sum telescopes to $\|w_1 - w_0^\ast\|^2 - \|w_{T+1} - w_T^\ast\|^2$; rearranging and dividing by $2\eta$, we get the stated bound.
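Before applying Lemma 1, it may help to see its update rule in code. This is a sketch under our own naming, with $C$ taken to be the norm ball of radius $U/2$ used in subsection 3.1, so the projection is plain norm clipping.

```python
import numpy as np

# A sketch of the update analysed in Lemma 1: projected (sub)gradient
# descent with a corrupted gradient g_hat = grad f_i(w_i) + E_i.
def project(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def lemma1_step(w_i, subgrad_f, E_i, eta, radius):
    g_hat = subgrad_f(w_i) + E_i                  # corrupted subgradient
    return project(w_i - eta * g_hat, radius)     # w_{i+1} = Pi_C(w_i - eta * g_hat)
```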
3. Bounding the regret

In this section we explain the main idea of the unified analysis. Consider both the Forgetron and the RBP when the active set of $w$ already contains $B$ vectors, and suppose that they make an error on a vector $x$. Then both algorithms remove a vector from the active set and add $x$ to the active set with an appropriate value for its weight $\sigma$. We will show that both algorithms follow the gradient of an $L_2$-regularized hinge loss with corrupted gradients.

Adding a vector to the active set according to the Perceptron's learning rule is equivalent to taking a gradient step on the hinge loss at an input $x$ on which the Perceptron errs: if $(x, y)$ is a datapoint and $p = \operatorname{sign}\langle w, x \rangle$ is an incorrect prediction ($p \ne y$), then $\nabla L(w; (x, y)) = -(y - p)/2 \cdot x$. Removing a vector from the active set can be seen as a form of $L_2$-regularization for both algorithms, where the gradients are corrupted in different ways.

Consider first the Forgetron. Recall that it scales down all the weights $\sigma_i$ by a constant on each error, which is $L_2$-regularization with some weight and some learning rate. However, in addition to the scaling, the Forgetron also removes the vector with the smallest weight. If the weight of the removed vector is small, then the Forgetron can be seen as doing $L_2$-regularization with a small amount of error in the gradient (the error being the removal of the oldest point), so we can apply Lemma 1.

Next, consider the RBP, which removes a random vector from its active set. Let $w = \sum_{i=1}^B \phi(x_i)\sigma_i$, let $I \sim \mathrm{Unif}(1, \dots, B)$, and let

$$s(w) = \phi(x_I)\,\sigma_I$$

(when $w$ is represented as the sum of fewer than $B$ terms, we pad the sum with zero terms to make sure that $w$ is represented as the sum of $B$ terms; for example, if $w = \eta\,\phi(x_1) + \eta\,\phi(x_2)$, then $s(w) = 0$ with probability $1 - 2/B$). Removing a random vector from the active set is equivalent to replacing $w$ with $w - s(w)$. However, notice that $\mathrm{E}[s(w)] = w/B$; at the same time, replacing $w$ with $w - w/B$ is what we get from following the gradient of an $L_2$-regularizer. Hence, replacing $w$ with $w - s(w)$ is no different from performing $L_2$-regularization where the gradient is corrupted with a certain type of zero-mean noise that depends on $w$.

Algorithm 1 The modified Randomized Budget Perceptron.
  Input: data $\{(x_1, y_1), \dots, (x_T, y_T)\}$, budget $B$, the convex domain $C$, the learning rate $\eta$.
  Initialize $w_1 \leftarrow 0$.
  for $i = 1$ to $T$ do
    Set $p_i \leftarrow \operatorname{sign}(\langle w_i, \phi(x_i) \rangle)$ (the prediction)
    if $p_i = y_i$, i.e., if $w_i$ predicts $x_i$ correctly, then
      Set $w_{i+1} \leftarrow w_i$
    else
      Set $w_{i+1} \leftarrow \Pi_C(w_i + \eta \cdot (y_i - p_i)/2 \cdot \phi(x_i) - s(w_i))$
    end if
  end for

In what follows, we cast the Forgetron and the RBP as algorithms that perform online convex optimization with an $L_2$-regularized hinge loss, where the Forgetron has errors in its gradients and the RBP is given noisy gradients. We run the online convex optimization only on the datapoints on which the Perceptron makes a mistake, as in (Shalev-Shwartz, 2007, ch. 2), which is straightforward in the Forgetron's case. It is less straightforward in the case of the RBP, because the set of vectors on which the Perceptron makes a mistake is random; nonetheless, we will show in subsection 3.1 that the idea of lemma 2 of (Flaxman et al., 2005) still applies.

Both proofs have an analogous structure; they differ in their choice of the parameters $U$ (the diameter of the set of vectors we compete against), $\lambda$ (the weight decay), $\eta$ (the learning rate), and in the manner in which the gradient is corrupted. Finally, we will see that our algorithms use large learning rates; because of that, both Perceptrons are able to track a changing hypothesis (a direct consequence of Lemma 1), which was known for the RBP but not for the Forgetron.

We will assume that the vectors $\phi(x_i)$ are distinct and that $\|\phi(x)\| \le 1$ for all $x$, which implies the following fact.

Fact 1: If we run Algorithms 1 and 2, then $w_t = \sum_{j=1}^B \phi(x_j)\sigma_j$, where $|\sigma_j| \le \eta$ for all $j$. Thus the weight of every vector never exceeds the learning rate, which follows from the form of the gradient of the hinge loss and from the fact that all the vectors are distinct.
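The padding convention and the identity $\mathrm{E}[s(w)] = w/B$ are easy to check numerically. The sketch below (toy data and names of our own choosing) draws many samples of $s(w)$ and compares their average with $w/B$.

```python
import numpy as np

# A sketch of the RBP removal noise: s(w) = phi(x_I) * sigma_I with
# I ~ Unif(1, ..., B); zero terms pad the sum when the active set holds
# fewer than B vectors.
def sample_s(phis, sigmas, B, rng):
    I = rng.integers(B)                  # uniform over the B slots
    if I >= len(phis):                   # a padded zero term was drawn
        return np.zeros_like(phis[0])
    return phis[I] * sigmas[I]

rng = np.random.default_rng(0)
phis = [rng.normal(size=3) for _ in range(4)]    # 4 support vectors, B = 10
sigmas = [1.0, -1.0, 1.0, -1.0]
w = sum(p * s for p, s in zip(phis, sigmas))
avg = np.mean([sample_s(phis, sigmas, 10, rng) for _ in range(100000)], axis=0)
print(avg, w / 10)                               # the two should nearly agree
```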
3.1. The Randomized Budget Perceptron

We will now cast the RBP as an instance of online convex programming. The modified RBP (Algorithm 1) differs from the original RBP in two ways, in order to be more compatible with Lemma 1. First, it prevents the weight vector from becoming too large (using the projection $\Pi_C$); and second, it occasionally trims the active set even when its size is smaller than $B$, because $s(w)$ is subtracted even when the active set is smaller than $B$. However, when the active set is small, $s(w)$ is likely to be zero.

Here is the setup. We are given $B$, the budget; we will determine $\lambda$, $\eta$, and $U$ as the proof goes on. We are also given an arbitrary sequence of points $(x_1, y_1), \dots, (x_T, y_T)$ that satisfy $\|\phi(x_i)\| \le 1$ for every $i$. First, the convex domain $C$ is $\{w : \|w\| \le U/2\}$, so the diameter of $C$ is $U$. Second, the functions to be minimized are $f_i(w) = L(w; (\phi(x_i), y_i)) + \lambda \cdot \|w\|^2/2$. We will write $L_i(w)$ for $L(w; (\phi(x_i), y_i))$.

Consider Algorithm 1. Once we choose $\lambda$, the choice of the learning rate $\eta$ is constrained. Indeed, we wish to replace update steps of the form

$$\Delta_i = -\eta \cdot \nabla f_i(w_i) = -\eta \cdot \nabla L_i(w_i) - \eta\lambda \cdot w_i$$

with the random update

$$\hat\Delta_i = -\eta \cdot \nabla L_i(w_i) - s(w_i).$$

In order for $\hat\Delta_i$ to satisfy $\mathrm{E}[\hat\Delta_i] = \Delta_i$, we must have $\eta\lambda = 1/B$, since $\mathrm{E}[s(w)] = w/B$.

We will now show how to apply an idea of (Flaxman et al., 2005) to our setting in a way that considers only a subset of the training points. Consider a sequence of noise variables $n_1, \dots, n_T$, defined as the noise in our gradients. Specifically, as we run the modified RBP on the a-priori known sequence of training points, we have noise in our gradients whenever we make an error. If the algorithm makes a mistake on example $x_i$, then let $n_i = -\hat\Delta_i/\eta + \Delta_i/\eta$ be the noise; that is, $n_i$ is exactly the amount of noise we add to the gradient $\nabla f_i(w_i)$ in order to obtain the noisy gradient $-\hat\Delta_i/\eta$ (note that $-\Delta_i/\eta = \nabla f_i(w_i)$). If the algorithm makes no error, let $n_i = 0$. As a result, $n_i$ satisfies $\mathrm{E}[n_i \mid n_1, \dots, n_{i-1}] = 0$.

For each $i$, define the function $h_i(w) = f_i(w) + (w - w_i)^\top n_i$. Then $h_i(w_i) = f_i(w_i)$ and $\mathrm{E}[h_i(w)] = f_i(w)$ for all $w$; the functions $h_i$ are convex, and $\nabla h_i(w) = \nabla f_i(w) + n_i$. Running the modified RBP is equivalent to running online gradient descent with projections $\Pi_C$ on the functions $h_{m(j)}$ that correspond to the examples on which the RBP made an error (here $m(j)$ denotes the index of the example on which the $j$-th of the $M$ errors was made), so Lemma 1 applies to the functions $h_{m(j)}$ for $1 \le j \le M$ with the vectors $w_{m(j)}$.

Using the fact that the hinge loss is at least 1 whenever the algorithm makes an error, as well as applying Lemma 1, we get

$$M \le \sum_{j=1}^M f_{m(j)}(w_{m(j)}) = \sum_{j=1}^M h_{m(j)}(w_{m(j)}) \le \sum_{j=1}^M h_{m(j)}(w_{m(j)}^\ast) + \frac{U^2}{2\eta} + \frac{3 \cdot U S_M}{2\eta} + \frac{M \eta \cdot G^2}{2} + U \cdot E$$

where the second inequality is due to Lemma 1, $S_M$ is the shift of the vectors $\{w_{m(j)}^\ast\}_{j=1}^M$, and $G$ is greater than $\|\nabla h_{m(j)}(w_{m(j)})\|$ for all $j$. But we know that $h_i(w) = L_i(w) + \lambda \cdot \|w\|^2/2 + (w - w_i)^\top n_i$, that $\|w\|^2 \le U^2/4$ on $C$, and that $E = 0$ (the gradient errors vanish because the noise is already part of the $h_i$). Using this knowledge, we can upper bound the number of errors by