The Margin Perceptron with Unlearning

Constantinos Panagiotakopoulos costapan@eng.auth.gr
Petroula Tsampouka petroula@gen.auth.gr
Physics Division, School of Technology, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We introduce into the classical Perceptron algorithm with margin a mechanism of unlearning which, in the course of the regular update, allows for a reduction of possible contributions from "very well classified" patterns to the weight vector. The resulting incremental classification algorithm, called Margin Perceptron with Unlearning (MPU), provably converges in a finite number of updates to any desired approximation, chosen before running, of either the maximal margin or the optimal 1-norm soft margin solution. Moreover, an experimental comparative evaluation involving representative linear Support Vector Machines reveals that the MPU algorithm is very competitive.

1. Introduction

Support Vector Machines (SVMs) (Boser et al., 1992; Vapnik, 1998; Cristianini & Shawe-Taylor, 2000) have been extensively used as linear classifiers, either in the space where the patterns originally reside or in high dimensional feature spaces induced by kernels. SVMs appear to be very successful at addressing the problem of minimising an objective function involving the empirical risk while at the same time keeping low the complexity of the classifier. As measures of the empirical risk various quantities have been proposed, with the 1- and 2-norm loss functions being the most widely accepted ones, giving rise to the optimisation problems known as L1- and L2-SVMs (Cortes & Vapnik, 1995). In the case that the 2-norm loss function takes the place of the empirical risk, an equivalent formulation exists which renders the dataset linearly separable in a high dimensional feature space. Therefore, one can either solve the general optimisation problem consisting of two terms, namely the capacity term and the 2-norm loss, or just attempt to minimise the capacity term in the high dimensional space as long as it ensures zero loss for the patterns mapped in that space. In the second treatment minimisation of the capacity term coincides with maximisation of the margin. SVMs traditionally follow the second approach.

On the other hand, solving the maximum margin problem has also been the goal of various Perceptron-like algorithms (Li & Long, 2002; Gentile, 2001; Tsampouka & Shawe-Taylor, 2007). The key characteristic of such algorithms is that they work in the primal space in an online manner, i.e. processing one example at a time. Cycling repeatedly through the patterns, they update their internal state stored in the weight vector each time a misclassification condition is satisfied by a pattern. More specifically, the Perceptron-like algorithms mentioned above are able to attain any approximation of the maximum margin, thereby producing infinitely close approximations of the L2-SVM solution. On the contrary, the classical Perceptron algorithm with margin (Duda & Hart, 1973) provably achieves only up to 1/2 of the maximum margin (Krauth & Mézard, 1987).

The weight vector of the Perceptron algorithm (Rosenblatt, 1958) is built progressively by adding to it the patterns found misclassified, multiplied by the corresponding labels. This procedure is characterised as learning. However, by introducing in the present work what we call unlearning, a mechanism allowing for a reduction or even elimination of the contribution that some patterns have to the solution weight vector when they are "very well classified", we find that the Perceptron's inability to provably achieve maximum margin is remedied. The possibility that the contribution of a given pattern to the current solution may increase or decrease with running time is very common in SVM algorithms, although the procedure by which this is achieved does not tempt one to pay particular attention to a clear-cut distinction between learning and unlearning.

Since the L1-SVM problem is not known to admit an equivalent maximum margin interpretation via a mapping to an appropriate space, Perceptron-like algorithms appeared so far unable to deal with such a task. In early attempts (Guyon & Stork, 1999) at addressing the L1-SVM problem with a Perceptron, an upper bound was set on the number of updates associated with any pattern before that pattern ended up being ignored in the process, thereby establishing linear separability artificially. The defect of the method was the impossibility for the ignored patterns to re-enter the process, rendering the effect of any early updates performed towards wrong directions totally irreversible. As a consequence, the outcome lacked the uniqueness dictated by optimisation theory, since it depended heavily on the sequence according to which the patterns were presented to the algorithm. Here, we follow a procedure which does not suffer from the previous restrictions and through which the Perceptron is able, in a finite number of steps, to achieve any desired approximation, fixed before running, of the optimal solution. Again, the mechanism of unlearning may be identified as the decisive factor in phasing out the misleading effect that some patterns have, mostly during early stages of the algorithm. The decision of whether the contribution of a pattern to the weight vector should be reduced concerns only the pattern that is currently presented to the algorithm.
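For concreteness, the two soft-margin objectives referred to above may be written out as follows. The L1 form matches, up to the value of the penalty parameter, the objective $J(w, C_p)$ analysed later in Section 3 (with $y_k$ the $k$-th pattern pre-multiplied by its label); the L2 form is the standard one and is shown only for comparison.

```latex
J_{L1}(w, C) \;=\; \tfrac{1}{2}\|w\|^2 \;+\; C \sum_{k} \max\{1 - w \cdot y_k,\, 0\},
\qquad
J_{L2}(w, C) \;=\; \tfrac{1}{2}\|w\|^2 \;+\; C \sum_{k} \max\{1 - w \cdot y_k,\, 0\}^2 .
```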
Consequently, the mechanism of unlearning becomes vacuous if our algorithm is implemented in a strictly online setting in which all examples are considered to be distinct. This reveals an important difference from the Budget Perceptron (Crammer et al., 2003), an online algorithm featuring a removal of past examples from the expression of the weight vector which, however, only takes place whenever the pattern currently presented to the algorithm is inserted. Furthermore, in a batch setting we aim at convergence to the known SVM solution.

The paper is organised as follows. Section 2 describes the algorithm. Section 3 is devoted to a theoretical analysis. In Section 4 we give details concerning the implementation of our algorithm, while in Section 5 we deliver our experimental results. Finally, Section 6 contains our conclusions.

2. The Margin Perceptron Algorithm with Unlearning

Given a training set which may already be mapped into a feature space of higher dimensionality, we place all patterns in the same position at a distance in an additional dimension, thereby constructing an embedding of our data into the so-called augmented space (Duda & Hart, 1973). The advantage of this embedding is that the linear hypothesis in the augmented space becomes homogeneous. Throughout our discussion we assume that the augmented patterns are multiplied by the corresponding labels in order to allow for a uniform treatment of both categories of patterns. Also, $R \equiv \max_k \|y_k\|$, with $y_k$ the $k$-th augmented pattern multiplied by its label.

First we assume that the training set is linearly separable in the appropriate feature space. The relation characterising optimally correct classification of the training patterns by a weight vector $u$ of unit norm in the augmented space is

$$u \cdot y_k \ge d' \equiv \max_{\|u\|=1} \min_i \{u \cdot y_i\} \quad \forall k. \qquad (1)$$

Here $d'$ is the maximum margin in the augmented space with respect to hyperplanes passing through the origin, which should be distinguished from the maximum geometric margin in the non-augmented feature space.

The Margin Perceptron with Unlearning (MPU) at its $t$-th step maintains two vectors. The first is the usual augmented weight vector $a_t$ in the primal representation, having the dimensionality $M$ of the feature space. The second is a vector of non-negative integer-valued components $I_k^t$ and dimensionality equal to the number $N$ of patterns, which keeps track of the update history of the algorithm. Both are initially set to zero, i.e. $a_0 = 0$ and $I_k^0 = 0 \;\forall k$, and are updated according to the rule

$$a_{t+1} = a_t + \epsilon_t y_k, \qquad I_k^{t+1} = I_k^t + \epsilon_t \qquad (2)$$

each time the training pattern $y_k$ presented to the algorithm satisfies either one of the conditions appearing in the following definition of $\epsilon_t$:

$$\epsilon_t = +1, \text{ if } a_t \cdot y_k \le b; \qquad \epsilon_t = -1, \text{ if } a_t \cdot y_k \ge b + \delta b \text{ and } I_k^t > 0. \qquad (3)$$

Here $b$ and $\delta b$ are positive constants. We shall refer to an update with $\epsilon_t = +1$ as a learning update (step), whereas to an update with $\epsilon_t = -1$ as an unlearning update (step). Thus, the variable $I_k^t$ is equal to the difference between the number of learning and unlearning steps associated with the $k$-th pattern. The total number of learning or unlearning steps will be denoted $t^+$ or $t^-$, respectively. Obviously, the total number of updates $t$ is

$$t = t^+ + t^- \qquad (4)$$

and

$$\sum_k I_k^t = \sum_{\tau=0}^{t-1} \epsilon_\tau = t^+ - t^-. \qquad (5)$$

It should be clear that with the variables $I_k^t$ produced by the algorithm as described above the weight vector admits the so-called dual representation

$$a_t = \sum_k I_k^t y_k \qquad (6)$$

as a linear combination with positive coefficients of the training patterns. Consequently, the $I_k^t$'s play the role of the dual variables. A learning step takes place whenever the pattern $y_k$ satisfies the misclassification condition $a_t \cdot y_k \le b$, meaning that $y_k$ is either a classification or a margin error. We shall refer to such a pattern as misclassified. Instead, an unlearning step takes place whenever the pattern $y_k$ is "very well classified", i.e. $a_t \cdot y_k \ge b + \delta b$, and provided this pattern does have a non-vanishing contribution to the dual representation (6) of the current weight vector. Thus, a prerequisite for unlearning is that learning has previously taken place involving the pattern in question. Moreover, unlearning may be regarded as a mechanism preventing the dual representation (6) from expanding unnecessarily. Notice that, due to the "discrete" nature of the update rule, the dual variables $I_k^t$ remain naturally non-negative without having to be clipped, as in the case of SVMs.

An important feature of the algorithm is the existence of the gap $\delta b$ between the threshold $b$ of the misclassification condition $a_t \cdot y_k \le b$ and the threshold $b + \delta b$ of the condition $a_t \cdot y_k \ge b + \delta b$ for exceedingly good classification. Actually, it is not very difficult to see that such a gap should necessarily satisfy the constraint

$$\delta b > R^2 \qquad (7)$$

in order for the algorithm to have a chance of converging in a finite number of steps. Indeed, let us assume that the pattern $y_k$ presented to the algorithm satisfies $a_t \cdot y_k = b$. Then, a learning update takes place, as a result of which the weight vector becomes $a_{t+1} = a_t + y_k$. Now, consider a situation in which the same pattern is presented once again to the algorithm without another update having intervened. We have $a_{t+1} \cdot y_k = b + \|y_k\|^2$ and $I_k^{t+1} > 0$. Therefore, if $\delta b \le \|y_k\|^2$ an unlearning step will take place which will exactly cancel the effect of the learning step. This procedure may be repeated again and again, leading to a closed loop. If, instead, $\delta b > \|y_k\|^2 \;\forall k$, i.e. $\delta b > R^2$, such unwanted situations are naturally avoided.

The same condition also ensures that $t^+ > t^-$ if $t > 0$. Obviously, one step before we arrive at the situation where the number of learning steps equals the number of unlearning steps we should have $t^+ = t^- + 1$, meaning that all dual variables vanish except for one which takes the value 1. So, let us assume that this happens at the $t$-th step of the algorithm and that the only non-vanishing dual variable is associated with the $k$-th pattern, i.e. $I_k^t = 1$ holds. Then, $a_t = y_k$, from where $a_t \cdot y_k = \|y_k\|^2 < \delta b < b + \delta b$, forbidding the unlearning step that would lead to equality between learnings and unlearnings.

In the case that the assumption of linear separability in the feature space is relaxed, we impose the further condition that a learning update is allowed provided that the dual variable $I_k^t$ associated with the misclassified pattern $y_k$ satisfies the constraint $I_k^t < I$, where $I$ is a positive integer. In such a case the update rule (2) remains the same but $\epsilon_t$ is now defined by

$$\epsilon_t = +1, \text{ if } a_t \cdot y_k \le b \text{ and } I_k^t < I; \qquad \epsilon_t = -1, \text{ if } a_t \cdot y_k \ge b + \delta b \text{ and } I_k^t > 0. \qquad (8)$$

Notice that, again as a result of the "discrete" nature of the update rule, the dual variables naturally satisfy the condition

$$0 \le I_k^t \le I \quad \forall k \qquad (9)$$

without any need for clipping. Setting $I = \infty$ in (8) we recover (3).

Algorithm 1 The Margin Perceptron with Unlearning
  Input: A dataset $S = (y_1, \ldots, y_k, \ldots, y_N)$ with $y_k$ the $k$-th augmented pattern multiplied by its label
  Define: $R \equiv \max_k \|y_k\|$
  Require: $I$, $b > 0$, $\delta b > R^2$
  Initialise: $t = 0$, $a_0 = 0$, $I_k^0 = 0 \;\forall k$
  repeat
    for $k = 1$ to $N$ do
      $p_{tk} = a_t \cdot y_k$
      if ($p_{tk} \le b$ and $I_k^t < I$) then
        $a_{t+1} = a_t + y_k$;  $I_k^{t+1} = I_k^t + 1$;  $t \leftarrow t + 1$
      else if $I_k^t = 0$ then
        continue
      else if $p_{tk} \ge b + \delta b$ then
        $a_{t+1} = a_t - y_k$;  $I_k^{t+1} = I_k^t - 1$;  $t \leftarrow t + 1$
      end if
    end for
  until no update made within the for loop
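The update rule (2) with the conditions of (8) can be sketched in a few lines. The following is a minimal illustration written for this article, not the authors' implementation: the function name and defaults are mine, the data are assumed to be already augmented and label-multiplied, and the active-set and multiple-update machinery of Section 4 is omitted.

```python
import numpy as np

def mpu(Y, b, db, I=np.inf, max_epochs=1000):
    """Margin Perceptron with Unlearning (minimal sketch).

    Y  : (N, M) array; row k is the k-th augmented pattern already
         multiplied by its label, so 'correct' means a . y_k > 0.
    b  : threshold of the misclassification condition a . y_k <= b.
    db : gap delta_b of the unlearning condition a . y_k >= b + db;
         convergence requires db > R^2 = max_k ||y_k||^2.
    I  : cap on the dual variables (I = inf gives the separable rule (3)).
    """
    N, M = Y.shape
    a = np.zeros(M)       # primal weight vector a_t
    Ik = np.zeros(N)      # dual variables I_k^t
    for _ in range(max_epochs):
        updated = False
        for k in range(N):
            p = a @ Y[k]
            if p <= b and Ik[k] < I:            # learning step
                a += Y[k]; Ik[k] += 1; updated = True
            elif Ik[k] > 0 and p >= b + db:     # unlearning step
                a -= Y[k]; Ik[k] -= 1; updated = True
        if not updated:
            break                               # converged
    return a, Ik
```

On a separable toy set with $\delta b$ chosen just above $R^2$, the sketch converges and, as in the analysis, every pattern ends with $a \cdot y_k > b$ while $a$ equals its dual expansion $\sum_k I_k y_k$.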
Finally, in the case that the mapping into the feature space is induced by a kernel and a primal representation of $a_t$ is unavailable, it suffices to update the dual variables $I_k^t$ according to (2) and perform the inner products using the dual representation (6) of $a_t$.

3. Theoretical Analysis

We first consider a linearly separable training set possessing maximum margin $d'$ in the augmented space with respect to hyperplanes passing through the origin.

Theorem 1 The Margin Perceptron with Unlearning with $I = \infty$ converges in a finite number $t_c$ of updates satisfying the bound

$$t_c \le \frac{R^2 + 2b}{d'^2}\left(1 + \frac{R^2 + 2b}{4(\delta b - R^2)}\right). \qquad (10)$$

Moreover, the zero-threshold solution hyperplane possesses margin $d'_c$ which is a fraction $f$ of the maximum margin $d'$ obeying the inequality

$$f \equiv d'_c/d' > \left(1 + \min\{\delta b,\, b + R^2\}/b\right)^{-1}. \qquad (11)$$

Finally, an after-running lower bound on $f$, involving the margin $d'_c$ achieved, the length $\|a_{t_c}\|$ of the solution weight vector $a_{t_c}$ and the final difference $(t_c^+ - t_c^-)$ between the number of learning and unlearning steps, is given by

$$f = d'_c/d' \ge d'_c\,(t_c^+ - t_c^-)/\|a_{t_c}\|. \qquad (12)$$

Proof Taking the inner product of (6) with the optimal direction $u$ and using (1) and (5) we obtain

$$\|a_t\| \ge a_t \cdot u = \sum_k I_k^t (y_k \cdot u) \ge d' \sum_k I_k^t = d'(t^+ - t^-). \qquad (13)$$

From the update rule (2), taking into account the conditions under which the update takes place and the definition of $\epsilon_t$, we get

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \|y_k\|^2 + 2\epsilon_t\, a_t \cdot y_k \le R^2 + 2\epsilon_t b - \delta b\,(1 - \epsilon_t),$$

a repeated application of which, using also (4) and (5) and the initial condition $a_0 = 0$, gives

$$\|a_t\|^2 \le R^2 t + 2b \sum_{\tau=0}^{t-1} \epsilon_\tau - \delta b \sum_{\tau=0}^{t-1} (1 - \epsilon_\tau) = R^2 t + 2b(t^+ - t^-) - 2\delta b\, t^- = (R^2 + 2b)(t^+ - t^-) - 2(\delta b - R^2)\, t^-. \qquad (14)$$

Combining (13) and (14) we obtain a Novikoff-like (Novikoff, 1962) squeezing relation

$$d'^2 (t^+ - t^-)^2 \le \|a_t\|^2 \le (R^2 + 2b)(t^+ - t^-) - 2(\delta b - R^2)\, t^-. \qquad (15)$$

From (15) we get

$$2(\delta b - R^2)\, t^- \le (R^2 + 2b)(t^+ - t^-) - d'^2 (t^+ - t^-)^2 \le (R^2 + 2b)^2/4d'^2$$

by making use of the fact that the quantity $(R^2 + 2b)(t^+ - t^-) - d'^2 (t^+ - t^-)^2$ acquires a maximum with respect to $(t^+ - t^-)$ for $(t^+ - t^-) = (R^2 + 2b)/2d'^2$. Then, taking into account (7), we immediately obtain

$$t^- \le \frac{(R^2 + 2b)^2}{8 d'^2 (\delta b - R^2)}. \qquad (16)$$

Again from (15), by dropping the last term in the upper bound on $\|a_t\|^2$, we have $d'^2 (t^+ - t^-)^2 \le (R^2 + 2b)(t^+ - t^-)$, or

$$t^+ - t^- \le (R^2 + 2b)/d'^2. \qquad (17)$$

Combining (16) and (17), and using $t = (t^+ - t^-) + 2t^-$, we obtain the upper bound on the number $t$ of updates given by the r.h.s. of (10).

Upon convergence of the algorithm in $t_c$ updates it is obvious that all the patterns $y_k$ which have a non-vanishing contribution to the expansion (6) of the solution weight vector $a_{t_c}$ necessarily have their inner product with $a_{t_c}$ in the interval $(b, b + \delta b)$, i.e.

$$b < a_{t_c} \cdot y_k < b + \delta b, \quad \text{if } I_k^{t_c} > 0. \qquad (18)$$

From (6) with $t = t_c$ we get

$$\|a_{t_c}\|^2 = a_{t_c} \cdot \sum_k I_k^{t_c} y_k = \sum_k I_k^{t_c}\, a_{t_c} \cdot y_k$$

or, taking into account (5) and (18),

$$b\,(t_c^+ - t_c^-) < \|a_{t_c}\|^2 < (b + \delta b)(t_c^+ - t_c^-). \qquad (19)$$

Since upon convergence all patterns violate the misclassification condition, we have $a_{t_c} \cdot y_k > b \;\forall k$, or equivalently $u_{t_c} \cdot y_k > b/\|a_{t_c}\|$ with $u_{t_c} \equiv a_{t_c}/\|a_{t_c}\|$. Stated differently, the margin achieved satisfies

$$d'_c = \min_k \{u_{t_c} \cdot y_k\} > b/\|a_{t_c}\|. \qquad (20)$$

Additionally, from (13), $1/d' \ge (t_c^+ - t_c^-)/\|a_{t_c}\|$. Thus, we obtain

$$f = d'_c/d' > (b/\|a_{t_c}\|)/d' \ge b\,(t_c^+ - t_c^-)/\|a_{t_c}\|^2. \qquad (21)$$

If we now make use in (21) of the upper bound on $\|a_{t_c}\|^2$ from (19), we get $f > (1 + \delta b/b)^{-1}$. If, instead, we employ (15) to obtain the bound $\|a_{t_c}\|^2 \le (R^2 + 2b)(t_c^+ - t_c^-)$, we get $f > (2 + R^2/b)^{-1}$. The above lower bounds on $f$, written compactly, lead to (11). Finally, (12) follows readily from the inequality $1/d' \ge (t_c^+ - t_c^-)/\|a_{t_c}\|$.

Remark 1 From (11) it becomes apparent that as $b \to \infty$ with $\delta b$ kept fixed the MPU algorithm with $I = \infty$ is guaranteed before running to converge to the maximum margin solution, i.e. $d'_c \to d'$. If, instead, we first take the limit $\delta b \to \infty$ (thereby recovering the classical Perceptron with margin) and then the limit $b \to \infty$, we get $d'_c > d'/2$.

Remark 2 The lower bound on $\|a_{t_c}\|^2$ from (15) together with the upper bound from (19) lead to $t_c^+ - t_c^- < (b + \delta b)/d'^2$. Also, from $d' \ge d'_c > b/\|a_{t_c}\|$ we get $\|a_{t_c}\|^2 > b^2/d'^2$, which combined with the upper bound on $\|a_{t_c}\|^2$ from (19) leads to $t_c^+ - t_c^- > b^2 (b + \delta b)^{-1} d'^{-2}$. Thus, upon convergence of the MPU algorithm it holds that

$$(1 + \delta b/b)^{-1}\, b\, d'^{-2} < t_c^+ - t_c^- < (1 + \delta b/b)\, b\, d'^{-2},$$

meaning that $t_c^+ - t_c^-$ is strongly constrained. Moreover, $(t_c^+ - t_c^-)/b \to d'^{-2}$ as $b \to \infty$. This is the analog of the well-known result $\sum_k \alpha_k = d'^{-2}$ for the hard margin SVM, with $\alpha_k$ being the optimal dual variables.

Remark 3 As we already pointed out, the MPU algorithm in a strictly online setting reduces to the classical Perceptron with margin. Nevertheless, our derivation of the upper bound on the number of updates given by the r.h.s. of (10) does hold in the context of any related strictly online scenario, e.g. a variation of the variable-size cache Budget Perceptron, in the $t$-th step of which the insertion ($\epsilon_t = +1$) or the removal ($\epsilon_t = -1$) of the pattern $y_k$ is performed according to the rule (2) under the conditions stated in (3) (with $I_k^t \in \{0, 1\}$). Crucial in this respect is the validity of $\delta b > R^2$. The r.h.s. of (17) was shown in (Crammer et al., 2003) to be an upper bound on the variable cache size of the Budget Perceptron, but no mistake bound was obtained.

Now we relax the assumption of linear separability of the training set in the feature space and move on to the analysis of the MPU algorithm with $I < \infty$. Let us consider the "objective" function

$$J(w, C_p) \equiv \frac{1}{2}\|w\|^2 + C_p \sum_k \max\{1 - w \cdot y_k,\, 0\},$$

where $C_p > 0$ is a "penalty" parameter, and let $J_{\mathrm{opt}}(C_p) \equiv \min_w J(w, C_p)$ be its minimal value with $C_p$ fixed.

Theorem 2 The Margin Perceptron with Unlearning converges in a finite number $t_c$ of updates satisfying the bound

$$t_c \le N I\, \frac{2b + \delta b}{\delta b - R^2}. \qquad (22)$$

Then, with $a_{t_c}$ being the solution weight vector and provided $b > \delta b$,

$$\frac{J(a_{t_c}/b,\, I/b) - J_{\mathrm{opt}}(I/b)}{J_{\mathrm{opt}}(I/b)} < \frac{2\,\delta b/b}{1 - \delta b/b}. \qquad (23)$$

Moreover, the relative deviation from $J_{\mathrm{opt}}$ satisfies the after-running bound

$$\frac{J(a_{t_c}/b,\, I/b) - J_{\mathrm{opt}}(I/b)}{J_{\mathrm{opt}}(I/b)} \le \epsilon_{\mathrm{aft}} \equiv \frac{\frac{1}{2}\|a_{t_c}\|^2 + I \sum_k \max\{b - a_{t_c} \cdot y_k,\, 0\}}{b\,(t_c^+ - t_c^-) - \frac{1}{2}\|a_{t_c}\|^2} - 1. \qquad (24)$$

Proof From (5) and the restriction $I_k^t \le I$ we have

$$t^+ - t^- = \sum_k I_k^t \le N I. \qquad (25)$$

Also, from (14), taking into account (25), we get

$$2(\delta b - R^2)\, t^- \le (R^2 + 2b)(t^+ - t^-) \le N I (R^2 + 2b), \quad \text{i.e.} \quad t^- \le N I\, \frac{R^2 + 2b}{2(\delta b - R^2)},$$

which combined with (25) leads to the upper bound on the number $t$ of updates given by the r.h.s. of (22).

Let us assume that the algorithm has converged in $t_c$ updates to the solution weight vector $a_{t_c}$. We define

$$w_c \equiv \frac{a_{t_c}}{b}, \qquad \alpha_k^c \equiv \frac{I_k^{t_c}}{b} \quad \left(0 \le \alpha_k^c \le \frac{I}{b}\right), \qquad \epsilon \equiv \frac{\delta b}{b} < 1$$

and notice that only the $\alpha_k^c$'s with $k$ from the sets

$$K_1 = \{k : \alpha_k^c > 0,\; 1 < w_c \cdot y_k < 1 + \epsilon\}, \qquad K_2 = \{k : \alpha_k^c = I/b,\; w_c \cdot y_k \le 1\}$$

are non-zero, and consequently

$$w_c = \sum_k \alpha_k^c y_k = \sum_{k \in K_1} \alpha_k^c y_k + \sum_{k \in K_2} \alpha_k^c y_k. \qquad (26)$$

Then,

$$\|w_c\|^2 = w_c \cdot \sum_k \alpha_k^c y_k = \sum_{k \in K_1} \alpha_k^c\, w_c \cdot y_k + \sum_{k \in K_2} \alpha_k^c\, w_c \cdot y_k < (1 + \epsilon) \sum_{k \in K_1} \alpha_k^c + \sum_{k \in K_2} \alpha_k^c - \sum_{k \in K_2} \alpha_k^c (1 - w_c \cdot y_k) \le (1 + \epsilon) \sum_k \alpha_k^c - \frac{I}{b} \sum_k \max\{1 - w_c \cdot y_k,\, 0\}. \qquad (27)$$

Now, taking into account (27), we have

$$J(w_c, I/b) = \|w_c\|^2 - \frac{1}{2}\|w_c\|^2 + \frac{I}{b} \sum_k \max\{1 - w_c \cdot y_k,\, 0\} < (1 + \epsilon) \sum_k \alpha_k^c - \frac{1}{2} \sum_{i,j} \alpha_i^c \alpha_j^c\, y_i \cdot y_j. \qquad (28)$$

From the weak duality theorem it follows that

$$J(w, C_p) \ge L(\alpha) = \sum_k \alpha_k - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i \cdot y_j,$$

where $L(\alpha)$ is the dual Lagrangian and $0 \le \alpha_k \le C_p$. Thus, setting $C_p = I/b$ and $\alpha_k = \alpha_k^c$, we get

$$J(w_c, I/b) \ge J_{\mathrm{opt}}(I/b) \ge \sum_k \alpha_k^c - \frac{1}{2} \sum_{i,j} \alpha_i^c \alpha_j^c\, y_i \cdot y_j. \qquad (29)$$

Combining (28) with (29) we obtain

$$J(w_c, I/b) - J_{\mathrm{opt}}(I/b) < \epsilon \sum_k \alpha_k^c. \qquad (30)$$

Moreover, taking into account (27), we get

$$\sum_k \alpha_k^c - \frac{1}{2} \sum_{i,j} \alpha_i^c \alpha_j^c\, y_i \cdot y_j = \sum_k \alpha_k^c - \frac{1}{2}\|w_c\|^2 > \sum_k \alpha_k^c - \frac{1}{2}(1 + \epsilon) \sum_k \alpha_k^c = \frac{1}{2}(1 - \epsilon) \sum_k \alpha_k^c,$$

which combined with (29) leads to

$$J_{\mathrm{opt}}(I/b) > \frac{1}{2}(1 - \epsilon) \sum_k \alpha_k^c. \qquad (31)$$

From (30) and (31) we immediately obtain (23). Finally, using $\sum_k \alpha_k^c = \sum_k I_k^{t_c}/b = (t_c^+ - t_c^-)/b$, the definition of $J$ and the lower bound on $J_{\mathrm{opt}}$ from (29), we easily derive (24).

Remark 4 From the before-running "accuracy" $\epsilon$ of (23) it follows that as $b$ grows with $\delta b$ kept fixed the MPU algorithm is guaranteed to converge to solutions for which the objective function $J(a_{t_c}/b, I/b)$ approaches as close as one wishes the optimal one $J_{\mathrm{opt}}(C_p)$ with $C_p = I/b$. Thus, if we allow only $b$ to grow while keeping $\delta b$ and $I$ fixed, we will asymptotically approach an optimal objective with vanishing penalty parameter $C_p$. If, instead, our goal is to approach the optimal objective $J_{\mathrm{opt}}(C_p)$ with a given value of $C_p$, we should follow a limiting process in which both $I$ and $b$ increase with their ratio kept all the way equal to the desirable value of $C_p$. A simple way of doing so is to choose first a sufficiently large value for the integer $I$ and then determine $b$ from the relation $b = I/C_p$, with the value of $\delta b$ kept fixed. To select the appropriate value of $I$ which ensures before running the desired accuracy $\epsilon$ we should, of course, express $\epsilon$ in terms of $I$ and $C_p$ as $\epsilon = 2 C_p\, \delta b\, I^{-1} \left(1 - C_p\, \delta b\, I^{-1}\right)^{-1}$, from where $I = [C_p\, \delta b\, (2 + \epsilon)\, \epsilon^{-1}] + 1$ (with $[x]$ the integer part of $x \ge 0$). We conclude that the size of the upper bound $I$ on the dual variables in the MPU algorithm does not depend on the penalty parameter $C_p$ alone but on the required accuracy $\epsilon$ as well.

4. Implementation

To reduce the computational cost we construct a two-member sequence of reduced "active sets" of data points, with the second being a subset of the first. As we cycle once through the full dataset, the largest active set (first-level active set) is formed from the points of the full dataset either satisfying the condition $a_t \cdot y_k \le \bar{c}\, b$ with $\bar{c} \ge 1$ or having $I_k^t > 0$. The second-level active set is built as we cycle once through the first-level active set and comprises the points $y_k$ for which $I_k^t$ undergoes a change and remains positive. Of course, as we cycle through the full dataset or the first-level active set, along with the selection of points which form the first-level or the second-level active set, we do perform updates whenever a point satisfies the conditions either for learning or unlearning. The second-level active set is presented repetitively to the algorithm for $N_{\mathrm{ep2}}$ mini-epochs. Then the first-level active set becomes the set under consideration, which is presented once again to the algorithm. During each round involving the first-level set, a new second-level set is constructed and a new cycle of $N_{\mathrm{ep2}}$ passes begins. By invoking the first-level set for the $(N_{\mathrm{ep1}} + 1)$-th time we trigger the loading of the full dataset, and the procedure starts all over again until no update occurs as we run over the full dataset. It is understood, of course, that the maximum number of rounds $N_{\mathrm{ep1}}$ and $N_{\mathrm{ep2}}$ for each active set is not exhausted in the case that no update occurs during a round. Finally, following a common practice, every time we make use of the full dataset we actually employ a permuted instance of it, since presenting the points to the algorithm in a different order helps avoiding the disorientation caused by spurious patterns in the sequence of examples. Evidently, the whole procedure amounts to a different way of sequentially presenting the patterns to the algorithm and does not affect the validity of our analysis.

An additional mechanism providing a very substantial improvement of the computational efficiency is that of performing multiple updates once a data point is presented to the algorithm.¹ It is understood, of course, that in order for a multiple update to be compatible with our theoretical analysis it should be equivalent to a certain number of updates occurring as a result of repeatedly presenting to the algorithm the point in question. In the case of learning we are allowed to perform $\min\{\lambda, I - I_k^t\}$ learning updates at once, with $\lambda = [(b - a_t \cdot y_k)\, \|y_k\|^{-2}] + 1$. Indeed, for any positive integer $n < \min\{\lambda, I - I_k^t\}$ and $a_{t+n} = a_t + n y_k$ it holds that

$$a_{t+n} \cdot y_k = a_t \cdot y_k + n \|y_k\|^2 \le a_t \cdot y_k + (\lambda - 1) \|y_k\|^2 \le a_t \cdot y_k + (b - a_t \cdot y_k) \|y_k\|^{-2} \|y_k\|^2 = b,$$

meaning that after $n$ consecutive updates the pattern $y_k$ is still misclassified with $I_k^{t+n} = I_k^t + n < I$, and consequently one more learning update involving that pattern could legitimately take place. Similarly, in the case of unlearning we can show that we are allowed to perform $\min\{\lambda, I_k^t\}$ unlearning updates at once, with $\lambda = [(a_t \cdot y_k - (b + \delta b))\, \|y_k\|^{-2}] + 1$.

¹The MPU with multiple updates bears some resemblance to the algorithm of (Hsieh et al., 2008).

Table 1. Results of a high accuracy comparative study of the MPU, DCD, SVMperf and Pegasos algorithms.

| data set  | MPU J   | MPU Secs | DCD ε | DCD J   | DCD Secs | SVMperf ε | SVMperf J | SVMperf Secs | Pegasos iter | Pegasos J | Pegasos Secs |
|-----------|---------|----------|-------|---------|----------|-----------|-----------|--------------|--------------|-----------|--------------|
| Adult     | 11434.4 | 0.4      | 0.06  | 11434.4 | 0.3      | 0.004     | 11434.4   | 9.3          | 3 × 10⁷      | 11434.9   | 397.0        |
| Web       | 6605.17 | 0.1      | 0.07  | 6605.15 | 0.1      | 0.001     | 6605.22   | 11.1         | 2 × 10⁷      | 6605.31   | 199.7        |
| Physics   | 49643.6 | 0.4      | 0.05  | 49642.6 | 2.0      | 0.02      | 49644.1   | 2.2          | 5 × 10⁵      | 49644.2   | 13.3         |
| Real-sim  | 5402.25 | 0.5      | 0.01  | 5402.22 | 0.4      | 0.001     | 5402.58   | 6.1          | 2 × 10⁷      | 5402.53   | 395.3        |
| News20    | 2562.57 | 4.9      | 0.003 | 2562.57 | 4.2      | 0.001     | 2562.71   | 659.6        | 1 × 10⁷      | 2562.69   | 1673.0       |
| Webspam   | 69592.1 | 3.1      | 0.07  | 69591.4 | 2.8      | 0.001     | 69592.9   | 10.3         | 9 × 10⁶      | 69599.6   | 297.7        |
| Covertype | 340490  | 22.2     | 0.08  | 340491  | 46.6     | 0.008     | 340493    | 505.4        | 1 × 10⁸      | 340505    | 2447.9       |
| C11       | 44725.4 | 5.7      | 0.02  | 44725.3 | 5.1      | 0.001     | 44730.5   | 29.2         | 1 × 10⁸      | 44730.3   | 2743.6       |
| CCAT      | 92990.2 | 8.1      | 0.02  | 92989.3 | 7.9      | 0.001     | 92992.8   | 54.3         | 1 × 10⁸      | 93000.5   | 3103.6       |
Obviously, the gain from the mechanism of multiple updates lies in the fact that at the expense of a single inner product calculation we effectively perform numerous updates.

Finally, we point out that we may terminate the MPU algorithm with $I < \infty$ before it converges when the after-running accuracy $\epsilon_{\mathrm{aft}}$ entering (24) (with $a_{t_c}$ and $(t_c^+ - t_c^-)$ now being the final weight vector and the final value of $(t^+ - t^-)$, respectively) falls below a certain desirable level $\epsilon_{\mathrm{stop}}$ at the end of a complete cycle through the full dataset. Notice that the derivation of (24) relies only on the validity of (5) and (9), which are always respected by the algorithm. Moreover, if the guaranteed before-running accuracy $\epsilon$ of (23) is equal to or smaller than $\epsilon_{\mathrm{stop}}$, it is obvious that, despite the early stopping, the algorithm will necessarily achieve the required accuracy $\epsilon_{\mathrm{stop}}$ in at most the number of updates given by the r.h.s. of (22). Similarly, if $I = \infty$ we may terminate the MPU algorithm as soon as the lower bound of (12) on $f$ exceeds the r.h.s. of (11).

5. Experiments

We concentrate on the more popular L1-SVM problem and compare the MPU algorithm with SVMs on the basis of their ability to arrive fast at a certain approximation of the optimal 1-norm soft margin solution. Moreover, in order to be able to train on rather large datasets, we choose as feature space the initial instance space. Recently, a lot of effort has been devoted to developing algorithms which are able to train fast on large datasets provided linear kernels are employed. Among such algorithms we choose for our experiments the Dual Coordinate Descent (DCD) algorithm (Hsieh et al., 2008), SVMperf (Joachims, 2006), the first cutting-plane algorithm for training linear SVMs, and Pegasos (Shalev-Schwartz et al., 2007), a stochastic gradient descent algorithm. The source for DCD (version 1.5) is available at http://www.csie.ntu.edu.tw/~cjlin/liblinear, for SVMperf (version 2.50) at http://svmlight.joachims.org and for Pegasos at http://ttic.uchicago.edu/shai/code.

The datasets we used for training are the binary Adult (N=32561, M=123) and Web (N=49749, M=300) UCI datasets as compiled by Platt, the training set of the KDD04 Physics dataset (N=50000, M=70 after removing the 8 columns containing missing features) obtainable from http://kodiak.cs.cornell.edu/kddcup/datasets.html, the Real-sim (N=72309, M=20958), News20 (N=19996, M=1355191) and Webspam (unigram treatment with N=350000 and M=254) datasets, all available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, the multiclass Covertype (N=581012, M=54) UCI dataset and the full Reuters RCV1 dataset (N=804414, M=47236) obtainable from http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm. In the case of the Covertype dataset we study the binary classification problem of the first class versus rest, while for RCV1 we consider both the binary text classification tasks of the C11 and CCAT classes versus rest. The Physics and Covertype datasets were rescaled by multiplying all the features with 0.001.

The experiments were conducted on a 2.5 GHz Intel Core 2 Duo processor with 3 GB RAM running Windows Vista. Our codes, written in C++, were compiled using the g++ compiler under Cygwin. In our experiments we chose $C_p = 1$. Also, we did not include any bias term (i.e. we did not perform any augmentation). The implementation of the MPU algorithm followed closely the description of Section 4, with active set parameters $\bar{c} = 1.01$, $N_{\mathrm{ep1}} = 3$ and $N_{\mathrm{ep2}} = 10$. Moreover, we did make use of the mechanism of early stopping. We also set $\delta b = 3R^2$ throughout. For Pegasos we chose $k = 30$ and $\lambda = N^{-1}$ (since $(\lambda N)^{-1}$ plays the role of $C_p$).

In the first experiment our procedure was to obtain a very good approximation of the optimal objective $J_{\mathrm{opt}}$ (i.e. within about $10^{-4}$ of it) using MPU with $\epsilon = 10^{-5}$ and $\epsilon_{\mathrm{stop}} = 10^{-4}$, and then try to achieve comparable values of $J$ with the other linear SVM solvers. For DCD and SVMperf we employed the "accuracy" $\epsilon$ ($\epsilon \le 0.1$), whereas for Pegasos the number of iterations iter, in order to obtain the required value of $J$. Table 1 contains our experimental results (objective $J$ and training runtime) for this high accuracy comparative study. SVMperf is clearly considerably slower than DCD and MPU but much faster than Pegasos. DCD, instead, appears to be statistically the fastest, but still its runtimes are very close to the ones of MPU.

We also carried out a second experiment aiming at obtaining a value of $J$ which deviates from $J_{\mathrm{opt}}$ by 1%. The results of this low accuracy study are reported in Table 2. We see that DCD and MPU are still the fastest, with close runtimes, followed by SVMperf. The remarkable finding is, however, the spectacular improvement that Pegasos presents. In the light of our results we may conclude that the MPU algorithm is very competitive.

Table 2. Results of a low (1%) accuracy comparative study of the MPU, DCD, SVMperf and Pegasos algorithms. We report training times in secs, keeping the dataset ordering of Table 1.

| data set  | MPU | DCD | SVMperf | Pegasos |
|-----------|-----|-----|---------|---------|
| Adult     | 0.1 | 0.2 | 1.5     | 4.0     |
| Web       | 0.1 | 0.1 | 0.3     | 1.0     |
| Physics   | 0.1 | 0.1 | 0.3     | 0.3     |
| Real-sim  | 0.3 | 0.2 | 0.8     | 3.2     |
| News20    | 1.6 | 0.6 | 12.8    | 10.6    |
| Webspam   | 1.2 | 1.4 | 7.4     | 13.7    |
| Covertype | 6.2 | 9.1 | 76.5    | 141.9   |
| C11       | 3.1 | 3.5 | 12.9    | 27.9    |
| CCAT      | 3.7 | 3.6 | 19.3    | 31.6    |

6. Conclusions

By enriching the classical Margin Perceptron with a mechanism of unlearning which allows the algorithm to recover from any wrong decisions taken on the way, we succeeded in tackling both the hard margin and the 1-norm soft margin problems. We derived upper bounds on the number of steps required in order for the algorithm to achieve any desired approximation, fixed before running, of the optimal solution for both tasks, and described strategies for its efficient implementation. Finally, the experiments provided evidence that the new algorithm is a very competitive linear SVM.

References

Boser, B., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. In COLT, pp. 144–152, 1992.

Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20:273–297, 1995.

Crammer, K., Kandola, J., and Singer, Y. Online classification on a budget. In NIPS 16, 2003.

Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, 1973.

Gentile, C. A new approximate maximal margin classification algorithm. JMLR, 2:213–242, 2001.

Guyon, I. and Stork, D. Linear discriminant and support vector classifiers. In Advances in Large Margin Classifiers. MIT Press, 1999.

Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.

Joachims, T. Training linear SVMs in linear time. In KDD'06, pp. 217–226. ACM Press, 2006.

Krauth, W. and Mézard, M. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:L745–L752, 1987.

Li, Y. and Long, P. The relaxed online maximum margin algorithm. Machine Learning, 46:361–387, 2002.

Novikoff, A. B. J. On convergence proofs on perceptrons. In Proc. Symp. Math. Theory Automata, Vol. 12, pp. 615–622, 1962.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

Shalev-Schwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pp. 807–814, 2007.

Tsampouka, P. and Shawe-Taylor, J. Approximate maximum margin algorithms with rules controlled by the number of mistakes. In ICML, pp. 903–910, 2007.

Vapnik, V. Statistical Learning Theory. Wiley, 1998.