The Projectron: a Bounded Kernel-Based Perceptron

Francesco Orabona, Joseph Keshet, Barbara Caputo
IDIAP Research Institute, Centre du Parc, CH-1920 Martigny, Switzerland
Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland
forabona@idiap.ch, jkeshet@idiap.ch, bcaputo@idiap.ch

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We present a discriminative online algorithm with bounded memory growth, based on the kernel-based Perceptron. In general, the memory required by the kernel-based Perceptron for storing the online hypothesis is not bounded. Previous work focused on discarding part of the instances in order to keep the memory bounded. In the proposed algorithm the instances are not discarded, but projected onto the space spanned by the previous online hypothesis. We derive a relative mistake bound and compare our algorithm both analytically and empirically to the state-of-the-art Forgetron algorithm (Dekel et al., 2007). The first variant of our algorithm, called Projectron, outperforms the Forgetron. The second variant, called Projectron++, outperforms even the Perceptron.

1. Introduction

One of the most important aspects of online learning methods is their ability to work in an open-ended fashion. Autonomous agents, for example, need to learn continuously from their surroundings, to adapt to the environment and maintain satisfactory performance.
A recent stream of work on artificial cognitive systems has signaled the need for life-long learning methods and the promise of discriminative classifiers for this task (Orabona et al., 2007, and references therein). Kernel-based discriminative online algorithms have been shown to perform very well on binary classification problems (see for example Kivinen et al., 2004; Crammer et al., 2006). Most of them can be seen as belonging to the Perceptron algorithm family. They construct their classification function incrementally, keeping a subset of the instances called the support set. Each time an instance is misclassified, it is added to the support set, and the classification function is defined as a kernel combination of the observations in this set. It is clear that if the problem is not linearly separable, these algorithms never stop updating the classification function. This eventually leads to a memory explosion, and it concretely limits the use of these methods in all those applications where data must be acquired continuously in time.

Several authors have tried to address this problem, mainly by bounding the memory requirements a priori. The first algorithm to overcome the unlimited growth of the support set was proposed by Crammer et al. (2003), and was later refined by Weston et al. (2005). The idea was to discard one vector of the solution once the maximum dimension had been reached. The strategy was purely heuristic and no mistake bounds were given. A similar strategy has also been used in NORMA (Kivinen et al., 2004) and SILK (Cheng et al., 2007). The first online algorithm with a fixed memory "budget" and, at the same time, a relative mistake bound was the Forgetron (Dekel et al., 2007). A stochastic algorithm that on average achieves similar performance, with a similar mistake bound, was proposed by Cesa-Bianchi et al. (2006).

In this paper we take a different route. We modify the Perceptron algorithm so that the number of stored samples is always bounded. Instead of fixing the maximum dimension of the solution a priori, we introduce a parameter that can be tuned by the user to trade accuracy for sparseness, depending on the needs of the task at hand. We call this algorithm, which constitutes the first contribution of this paper, Projectron. The Projectron is an online, Perceptron-like method that is bounded in space and time complexity. We derive a mistake bound for it, and we show experimentally that it consistently outperforms the Forgetron algorithm. The second contribution of this paper is the derivation of a second algorithm, which we call Projectron++. It achieves better performance than the Perceptron, while retaining all the advantages of the Projectron listed above. Note that this is in contrast to previous budget online learning algorithms, which deliver performance at most as good as the original Perceptron.

The rest of the paper is organized as follows: in Section 2 we state the problem and introduce the necessary background theory. Section 3 introduces the Projectron, Section 4 derives its properties and Section 4.1 derives the Projectron++. We report experiments in Section 5, and we conclude the paper with an overall discussion.

2. Problem Setting

The basis of our study is the well known Perceptron algorithm (Rosenblatt, 1958). The Perceptron algorithm learns a mapping $f: \mathcal{X} \to \mathbb{R}$ based on a set of examples $S = \{(x_1, y_1), \ldots, (x_T, y_T)\}$, where $x_t \in \mathcal{X}$ is called an instance and $y_t \in \{-1, +1\}$ is called a label. We denote the prediction of the Perceptron as $\mathrm{sign}(f(x))$ and we interpret $|f(x)|$ as the confidence in the prediction. We call the output $f$ of the Perceptron algorithm a hypothesis, and we denote the set of all attainable hypotheses by $\mathcal{H}$. In this paper we assume that $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) with a positive definite kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implementing the inner product $\langle \cdot, \cdot \rangle$. The inner product is defined so that it satisfies the reproducing property, $\langle k(x, \cdot), f(\cdot)\rangle = f(x)$.

The Perceptron algorithm is an online algorithm, in which learning takes place in rounds. At each round a new hypothesis function is estimated, based on the previous one. We denote the hypothesis estimated after the $t$-th round by $f_t$. The algorithm starts with the zero hypothesis $f_0 = 0$. On each round $t$, an instance $x_t \in \mathcal{X}$ is presented to the algorithm. The algorithm predicts a label $\hat{y}_t \in \{-1, +1\}$ using the current function, $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$. Then the correct label $y_t$ is revealed. If the prediction $\hat{y}_t$ differs from the correct label $y_t$, the hypothesis is updated as $f_t = f_{t-1} + y_t k(x_t, \cdot)$, otherwise it is left intact, $f_t = f_{t-1}$. Practically, the hypothesis $f_t$ can be written as a kernel expansion (Schölkopf et al., 2000),

    $f_t(x) = \sum_{i \in S_t} \alpha_i k(x_i, x)$,    (1)

where $\alpha_i = y_i$ and $S_t$ is defined to be the set of instance indices for which an update of the hypothesis occurred, i.e., $S_t = \{0 \le i \le t \mid \hat{y}_i \ne y_i\}$. The set $S_t$ is called the support set. The Perceptron algorithm is summarized in Algorithm 1.

Algorithm 1 Perceptron Algorithm
    Initialize: $S_0 = \emptyset$, $f_0 = 0$
    for $t = 1, 2, \ldots, T$ do
        Receive new instance $x_t$
        Predict $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$
        Receive label $y_t$
        if $y_t \ne \hat{y}_t$ then
            $f_t = f_{t-1} + y_t k(x_t, \cdot)$
            $S_t = S_{t-1} \cup \{t\}$
        else
            $f_t = f_{t-1}$
            $S_t = S_{t-1}$
        end if
    end for
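For concreteness, the following is a minimal Python sketch of Algorithm 1; the Gaussian kernel and all function names are ours, chosen only for illustration and not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    """Example kernel; any positive definite kernel k(x, x') can be used."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_perceptron(X, Y, kernel=gaussian_kernel):
    """Kernel Perceptron in the spirit of Algorithm 1.

    X: (T, d) array of instances, Y: (T,) array of labels in {-1, +1}.
    Returns the support set indices, their coefficients alpha_i = y_i,
    and the number of prediction mistakes.
    """
    support, alpha, mistakes = [], [], 0
    for t in range(len(X)):
        # f_{t-1}(x_t) = sum_{i in S_{t-1}} alpha_i k(x_i, x_t)
        f_t = sum(a * kernel(X[i], X[t]) for i, a in zip(support, alpha))
        y_hat = 1 if f_t >= 0 else -1
        if y_hat != Y[t]:              # prediction mistake
            support.append(t)          # S_t = S_{t-1} U {t}
            alpha.append(Y[t])         # f_t = f_{t-1} + y_t k(x_t, .)
            mistakes += 1
    return support, alpha, mistakes
```

Note that the support set grows by one on every mistake; this is exactly the unbounded growth the Projectron is designed to avoid.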
Although the Perceptron is a very simple algorithm, it is considered to produce very good results. Our goal is to derive and analyze a new algorithm which attains the same results as the Perceptron, but with a minimal size of the support set. In the next section we present our Projectron algorithm.

3. The Projectron Algorithm

Let us first consider a finite dimensional RKHS $\mathcal{H}$ induced by a kernel such as the polynomial kernel. Since $\mathcal{H}$ is finite dimensional, there is a finite number of linearly independent hypotheses in this space. Hence, any hypothesis in this space can be expressed using a finite number of examples. We can modify the Perceptron algorithm to use only a set of linearly independent instances as follows. On each round the algorithm receives an instance and predicts its label. On a prediction mistake, if the instance can be spanned by the support set, namely $x_t = \sum_{i=1}^{t-1} d_i x_i$, it is not added to the support set. Instead, the coefficients $\{\alpha_i\}$ in the expansion of Eq. (1) are no longer merely $y_i$, $i \in S_{t-1}$, but are changed to reflect the addition of this instance to the hypothesis, that is, $\alpha_i = y_i + y_t d_i$, $1 \le i \le t-1$. If the instance and the support set are linearly independent, the instance is added to the set with $\alpha_t = y_t$ as before. This technique reduces the size of the support set without changing the hypothesis in any way, and was used by Downs et al. (2001) to simplify Support Vector Machine solutions.

Let us now consider the more elaborate case of an infinite dimensional RKHS $\mathcal{H}$ induced by kernels such as the Gaussian kernel. In this case, it is not possible to find a finite number of linearly independent vectors which span the whole space, and hence there is no guarantee that the hypothesis can be expressed by a finite number of instances. However, we can approximate the concept of linear independence with a finite number of vectors (Csató & Opper, 2001; Engel et al., 2002; Orabona et al., 2007). In particular, assume that at round $t$ of the algorithm there is a prediction mistake and the mistaken instance $x_t$ should be added to the support set. Before adding the instance to the support set, we construct two hypotheses: a temporary hypothesis $f'_t$ using the function $k(x_t, \cdot)$, that is, $f'_t = f_{t-1} + y_t k(x_t, \cdot)$, and a projected hypothesis $f''_t$, which is the projection of $f'_t$ onto the space spanned by $S_{t-1}$. That is, the projected hypothesis is the hypothesis expressible with the support set $S_{t-1}$ which is closest to the temporary hypothesis. Denote by $\delta_t$ the difference between the two hypotheses, $\delta_t = f''_t - f'_t$. If the norm of $\delta_t$ is below some threshold $\eta$, we use the projected hypothesis as our next hypothesis, i.e., $f_t = f''_t$; otherwise we use the temporary hypothesis, i.e., $f_t = f'_t$. As we show in the next section, this strategy assures that the maximum size of the support set is always finite, regardless of the dimension of the RKHS $\mathcal{H}$.

Guided by these considerations, we can design a new Perceptron-like algorithm that projects the solution onto the space spanned by the previous support vectors whenever possible. We call this algorithm Projectron. The algorithm is given in Algorithm 2.

Algorithm 2 Projectron Algorithm
    Initialize: $S_0 = \emptyset$, $f_0 = 0$
    for $t = 1, 2, \ldots, T$ do
        Receive new instance $x_t$
        Predict $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$
        Receive label $y_t$
        if $y_t \ne \hat{y}_t$ then
            $f'_t = f_{t-1} + y_t k(x_t, \cdot)$
            $f''_t = f'_t$ projected onto the space spanned by $S_{t-1}$
            $\delta_t = f''_t - f'_t$
            if $\|\delta_t\| \le \eta$ then
                $f_t = f''_t$
                $S_t = S_{t-1}$
            else
                $f_t = f'_t$
                $S_t = S_{t-1} \cup \{t\}$
            end if
        else
            $f_t = f_{t-1}$
            $S_t = S_{t-1}$
        end if
    end for
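A minimal Python sketch of Algorithm 2 follows; it is illustrative only, and for clarity the projection coefficients are obtained here by solving a linear system directly, whereas the efficient incremental computation used by the paper is derived in the next section.

```python
import numpy as np

def projectron(X, Y, kernel, eta):
    """Minimal sketch of the Projectron (Algorithm 2).

    X: (T, d) array of instances, Y: (T,) labels in {-1, +1}, eta: threshold.
    Returns the support indices S_T and the expansion coefficients alpha.
    """
    support, alpha = [], []                            # S_t and the alpha_i of Eq. (1)
    for t in range(len(X)):
        f_t = sum(a * kernel(X[i], X[t]) for i, a in zip(support, alpha))
        y_hat = 1 if f_t >= 0 else -1
        if y_hat == Y[t]:
            continue                                   # correct prediction: no update
        if support:
            K = np.array([[kernel(X[i], X[j]) for j in support] for i in support])
            k_t = np.array([kernel(X[i], X[t]) for i in support])
            d = np.linalg.solve(K, k_t)                # projection coefficients d*
            delta_sq = kernel(X[t], X[t]) - k_t @ d    # ||delta_t||^2
        else:
            d, delta_sq = None, np.inf
        if delta_sq <= eta ** 2:                       # ||delta_t|| <= eta: project,
            alpha = [a + Y[t] * dj for a, dj in zip(alpha, d)]   # support set unchanged
        else:                                          # otherwise add x_t to the support set
            support.append(t)
            alpha.append(Y[t])
    return support, alpha
```

With eta = 0 this reduces to the exact Perceptron hypothesis: instances are skipped only when the projection error is exactly zero.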
In our algorithm the parameter $\eta$ plays an important role. If $\eta$ is equal to zero, we obtain exactly the same solution as the Perceptron algorithm. In this case, however, the Projectron solution can still be sparser, when some of the instances are linearly dependent or when the kernel induces a finite dimensional RKHS $\mathcal{H}$. If $\eta$ is greater than zero, we trade precision for sparseness. Moreover, as shown in the next section, this implies a bounded algorithmic complexity, namely, the memory and time requirements of each step are bounded. We will also derive mistake bounds to analyze the effect of $\eta$ on the classification accuracy.

We now consider the problem of deriving the projected hypothesis $f''_t$ in a Hilbert space $\mathcal{H}$ induced by a kernel function $k(\cdot, \cdot)$. Denote by $P_{t-1} f'_t$ the projection of $f'_t \in \mathcal{H}$ onto the subspace $\mathcal{H}_{t-1} \subseteq \mathcal{H}$ spanned by the set $S_{t-1}$. The projected hypothesis $f''_t$ is defined as $f''_t = P_{t-1} f'_t$. Expanding $f'_t$ we have

    $f''_t = P_{t-1} f'_t = P_{t-1}\left(f_{t-1} + y_t k(x_t, \cdot)\right)$.    (2)

The projection is an idempotent ($P_{t-1}^2 = P_{t-1}$) and linear operator, hence

    $f''_t = f_{t-1} + y_t P_{t-1} k(x_t, \cdot)$.    (3)

Recall that $\delta_t = f''_t - f'_t$. Substituting $f''_t$ from Eq. (3) and $f'_t$, we have

    $\delta_t = f''_t - f'_t = y_t P_{t-1} k(x_t, \cdot) - y_t k(x_t, \cdot)$.    (4)

Recall also that the projection of $f'_t \in \mathcal{H}$ onto a subspace $\mathcal{H}_{t-1} \subseteq \mathcal{H}$ is the hypothesis in $\mathcal{H}_{t-1}$ closest to $f'_t$. Hence, let $\sum_{j \in S_{t-1}} d_j k(x_j, \cdot)$ be a hypothesis in $\mathcal{H}_{t-1}$, where $(d_1, \ldots, d_{t-1})$ is a set of coefficients. The closest hypothesis is the one for which

    $\|\delta_t\|^2 = \min_{(d_1, \ldots, d_{t-1})} \Big\| \sum_{j \in S_{t-1}} d_j k(x_j, \cdot) - k(x_t, \cdot) \Big\|^2$.    (5)

Expanding Eq. (5) we get

    $\|\delta_t\|^2 = \min_{(d_1, \ldots, d_{t-1})} \Big( \sum_{i, j \in S_{t-1}} d_i d_j k(x_i, x_j) - 2 \sum_{j \in S_{t-1}} d_j k(x_j, x_t) + k(x_t, x_t) \Big)$.    (6)

Define $K_{t-1}$ to be the matrix generated by the instances in the support set $S_{t-1}$, that is, $\{K_{t-1}\}_{i,j} = k(x_i, x_j)$ for every $i, j \in S_{t-1}$, and define $k_t$ to be the vector whose $i$-th element is $k(x_i, x_t)$. We have

    $\|\delta_t\|^2 = \min_{d} \left( d^T K_{t-1} d - 2 d^T k_t + k(x_t, x_t) \right)$,    (7)

where $d = (d_1, \ldots, d_{t-1})^T$. Solving Eq. (7), that is, applying the extremum conditions with respect to $d$, we obtain

    $d^* = K_{t-1}^{-1} k_t$    (8)

and, by substituting Eq. (8) into Eq. (7),

    $\|\delta_t\|^2 = k(x_t, x_t) - k_t^T d^*$.    (9)

Furthermore, substituting Eq. (8) into Eq. (3), we get

    $f''_t = f_{t-1} + y_t \sum_{j \in S_{t-1}} d^*_j k(x_j, \cdot)$.    (10)

We have shown how to calculate both the distance $\|\delta_t\|$ and the projected hypothesis $f''_t$. In summary, one needs to compute $d^*$ according to Eq. (8) and plug the result either into Eq. (9), to obtain $\|\delta_t\|$, or into Eq. (10), to obtain the projected hypothesis. In order to make the computation more tractable, we introduce an efficient method to calculate the inverse matrix $K_{t-1}^{-1}$ iteratively. This method was first introduced in (Cauwenberghs & Poggio, 2000), and we give it here only for completeness. We note in passing that the matrix $K_{t-1}$ can be safely inverted since, by incremental construction, it is always full-rank. After the addition of a new sample, $K_t^{-1}$ becomes

    $K_t^{-1} = \begin{pmatrix} K_{t-1}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \frac{1}{\|\delta_t\|^2} \begin{pmatrix} d^* \\ -1 \end{pmatrix} \begin{pmatrix} d^* \\ -1 \end{pmatrix}^T$,    (11)

where $d^*$ and $\|\delta_t\|^2$ have already been evaluated during the previous steps of the algorithm. Thanks to this incremental evaluation, the time complexity of the linear independence check is $O(|S_{t-1}|^2)$, as one can easily see from Eq. (8).
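The bookkeeping described above could be implemented along the following lines; this is a sketch, the function names are ours, and the base case for the first support vector is handled as noted in the comment.

```python
import numpy as np

def project_coeffs(K_inv, k_t, k_tt):
    """Projection of k(x_t, .) onto the span of the current support set.

    K_inv : cached inverse kernel matrix K_{t-1}^{-1} of the support set
    k_t   : vector with entries k(x_i, x_t), i in S_{t-1}
    k_tt  : scalar k(x_t, x_t)
    Returns d* = K_{t-1}^{-1} k_t (Eq. 8) and ||delta_t||^2 (Eq. 9), in O(|S_{t-1}|^2).
    """
    d_star = K_inv @ k_t
    delta_sq = k_tt - k_t @ d_star
    return d_star, delta_sq

def grow_inverse(K_inv, d_star, delta_sq):
    """Incremental inverse update of Eq. (11), used when x_t enters the support set.

    For the very first support vector one simply starts from K_1^{-1} = [[1 / k(x_1, x_1)]].
    """
    n = K_inv.shape[0]
    padded = np.zeros((n + 1, n + 1))
    padded[:n, :n] = K_inv                          # [[K_{t-1}^{-1}, 0], [0, 0]]
    v = np.append(d_star, -1.0)                     # (d*; -1)
    return padded + np.outer(v, v) / delta_sq       # rank-one correction of Eq. (11)
```

Calling grow_inverse only when $\|\delta_t\| > \eta$ keeps the cached inverse matched to the support set, so each round costs $O(|S_{t-1}|^2)$ instead of a full $O(|S_{t-1}|^3)$ matrix inversion.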
4. Analysis

In this section we analyze the performance of the Projectron algorithm in the usual framework of online learning with a competitor. First, we present a theorem which states that the size of the support set is bounded.

Theorem 1. Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous Mercer kernel, with $\mathcal{X}$ a compact subset of a Banach space. Then, for any training sequence $(x_i, y_i)$, $i = 1, 2, \ldots$, and for any $\eta > 0$, the size of the support set of the Projectron algorithm is finite.

The proof of this theorem goes along the same lines as the proof of Theorem 3.1 in (Engel et al., 2002), and we omit it for brevity. Note that this theorem guarantees that the size of the support set is bounded; however, it does not state that the size of the support set is fixed, or that it can be estimated before training.

The next theorem provides a mistake bound. The main idea is to bound the maximum number of mistakes of the algorithm relative to the best hypothesis $g \in \mathcal{H}$ chosen in hindsight. Let us define $D_1$ as

    $D_1 = \sum_{t=1}^{T} \ell(g(x_t), y_t)$,    (12)

where $\ell(g(x_t), y_t)$ is the hinge loss suffered by the function $g$ on the example $(x_t, y_t)$, that is, $\max\{0, 1 - y_t g(x_t)\}$. With these definitions we can state the following bound for the Projectron algorithm.

Theorem 2. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le R^2$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$. Assume that the Projectron algorithm is run with $0 < \eta < \frac{2 - R^2}{2\|g\|}$. Then the number of prediction mistakes the Projectron makes on the sequence is at most

    $\frac{\|g\|^2 + 2 D_1}{2 - R^2 - 2\eta\|g\|}$.

The proof of this theorem is based on the following lemma.

Lemma 1. Let $(x, y)$ be an example, with $x \in \mathcal{X}$ and $y \in \{-1, +1\}$. Denote by $f$ a hypothesis in $\mathcal{H}$ such that $y f(x) < 1$. Let $f' = f + \tau y\, q(\cdot)$, where $q(\cdot) \in \mathcal{H}$. Then the following bound holds for any $\tau \ge 0$:

    $\|f - g\|^2 - \|f' - g\|^2 \ge 2\tau\, \ell(f(x), y) - 2\tau\, \ell(g(x), y) - \tau^2 \|q(\cdot)\|^2 - 2\tau y \langle f, q(\cdot) - k(x, \cdot)\rangle - 2\tau \|q(\cdot) - k(x, \cdot)\| \cdot \|g\|$.

Proof. We have

    $\|f - g\|^2 - \|f' - g\|^2 = 2\tau y \langle g - f, q(\cdot)\rangle - \tau^2 \|q(\cdot)\|^2$
    $= 2\tau y \left(g(x) - f(x)\right) - \tau^2 \|q(\cdot)\|^2 + 2\tau y \langle g - f, q(\cdot) - k(x, \cdot)\rangle$
    $\ge 2\tau\, \ell(f(x), y) - 2\tau\, \ell(g(x), y) - \tau^2 \|q(\cdot)\|^2 - 2\tau y \langle f, q(\cdot) - k(x, \cdot)\rangle - 2\tau \|q(\cdot) - k(x, \cdot)\| \cdot \|g\|$,

where the last inequality uses $y f(x) < 1$, the definition of the hinge loss, and the Cauchy-Schwarz inequality.

With this bound we are ready to prove Thm. 2.

Proof. Define the relative progress in each round as $\Delta_t = \|f_{t-1} - g\|^2 - \|f_t - g\|^2$, and let $\tau_t$ be 1 if there is a prediction mistake on round $t$ and 0 otherwise. We bound the progress from above and below. On rounds in which there is no mistake $\Delta_t$ is 0. On rounds in which there is a mistake there are two possible updates: either $f_t = f_{t-1} + y_t P_{t-1} k(x_t, \cdot)$ or $f_t = f_{t-1} + y_t k(x_t, \cdot)$. In the following we bound the progress from below when the update is of the former type (the same bound can be obtained for the latter type as well, but the derivation is omitted). In particular, we apply Lemma 1 with $\tau = \tau_t$ and $q(\cdot) = P_{t-1} k(x_t, \cdot)$ (the condition $y_t f_{t-1}(x_t) < 1$ of the lemma holds on mistake rounds), and use $\delta_t = y_t P_{t-1} k(x_t, \cdot) - y_t k(x_t, \cdot)$ from Eq. (4), obtaining

    $\Delta_t = \|f_{t-1} - g\|^2 - \|f_t - g\|^2 \ge 2\tau_t\, \ell(f_{t-1}(x_t), y_t) - 2\tau_t\, \ell(g(x_t), y_t) - \tau_t \|P_{t-1} k(x_t, \cdot)\|^2 - 2\tau_t \langle f_{t-1}, \delta_t\rangle - 2\tau_t \|\delta_t\| \cdot \|g\|$,

where we used $\tau_t^2 = \tau_t$. Note that $\langle f_{t-1}, \delta_t\rangle = 0$, because $f_{t-1}$ belongs to the space spanned by the functions indexed by $S_{t-1}$. Moreover, on every projection update $\|\delta_t\| \le \eta$, and, since the projection is a contraction, the theorem assumption gives $\|P_{t-1} k(x_t, \cdot)\| \le R$. We then have

    $\Delta_t \ge 2\tau_t\left(\ell(f_{t-1}(x_t), y_t) - \ell(g(x_t), y_t)\right) - \tau_t R^2 - 2\tau_t \eta \|g\|$.

We can further bound $\Delta_t$ by noting that on every prediction mistake $\ell(f_{t-1}(x_t), y_t) \ge 1$. Overall we have

    $\|f_{t-1} - g\|^2 - \|f_t - g\|^2 \ge 2\tau_t\left(1 - \ell(g(x_t), y_t)\right) - \tau_t R^2 - 2\tau_t \eta \|g\|$.

We now sum both sides over $t$. Since $\tau_t \le 1$, we have $\sum_t \tau_t\, \ell(g(x_t), y_t) \le D_1$, while $\sum_t \tau_t = M$, the number of mistakes. The left hand side is a telescoping sum, hence it collapses to $\|f_0 - g\|^2 - \|f_T - g\|^2$, which can be upper bounded by $\|g\|^2$, using the fact that $f_0 = 0$ and that $\|f_T - g\|^2$ is non-negative. Finally, we obtain

    $\|g\|^2 + 2 D_1 \ge M \left(2 - R^2 - 2\eta \|g\|\right)$,

which proves the theorem.
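For completeness, the orthogonality step used in the proof can be spelled out as follows; it is a standard property of orthogonal projections, written here in the paper's notation.

```latex
% For any h in H_{t-1} (in particular h = f_{t-1}), using that P_{t-1} is the
% orthogonal projection onto H_{t-1}, hence self-adjoint, and that P_{t-1} h = h:
\langle h, \delta_t \rangle
  = y_t \,\langle h,\; P_{t-1} k(x_t,\cdot) - k(x_t,\cdot) \rangle
  = y_t \left( \langle P_{t-1} h,\; k(x_t,\cdot) \rangle - \langle h,\; k(x_t,\cdot) \rangle \right)
  = 0 .
```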
To compare with other similar algorithms, it can be useful to reformulate the algorithm so that it uses the maximum norm of $g$ as a parameter instead of $\eta$. Hence, we can fix an upper bound $U$ on $\|g\|$ and then set $\eta$ so as to guarantee a positive progress. Specifically, on each round we set $\eta$ to be

    $\eta = \frac{1}{2U}\left(2\, \ell(f_{t-1}(x_t), y_t) - \|P_{t-1} k(x_t, \cdot)\|^2 - 0.5\right)$.    (13)

The next corollary, based on Thm. 2, provides a mistake bound in terms of $U$ rather than $\eta$.

Corollary 1. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le 1$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$, whose norm $\|g\|$ is bounded by $U$. Assume that the Projectron algorithm is run with a parameter $\eta$ which is set on each round according to Eq. (13). Then the number of prediction mistakes the Projectron makes on the sequence is at most $2\|g\|^2 + 4 D_1$.

Notice that the bound in Corollary 1 is similar to Thm. 5.1 in (Dekel et al., 2007) for the Forgetron algorithm. The difference is in the assumptions made: in the Forgetron, the size of the support set is guaranteed to be less than a fixed size $B$ that depends on $U$, while in the Projectron we choose the value of $\eta$ or, equivalently, $U$, and there is no guarantee on the exact size of the support set. However, the experimental results suggest that, under the same assumptions used in the derivation of the Forgetron bound, the Projectron needs a smaller support set and makes fewer mistakes.
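Corollary 1 is stated without an explicit proof; assuming Eq. (13) as written above, its constant can be recovered by repeating the argument of the proof of Thm. 2 with the per-round choice of $\eta$ (a sketch, not the authors' derivation):

```latex
% On a mistake round with a projected update (proof of Thm. 2, tau_t = 1):
\Delta_t \;\ge\; 2\,\ell(f_{t-1}(x_t),y_t) - 2\,\ell(g(x_t),y_t)
                 - \|P_{t-1}k(x_t,\cdot)\|^2 - 2\eta\|g\| .
% With \|g\| \le U and \eta chosen as in Eq. (13),
% 2\eta U = 2\,\ell(f_{t-1}(x_t),y_t) - \|P_{t-1}k(x_t,\cdot)\|^2 - \tfrac{1}{2}, so
\Delta_t \;\ge\; \tfrac{1}{2} - 2\,\ell(g(x_t),y_t) .
% (Full, non-projected updates give at least the same progress when k(x_t,x_t) \le 1.)
% Summing over the M mistake rounds and telescoping as before:
\|g\|^2 \;\ge\; \tfrac{M}{2} - 2 D_1
\quad\Longrightarrow\quad
M \;\le\; 2\|g\|^2 + 4 D_1 .
```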
It is also possible to give yet another bound by slightly changing the proof of Thm. 2. This theorem is a worst-case mistake bound for the Projectron algorithm. We state it here without proof, leaving it for a longer version of this paper.

Theorem 3. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le R^2$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$. Assume that the Projectron algorithm is run with $0 < \eta < \frac{1}{\|g\|}$. Then $M$, the number of prediction mistakes the Projectron makes on the sequence, is at most

    $\left( \frac{R\|g\| + \sqrt{R^2\|g\|^2 + 4 D_1}}{2\left(1 - \eta\|g\|\right)} \right)^2$.

The last theorem suggests that the performance of the Projectron is slightly worse than that of the Perceptron (Shalev-Shwartz & Singer, 2005). Specifically, the degradation in the performance of the Projectron compared to the Perceptron is governed by the factor $1/(1 - \eta\|g\|)^2$. In the next subsection we present a variant of the Projectron algorithm which attains even better performance.

4.1. Going Beyond the Perceptron

The proofs of Thm. 2 and Corollary 1 show us how to improve the Projectron algorithm so that it goes beyond the performance of the Perceptron algorithm, while maintaining a bounded support set. Let us start from the algorithm of Corollary 1. We change it so that an update takes place not only when there is a prediction mistake, but also when the prediction is correct but with low confidence. We refer to this latter case as a margin error, that is, $0 < y_t f_{t-1}(x_t) < 1$. This strategy improves the classification rate, but it also increases the size of the support set (Crammer et al., 2006). A possible solution to this obstacle is to update on margin error rounds only when the new instance can be projected onto the support set. Hence, the update on margin error rounds would be of the general form

    $f_t = f_{t-1} + y_t \tau_t P_{t-1} k(x_t, \cdot)$,    (14)

with $0 < \tau_t \le 1$. The last constraint comes from the proofs of Thm. 2 and Corollary 1, in which we upper bound $\tau_t$ by 1. Note that setting $\tau_t$ to 0 is equivalent to leaving the hypothesis unchanged. The bound in Corollary 1 becomes

    $M \le 2\Big(\|g\|^2 + 2 D_1 - \sum_{t:\, 0 < y_t f_{t-1}(x_t) < 1} \cdots \Big)$    (15)