The Projectron: a Bounded Kernel-Based Perceptron

Francesco Orabona, Joseph Keshet, Barbara Caputo
IDIAP Research Institute, Centre du Parc, CH-1920 Martigny, Switzerland
Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland
forabona@idiap.ch, jkeshet@idiap.ch, bcaputo@idiap.ch

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We present a discriminative online algorithm with bounded memory growth, based on the kernel-based Perceptron. In general, the memory required by the kernel-based Perceptron for storing the online hypothesis is not bounded. Previous work focused on discarding part of the instances in order to keep the memory bounded. In the proposed algorithm the instances are not discarded, but projected onto the space spanned by the previous online hypothesis. We derive a relative mistake bound and compare our algorithm both analytically and empirically to the state-of-the-art Forgetron algorithm (Dekel et al., 2007). The first variant of our algorithm, called Projectron, outperforms the Forgetron. The second variant, called Projectron++, outperforms even the Perceptron.

1. Introduction

One of the most important aspects of online learning methods is their ability to work in an open-ended fashion. Autonomous agents, for example, need to learn continuously from their surroundings, to adapt to the environment and maintain satisfactory performance.
A recent stream of work on artificial cognitive systems has signaled the need for life-long learning methods and the promise of discriminative classifiers for this task (Orabona et al., 2007, and references therein). Kernel-based discriminative online algorithms have been shown to perform very well on binary classification problems (see for example Kivinen et al., 2004; Crammer et al., 2006). Most of them can be seen as belonging to the Perceptron algorithm family. They construct their classification function incrementally, keeping a subset of the instances called the support set. Each time an instance is misclassified, it is added to the support set, and the classification function is defined as a kernel combination of the observations in this set. It is clear that if the problem is not linearly separable, these algorithms never stop updating the classification function. This eventually leads to a memory explosion, and it concretely limits the use of these methods in all those applications where data must be acquired continuously in time.

Several authors have tried to address this problem, mainly by bounding the memory requirements a priori. The first algorithm to overcome the unlimited growth of the support set was proposed by Crammer et al. (2003), and was later refined by Weston et al. (2005). The idea was to discard one vector of the solution once the maximum dimension had been reached. The strategy was purely heuristic and no mistake bounds were given. A similar strategy has also been used in NORMA (Kivinen et al., 2004) and SILK (Cheng et al., 2007). The first online algorithm with a fixed memory "budget" and, at the same time, a relative mistake bound was the Forgetron (Dekel et al., 2007). A stochastic algorithm that on average achieves similar performance, with a similar mistake bound, was proposed by Cesa-Bianchi et al. (2006).

In this paper we take a different route. We modify the Perceptron algorithm so that the number of stored samples is always bounded. Instead of fixing the maximum dimension of the solution a priori, we introduce a parameter that can be tuned by the user to trade accuracy for sparseness, depending on the needs of the task at hand. We call this algorithm, which constitutes the first contribution of this paper, Projectron. The Projectron is an online, Perceptron-like method that is bounded in space and time complexity. We derive a mistake bound for it, and we show experimentally that it consistently outperforms the Forgetron algorithm. The second contribution of this paper is the derivation of a second algorithm, which we call Projectron++. It achieves better performance than the Perceptron, while retaining all the advantages of the Projectron listed above. Note that this is in contrast to previous budget online learning algorithms, which deliver performance at most as good as the original Perceptron.

The rest of the paper is organized as follows: in Section 2 we state the problem and introduce the necessary background theory. Section 3 introduces the Projectron, Section 4 derives its properties and Section 4.1 derives the Projectron++. We report experiments in Section 5, and we conclude the paper with an overall discussion.

2. Problem Setting

The basis of our study is the well known Perceptron algorithm (Rosenblatt, 1958). The Perceptron algorithm learns a mapping $f: \mathcal{X} \to \mathbb{R}$ based on a set of examples $S = \{(x_1, y_1), \ldots, (x_T, y_T)\}$, where $x_t \in \mathcal{X}$ is called an instance and $y_t \in \{-1, +1\}$ is called a label. We denote the prediction of the Perceptron as $\mathrm{sign}(f(x))$ and we interpret $|f(x)|$ as the confidence in the prediction. We call the output $f$ of the Perceptron algorithm a hypothesis, and we denote the set of all attainable hypotheses by $\mathcal{H}$. In this paper we assume that $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) with a positive definite kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implementing the inner product $\langle \cdot, \cdot \rangle$. The inner product is defined so that it satisfies the reproducing property, $\langle k(x, \cdot), f(\cdot)\rangle = f(x)$.

The Perceptron algorithm is an online algorithm, in which learning takes place in rounds. At each round a new hypothesis function is estimated, based on the previous one. We denote the hypothesis estimated after the $t$-th round by $f_t$. The algorithm starts with the zero hypothesis $f_0 = 0$. On each round $t$, an instance $x_t \in \mathcal{X}$ is presented to the algorithm. The algorithm predicts a label $\hat{y}_t \in \{-1, +1\}$ using the current function, $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$. Then the correct label $y_t$ is revealed. If the prediction $\hat{y}_t$ differs from the correct label $y_t$, the hypothesis is updated as $f_t = f_{t-1} + y_t k(x_t, \cdot)$, otherwise it is left intact, $f_t = f_{t-1}$. Practically, the hypothesis $f_t$ can be written as a kernel expansion (Schölkopf et al., 2000),

    $f_t(x) = \sum_{i \in S_t} \alpha_i k(x_i, x)$,    (1)

where $\alpha_i = y_i$ and $S_t$ is defined to be the set of instance indices for which an update of the hypothesis occurred, i.e., $S_t = \{0 \le i \le t \mid \hat{y}_i \ne y_i\}$. The set $S_t$ is called the support set. The Perceptron algorithm is summarized in Algorithm 1.

Algorithm 1 Perceptron Algorithm
    Initialize: $S_0 = \emptyset$, $f_0 = 0$
    for $t = 1, 2, \ldots, T$ do
        Receive new instance $x_t$
        Predict $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$
        Receive label $y_t$
        if $y_t \ne \hat{y}_t$ then
            $f_t = f_{t-1} + y_t k(x_t, \cdot)$
            $S_t = S_{t-1} \cup \{t\}$
        else
            $f_t = f_{t-1}$
            $S_t = S_{t-1}$
        end if
    end for
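For concreteness, the following is a minimal Python sketch of Algorithm 1; the Gaussian kernel and all function names are ours, chosen only for illustration and not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    """Example kernel; any positive definite kernel k(x, x') can be used."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_perceptron(X, Y, kernel=gaussian_kernel):
    """Kernel Perceptron in the spirit of Algorithm 1.

    X: (T, d) array of instances, Y: (T,) array of labels in {-1, +1}.
    Returns the support set indices, their coefficients alpha_i = y_i,
    and the number of prediction mistakes.
    """
    support, alpha, mistakes = [], [], 0
    for t in range(len(X)):
        # f_{t-1}(x_t) = sum_{i in S_{t-1}} alpha_i k(x_i, x_t)
        f_t = sum(a * kernel(X[i], X[t]) for i, a in zip(support, alpha))
        y_hat = 1 if f_t >= 0 else -1
        if y_hat != Y[t]:              # prediction mistake
            support.append(t)          # S_t = S_{t-1} U {t}
            alpha.append(Y[t])         # f_t = f_{t-1} + y_t k(x_t, .)
            mistakes += 1
    return support, alpha, mistakes
```

Note that the support set grows by one on every mistake; this is exactly the unbounded growth the Projectron is designed to avoid.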
Although the Perceptron is a very simple algorithm, it is considered to produce very good results. Our goal is to derive and analyze a new algorithm which attains the same results as the Perceptron, but with a minimal size of the support set. In the next section we present our Projectron algorithm.

3. The Projectron Algorithm

Let us first consider a finite dimensional RKHS $\mathcal{H}$ induced by a kernel such as the polynomial kernel. Since $\mathcal{H}$ is finite dimensional, there is a finite number of linearly independent hypotheses in this space. Hence, any hypothesis in this space can be expressed using a finite number of examples. We can modify the Perceptron algorithm to use only a set of linearly independent instances as follows. On each round the algorithm receives an instance and predicts its label. On a prediction mistake, if the instance can be spanned by the support set, namely $x_t = \sum_{i=1}^{t-1} d_i x_i$, it is not added to the support set. Instead, the coefficients $\{\alpha_i\}$ in the expansion of Eq. (1) are no longer merely $y_i$, $i \in S_{t-1}$, but are changed to reflect the addition of this instance to the hypothesis, that is, $\alpha_i = y_i + y_t d_i$, $1 \le i \le t-1$. If the instance and the support set are linearly independent, the instance is added to the set with $\alpha_t = y_t$ as before. This technique reduces the size of the support set without changing the hypothesis in any way, and was used by Downs et al. (2001) to simplify Support Vector Machine solutions.

Let us now consider the more elaborate case of an infinite dimensional RKHS $\mathcal{H}$ induced by kernels such as the Gaussian kernel. In this case, it is not possible to find a finite number of linearly independent vectors which span the whole space, and hence there is no guarantee that the hypothesis can be expressed by a finite number of instances. However, we can approximate the concept of linear independence with a finite number of vectors (Csató & Opper, 2001; Engel et al., 2002; Orabona et al., 2007). In particular, assume that at round $t$ of the algorithm there is a prediction mistake and the mistaken instance $x_t$ should be added to the support set. Before adding the instance to the support set, we construct two hypotheses: a temporary hypothesis $f'_t$ using the function $k(x_t, \cdot)$, that is, $f'_t = f_{t-1} + y_t k(x_t, \cdot)$, and a projected hypothesis $f''_t$, which is the projection of $f'_t$ onto the space spanned by $S_{t-1}$. That is, the projected hypothesis is the hypothesis expressible with the support set $S_{t-1}$ which is closest to the temporary hypothesis. Denote by $\delta_t$ the difference between the two hypotheses, $\delta_t = f''_t - f'_t$. If the norm of $\delta_t$ is below some threshold $\eta$, we use the projected hypothesis as our next hypothesis, i.e., $f_t = f''_t$; otherwise we use the temporary hypothesis, i.e., $f_t = f'_t$. As we show in the next section, this strategy assures that the maximum size of the support set is always finite, regardless of the dimension of the RKHS $\mathcal{H}$.

Guided by these considerations, we can design a new Perceptron-like algorithm that projects the solution onto the space spanned by the previous support vectors whenever possible. We call this algorithm Projectron. The algorithm is given in Algorithm 2.

Algorithm 2 Projectron Algorithm
    Initialize: $S_0 = \emptyset$, $f_0 = 0$
    for $t = 1, 2, \ldots, T$ do
        Receive new instance $x_t$
        Predict $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$
        Receive label $y_t$
        if $y_t \ne \hat{y}_t$ then
            $f'_t = f_{t-1} + y_t k(x_t, \cdot)$
            $f''_t = f'_t$ projected onto the space spanned by $S_{t-1}$
            $\delta_t = f''_t - f'_t$
            if $\|\delta_t\| \le \eta$ then
                $f_t = f''_t$
                $S_t = S_{t-1}$
            else
                $f_t = f'_t$
                $S_t = S_{t-1} \cup \{t\}$
            end if
        else
            $f_t = f_{t-1}$
            $S_t = S_{t-1}$
        end if
    end for
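A minimal Python sketch of Algorithm 2 follows; it is illustrative only, and for clarity the projection coefficients are obtained here by solving a linear system directly, whereas the efficient incremental computation used by the paper is derived in the next section.

```python
import numpy as np

def projectron(X, Y, kernel, eta):
    """Minimal sketch of the Projectron (Algorithm 2).

    X: (T, d) array of instances, Y: (T,) labels in {-1, +1}, eta: threshold.
    Returns the support indices S_T and the expansion coefficients alpha.
    """
    support, alpha = [], []                            # S_t and the alpha_i of Eq. (1)
    for t in range(len(X)):
        f_t = sum(a * kernel(X[i], X[t]) for i, a in zip(support, alpha))
        y_hat = 1 if f_t >= 0 else -1
        if y_hat == Y[t]:
            continue                                   # correct prediction: no update
        if support:
            K = np.array([[kernel(X[i], X[j]) for j in support] for i in support])
            k_t = np.array([kernel(X[i], X[t]) for i in support])
            d = np.linalg.solve(K, k_t)                # projection coefficients d*
            delta_sq = kernel(X[t], X[t]) - k_t @ d    # ||delta_t||^2
        else:
            d, delta_sq = None, np.inf
        if delta_sq <= eta ** 2:                       # ||delta_t|| <= eta: project,
            alpha = [a + Y[t] * dj for a, dj in zip(alpha, d)]   # support set unchanged
        else:                                          # otherwise add x_t to the support set
            support.append(t)
            alpha.append(Y[t])
    return support, alpha
```

With eta = 0 this reduces to the exact Perceptron hypothesis: instances are skipped only when the projection error is exactly zero.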
In our algorithm the parameter $\eta$ plays an important role. If $\eta$ is equal to zero, we obtain exactly the same solution as the Perceptron algorithm. In this case, however, the Projectron solution can still be sparser, when some of the instances are linearly dependent or when the kernel induces a finite dimensional RKHS $\mathcal{H}$. If $\eta$ is greater than zero, we trade precision for sparseness. Moreover, as shown in the next section, this implies a bounded algorithmic complexity, namely, the memory and time requirements of each step are bounded. We will also derive mistake bounds to analyze the effect of $\eta$ on the classification accuracy.

We now consider the problem of deriving the projected hypothesis $f''_t$ in a Hilbert space $\mathcal{H}$ induced by a kernel function $k(\cdot, \cdot)$. Denote by $P_{t-1} f'_t$ the projection of $f'_t \in \mathcal{H}$ onto the subspace $\mathcal{H}_{t-1} \subseteq \mathcal{H}$ spanned by the set $S_{t-1}$. The projected hypothesis $f''_t$ is defined as $f''_t = P_{t-1} f'_t$. Expanding $f'_t$ we have

    $f''_t = P_{t-1} f'_t = P_{t-1}\left(f_{t-1} + y_t k(x_t, \cdot)\right)$.    (2)

The projection is an idempotent ($P_{t-1}^2 = P_{t-1}$) and linear operator, hence

    $f''_t = f_{t-1} + y_t P_{t-1} k(x_t, \cdot)$.    (3)

Recall that $\delta_t = f''_t - f'_t$. Substituting $f''_t$ from Eq. (3) and $f'_t$, we have

    $\delta_t = f''_t - f'_t = y_t P_{t-1} k(x_t, \cdot) - y_t k(x_t, \cdot)$.    (4)

Recall also that the projection of $f'_t \in \mathcal{H}$ onto a subspace $\mathcal{H}_{t-1} \subseteq \mathcal{H}$ is the hypothesis in $\mathcal{H}_{t-1}$ closest to $f'_t$. Hence, let $\sum_{j \in S_{t-1}} d_j k(x_j, \cdot)$ be a hypothesis in $\mathcal{H}_{t-1}$, where $(d_1, \ldots, d_{t-1})$ is a set of coefficients. The closest hypothesis is the one for which

    $\|\delta_t\|^2 = \min_{(d_1, \ldots, d_{t-1})} \Big\| \sum_{j \in S_{t-1}} d_j k(x_j, \cdot) - k(x_t, \cdot) \Big\|^2$.    (5)

Expanding Eq. (5) we get

    $\|\delta_t\|^2 = \min_{(d_1, \ldots, d_{t-1})} \Big( \sum_{i, j \in S_{t-1}} d_i d_j k(x_i, x_j) - 2 \sum_{j \in S_{t-1}} d_j k(x_j, x_t) + k(x_t, x_t) \Big)$.    (6)

Define $K_{t-1}$ to be the matrix generated by the instances in the support set $S_{t-1}$, that is, $\{K_{t-1}\}_{i,j} = k(x_i, x_j)$ for every $i, j \in S_{t-1}$, and define $k_t$ to be the vector whose $i$-th element is $k(x_i, x_t)$. We have

    $\|\delta_t\|^2 = \min_{d} \left( d^T K_{t-1} d - 2 d^T k_t + k(x_t, x_t) \right)$,    (7)

where $d = (d_1, \ldots, d_{t-1})^T$. Solving Eq. (7), that is, applying the extremum conditions with respect to $d$, we obtain

    $d^* = K_{t-1}^{-1} k_t$    (8)

and, by substituting Eq. (8) into Eq. (7),

    $\|\delta_t\|^2 = k(x_t, x_t) - k_t^T d^*$.    (9)

Furthermore, substituting Eq. (8) into Eq. (3), we get

    $f''_t = f_{t-1} + y_t \sum_{j \in S_{t-1}} d^*_j k(x_j, \cdot)$.    (10)

We have shown how to calculate both the distance $\|\delta_t\|$ and the projected hypothesis $f''_t$. In summary, one needs to compute $d^*$ according to Eq. (8) and plug the result either into Eq. (9), to obtain $\|\delta_t\|$, or into Eq. (10), to obtain the projected hypothesis. In order to make the computation more tractable, we introduce an efficient method to calculate the inverse matrix $K_{t-1}^{-1}$ iteratively. This method was first introduced in (Cauwenberghs & Poggio, 2000), and we give it here only for completeness. We note in passing that the matrix $K_{t-1}$ can be safely inverted since, by incremental construction, it is always full-rank. After the addition of a new sample, $K_t^{-1}$ becomes

    $K_t^{-1} = \begin{pmatrix} K_{t-1}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \frac{1}{\|\delta_t\|^2} \begin{pmatrix} d^* \\ -1 \end{pmatrix} \begin{pmatrix} d^* \\ -1 \end{pmatrix}^T$,    (11)

where $d^*$ and $\|\delta_t\|^2$ have already been evaluated during the previous steps of the algorithm. Thanks to this incremental evaluation, the time complexity of the linear independence check is $O(|S_{t-1}|^2)$, as one can easily see from Eq. (8).
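The bookkeeping described above could be implemented along the following lines; this is a sketch, the function names are ours, and the base case for the first support vector is handled as noted in the comment.

```python
import numpy as np

def project_coeffs(K_inv, k_t, k_tt):
    """Projection of k(x_t, .) onto the span of the current support set.

    K_inv : cached inverse kernel matrix K_{t-1}^{-1} of the support set
    k_t   : vector with entries k(x_i, x_t), i in S_{t-1}
    k_tt  : scalar k(x_t, x_t)
    Returns d* = K_{t-1}^{-1} k_t (Eq. 8) and ||delta_t||^2 (Eq. 9), in O(|S_{t-1}|^2).
    """
    d_star = K_inv @ k_t
    delta_sq = k_tt - k_t @ d_star
    return d_star, delta_sq

def grow_inverse(K_inv, d_star, delta_sq):
    """Incremental inverse update of Eq. (11), used when x_t enters the support set.

    For the very first support vector one simply starts from K_1^{-1} = [[1 / k(x_1, x_1)]].
    """
    n = K_inv.shape[0]
    padded = np.zeros((n + 1, n + 1))
    padded[:n, :n] = K_inv                          # [[K_{t-1}^{-1}, 0], [0, 0]]
    v = np.append(d_star, -1.0)                     # (d*; -1)
    return padded + np.outer(v, v) / delta_sq       # rank-one correction of Eq. (11)
```

Calling grow_inverse only when $\|\delta_t\| > \eta$ keeps the cached inverse matched to the support set, so each round costs $O(|S_{t-1}|^2)$ instead of a full $O(|S_{t-1}|^3)$ matrix inversion.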
4. Analysis

In this section we analyze the performance of the Projectron algorithm in the usual framework of online learning with a competitor. First, we present a theorem which states that the size of the support set is bounded.

Theorem 1. Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous Mercer kernel, with $\mathcal{X}$ a compact subset of a Banach space. Then, for any training sequence $(x_i, y_i)$, $i = 1, 2, \ldots$, and for any $\eta > 0$, the size of the support set of the Projectron algorithm is finite.

The proof of this theorem goes along the same lines as the proof of Theorem 3.1 in (Engel et al., 2002), and we omit it for brevity. Note that this theorem guarantees that the size of the support set is bounded; however, it does not state that the size of the support set is fixed, or that it can be estimated before training.

The next theorem provides a mistake bound. The main idea is to bound the maximum number of mistakes of the algorithm relative to the best hypothesis $g \in \mathcal{H}$ chosen in hindsight. Let us define $D_1$ as

    $D_1 = \sum_{t=1}^{T} \ell(g(x_t), y_t)$,    (12)

where $\ell(g(x_t), y_t)$ is the hinge loss suffered by the function $g$ on the example $(x_t, y_t)$, that is, $\max\{0, 1 - y_t g(x_t)\}$. With these definitions we can state the following bound for the Projectron algorithm.

Theorem 2. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le R^2$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$. Assume that the Projectron algorithm is run with $0 < \eta < \frac{2 - R^2}{2\|g\|}$. Then the number of prediction mistakes the Projectron makes on the sequence is at most

    $\frac{\|g\|^2 + 2 D_1}{2 - R^2 - 2\eta\|g\|}$.

The proof of this theorem is based on the following lemma.

Lemma 1. Let $(x, y)$ be an example, with $x \in \mathcal{X}$ and $y \in \{-1, +1\}$. Denote by $f$ a hypothesis in $\mathcal{H}$ such that $y f(x) < 1$. Let $f' = f + \tau y\, q(\cdot)$, where $q(\cdot) \in \mathcal{H}$. Then the following bound holds for any $\tau \ge 0$:

    $\|f - g\|^2 - \|f' - g\|^2 \ge 2\tau\, \ell(f(x), y) - 2\tau\, \ell(g(x), y) - \tau^2 \|q(\cdot)\|^2 - 2\tau y \langle f, q(\cdot) - k(x, \cdot)\rangle - 2\tau \|q(\cdot) - k(x, \cdot)\| \cdot \|g\|$.

Proof. We have

    $\|f - g\|^2 - \|f' - g\|^2 = 2\tau y \langle g - f, q(\cdot)\rangle - \tau^2 \|q(\cdot)\|^2$
    $= 2\tau y \left(g(x) - f(x)\right) - \tau^2 \|q(\cdot)\|^2 + 2\tau y \langle g - f, q(\cdot) - k(x, \cdot)\rangle$
    $\ge 2\tau\, \ell(f(x), y) - 2\tau\, \ell(g(x), y) - \tau^2 \|q(\cdot)\|^2 - 2\tau y \langle f, q(\cdot) - k(x, \cdot)\rangle - 2\tau \|q(\cdot) - k(x, \cdot)\| \cdot \|g\|$,

where the last inequality uses $y f(x) < 1$, the definition of the hinge loss, and the Cauchy-Schwarz inequality.

With this bound we are ready to prove Thm. 2.

Proof. Define the relative progress in each round as $\Delta_t = \|f_{t-1} - g\|^2 - \|f_t - g\|^2$, and let $\tau_t$ be 1 if there is a prediction mistake on round $t$ and 0 otherwise. We bound the progress from above and below. On rounds in which there is no mistake $\Delta_t$ is 0. On rounds in which there is a mistake there are two possible updates: either $f_t = f_{t-1} + y_t P_{t-1} k(x_t, \cdot)$ or $f_t = f_{t-1} + y_t k(x_t, \cdot)$. In the following we bound the progress from below when the update is of the former type (the same bound can be obtained for the latter type as well, but the derivation is omitted). In particular, we apply Lemma 1 with $\tau = \tau_t$ and $q(\cdot) = P_{t-1} k(x_t, \cdot)$ (the condition $y_t f_{t-1}(x_t) < 1$ of the lemma holds on mistake rounds), and use $\delta_t = y_t P_{t-1} k(x_t, \cdot) - y_t k(x_t, \cdot)$ from Eq. (4), obtaining

    $\Delta_t = \|f_{t-1} - g\|^2 - \|f_t - g\|^2 \ge 2\tau_t\, \ell(f_{t-1}(x_t), y_t) - 2\tau_t\, \ell(g(x_t), y_t) - \tau_t \|P_{t-1} k(x_t, \cdot)\|^2 - 2\tau_t \langle f_{t-1}, \delta_t\rangle - 2\tau_t \|\delta_t\| \cdot \|g\|$,

where we used $\tau_t^2 = \tau_t$. Note that $\langle f_{t-1}, \delta_t\rangle = 0$, because $f_{t-1}$ belongs to the space spanned by the functions indexed by $S_{t-1}$. Moreover, on every projection update $\|\delta_t\| \le \eta$, and, since the projection is a contraction, the theorem assumption gives $\|P_{t-1} k(x_t, \cdot)\| \le R$. We then have

    $\Delta_t \ge 2\tau_t\left(\ell(f_{t-1}(x_t), y_t) - \ell(g(x_t), y_t)\right) - \tau_t R^2 - 2\tau_t \eta \|g\|$.

We can further bound $\Delta_t$ by noting that on every prediction mistake $\ell(f_{t-1}(x_t), y_t) \ge 1$. Overall we have

    $\|f_{t-1} - g\|^2 - \|f_t - g\|^2 \ge 2\tau_t\left(1 - \ell(g(x_t), y_t)\right) - \tau_t R^2 - 2\tau_t \eta \|g\|$.

We now sum both sides over $t$. Since $\tau_t \le 1$, we have $\sum_t \tau_t\, \ell(g(x_t), y_t) \le D_1$, while $\sum_t \tau_t = M$, the number of mistakes. The left hand side is a telescoping sum, hence it collapses to $\|f_0 - g\|^2 - \|f_T - g\|^2$, which can be upper bounded by $\|g\|^2$, using the fact that $f_0 = 0$ and that $\|f_T - g\|^2$ is non-negative. Finally, we obtain

    $\|g\|^2 + 2 D_1 \ge M \left(2 - R^2 - 2\eta \|g\|\right)$,

which proves the theorem.
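For completeness, the orthogonality step used in the proof can be spelled out as follows; it is a standard property of orthogonal projections, written here in the paper's notation.

```latex
% For any h in H_{t-1} (in particular h = f_{t-1}), using that P_{t-1} is the
% orthogonal projection onto H_{t-1}, hence self-adjoint, and that P_{t-1} h = h:
\langle h, \delta_t \rangle
  = y_t \,\langle h,\; P_{t-1} k(x_t,\cdot) - k(x_t,\cdot) \rangle
  = y_t \left( \langle P_{t-1} h,\; k(x_t,\cdot) \rangle - \langle h,\; k(x_t,\cdot) \rangle \right)
  = 0 .
```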
To compare with other similar algorithms, it can be useful to reformulate the algorithm so that it uses the maximum norm of $g$ as a parameter instead of $\eta$. Hence, we can fix an upper bound $U$ on $\|g\|$ and then set $\eta$ so as to guarantee a positive progress. Specifically, on each round we set $\eta$ to be

    $\eta = \frac{1}{2U}\left(2\, \ell(f_{t-1}(x_t), y_t) - \|P_{t-1} k(x_t, \cdot)\|^2 - 0.5\right)$.    (13)

The next corollary, based on Thm. 2, provides a mistake bound in terms of $U$ rather than $\eta$.

Corollary 1. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le 1$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$, whose norm $\|g\|$ is bounded by $U$. Assume that the Projectron algorithm is run with a parameter $\eta$ which is set on each round according to Eq. (13). Then the number of prediction mistakes the Projectron makes on the sequence is at most $2\|g\|^2 + 4 D_1$.

Notice that the bound in Corollary 1 is similar to Thm. 5.1 in (Dekel et al., 2007) for the Forgetron algorithm. The difference is in the assumptions made: in the Forgetron, the size of the support set is guaranteed to be less than a fixed size $B$ that depends on $U$, while in the Projectron we choose the value of $\eta$ or, equivalently, $U$, and there is no guarantee on the exact size of the support set. However, the experimental results suggest that, under the same assumptions used in the derivation of the Forgetron bound, the Projectron needs a smaller support set and makes fewer mistakes.
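Corollary 1 is stated without an explicit proof; assuming Eq. (13) as written above, its constant can be recovered by repeating the argument of the proof of Thm. 2 with the per-round choice of $\eta$ (a sketch, not the authors' derivation):

```latex
% On a mistake round with a projected update (proof of Thm. 2, tau_t = 1):
\Delta_t \;\ge\; 2\,\ell(f_{t-1}(x_t),y_t) - 2\,\ell(g(x_t),y_t)
                 - \|P_{t-1}k(x_t,\cdot)\|^2 - 2\eta\|g\| .
% With \|g\| \le U and \eta chosen as in Eq. (13),
% 2\eta U = 2\,\ell(f_{t-1}(x_t),y_t) - \|P_{t-1}k(x_t,\cdot)\|^2 - \tfrac{1}{2}, so
\Delta_t \;\ge\; \tfrac{1}{2} - 2\,\ell(g(x_t),y_t) .
% (Full, non-projected updates give at least the same progress when k(x_t,x_t) \le 1.)
% Summing over the M mistake rounds and telescoping as before:
\|g\|^2 \;\ge\; \tfrac{M}{2} - 2 D_1
\quad\Longrightarrow\quad
M \;\le\; 2\|g\|^2 + 4 D_1 .
```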
It is also possible to give yet another bound by slightly changing the proof of Thm. 2. This theorem is a worst-case mistake bound for the Projectron algorithm. We state it here without proof, leaving it for a longer version of this paper.

Theorem 3. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of instance-label pairs where $x_t \in \mathcal{X}$, $y_t \in \{-1, +1\}$, and $k(x_t, x_t) \le R^2$ for all $t$. Let $g$ be an arbitrary function in $\mathcal{H}$. Assume that the Projectron algorithm is run with $0 < \eta < \frac{1}{\|g\|}$. Then $M$, the number of prediction mistakes the Projectron makes on the sequence, is at most

    $\left( \frac{R\|g\| + \sqrt{R^2\|g\|^2 + 4 D_1}}{2\left(1 - \eta\|g\|\right)} \right)^2$.

The last theorem suggests that the performance of the Projectron is slightly worse than that of the Perceptron (Shalev-Shwartz & Singer, 2005). Specifically, the degradation in the performance of the Projectron compared to the Perceptron is governed by the factor $1/(1 - \eta\|g\|)^2$. In the next subsection we present a variant of the Projectron algorithm which attains even better performance.

4.1. Going Beyond the Perceptron

The proofs of Thm. 2 and Corollary 1 show us how to improve the Projectron algorithm so that it goes beyond the performance of the Perceptron algorithm, while maintaining a bounded support set. Let us start from the algorithm of Corollary 1. We change it so that an update takes place not only when there is a prediction mistake, but also when the prediction is correct but with low confidence. We refer to this latter case as a margin error, that is, $0 < y_t f_{t-1}(x_t) < 1$. This strategy improves the classification rate, but it also increases the size of the support set (Crammer et al., 2006). A possible solution to this obstacle is to update on margin error rounds only when the new instance can be projected onto the support set. Hence, the update on margin error rounds would be of the general form

    $f_t = f_{t-1} + y_t \tau_t P_{t-1} k(x_t, \cdot)$,    (14)

with $0 < \tau_t \le 1$. The last constraint comes from the proofs of Thm. 2 and Corollary 1, in which we upper bound $\tau_t$ by 1. Note that setting $\tau_t$ to 0 is equivalent to leaving the hypothesis unchanged. The bound in Corollary 1 becomes

    $M \le 2\Big(\|g\|^2 + 2 D_1 - \sum_{t:\, 0 < y_t f_{t-1}(x_t) < 1} \cdots \Big)$    (15)