Online Learning by Ellipsoid Method

Liu Yang (liuy@cs.cmu.edu)
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA

Rong Jin (rongjin@cse.msu.edu)
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA

Jieping Ye (jieping.ye@asu.edu)
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

In this work, we extend the ellipsoid method, which was originally designed for convex optimization, to online learning. The key idea is to approximate by an ellipsoid the classification hypotheses that are consistent with all the training examples received so far. This is in contrast to most online learning algorithms, where only a single classifier is maintained at each iteration. Efficient algorithms are presented for updating both the centroid and the positive definite matrix of the ellipsoid given a misclassified example. In addition to the classical ellipsoid method, an improved version for online learning is also presented. Mistake bounds for both ellipsoid methods are derived. Evaluation on the USPS dataset and three UCI data-sets shows encouraging results when comparing the proposed online learning algorithm to two state-of-the-art online learners.

1. Introduction

Online learning aims to learn statistical models from sequentially received training examples.
Compared to batch learning, one of the key requirements of online learning is that the statistical model has to be updated efficiently whenever a new training example is received. In the past decades, a large number of online learning algorithms have been proposed and studied (Li & Long, 2002; Gentile, 2002; Crammer & Singer, 2003; Crammer et al., 2006; Rosenblatt, 1958; Kivinen & Warmuth, 1997; Littlestone, 1989; Gentile & Warmuth, 1998; Kivinen et al., 2002). Most of them are additive, i.e., given a misclassified example $(x_i, y_i)$, the classification model, denoted by a weight vector $w$, is updated by shifting along the direction of $y_i x_i$, i.e., $w \leftarrow w + \eta_i y_i x_i$, where $\eta_i$ weights the misclassified example. Grove et al. (2001) generalized the additive approaches into a quasi-additive framework that unifies a number of seemingly different online learning algorithms (e.g., Perceptron and Winnow). Several strategies have been proposed to extend online learning algorithms, originally designed for binary classification, to multi-label learning (Fink et al., 2006; Crammer & Singer, 2003; Crammer et al., 2006). Herbster et al. (2005) extended graph-based approaches to online learning, and Shalev-Shwartz & Singer (2006) and Amit et al. (2007) exploited the dual formulation of optimization for online learning.

One common feature shared by most online learning algorithms is that they maintain only a single solution for the classification model at any trial. We discuss the shortcoming of this feature from two different perspectives. (I) Bayesian viewpoint: by maintaining only a single solution, these online learning approaches resemble point estimation in statistics. This is insufficient from the Bayesian viewpoint, which requires computing not only the most likely solution but also the distribution over all possible solutions. (II) Information viewpoint: these online learning approaches essentially summarize all the information in the training data by a single solution, and are therefore inefficient in exploiting the training data.

To address the above problems, we propose ellipsoid methods for online learning. Instead of maintaining only a single solution, we follow the Bayesian spirit and approximate by an ellipsoid all the classification models that are consistent with the training examples received so far. Since each ellipsoid is described by two quantities, i.e., the centroid of the ellipsoid and the positive definite matrix that decides its shape, the ellipsoid methods are able to maintain more information about the training data than most existing online learning algorithms.

2. Online Learning by Ellipsoid Methods

We first introduce the classical ellipsoid method for convex programming, followed by two variants of the ellipsoid method for online learning.

2.1. Introduction to Ellipsoid Method for Convex Programming

The ellipsoid method (Shor, 1977) is a first-order method for convex programming. Given an optimization problem $x_* = \arg\min\{f(x) : x \in G\}$, where $f(x)$ is a convex objective function and $G \subseteq \mathbb{R}^d$ is a convex solid, the ellipsoid method starts with a large ellipsoid $E_1 \supseteq G$. Let $E_k = \{x \,|\, (x - x_k)^\top P_k^{-1} (x - x_k) \le 1\}$ be the ellipsoid available at the $k$-th iteration that includes the optimal solution $x_*$. Here $x_k \in \mathbb{R}^d$ is the center of $E_k$, and $P_k \in S_{++}^{d \times d}$ is a positive definite matrix that defines the shape of $E_k$. The key question of the ellipsoid method is how to update the ellipsoid efficiently. To this end, it computes the gradient of $f(x)$ at $x_k$, denoted by $h_k$, and constructs the half space $H_k = \{x \,|\, h_k^\top (x - x_k) \le 0\}$. Using the convexity of $f(x)$, it is easy to show that $x_* \in H_k \cap E_k$. Hence, a new ellipsoid $E_{k+1} = \{x \,|\, (x - x_{k+1})^\top P_{k+1}^{-1} (x - x_{k+1}) \le 1\}$ is constructed to cover the intersection $H_k \cap E_k$, where $x_{k+1}$ and $P_{k+1}$ are computed as follows:

$$x_{k+1} = x_k - \frac{P_k h_k}{(d+1)\sqrt{h_k^\top P_k h_k}}, \qquad P_{k+1} = \frac{d^2}{d^2-1}\left(P_k - \frac{2\, P_k h_k h_k^\top P_k}{(d+1)\, h_k^\top P_k h_k}\right) \quad (1)$$
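To make the update in (1) concrete, here is a minimal numerical sketch (ours, not from the paper; the function name `ellipsoid_cut` and the toy quadratic objective are illustrative choices) of one central-cut ellipsoid step in Python:

```python
import numpy as np

def ellipsoid_cut(x_k, P_k, h_k):
    """One central-cut ellipsoid update, Eq. (1): shrink the ellipsoid
    {x : (x - x_k)^T P_k^{-1} (x - x_k) <= 1} so that it still covers its
    intersection with the half space {x : h_k^T (x - x_k) <= 0}."""
    d = x_k.shape[0]
    Ph = P_k @ h_k
    hPh = h_k @ Ph                                   # h_k^T P_k h_k
    x_next = x_k - Ph / ((d + 1) * np.sqrt(hPh))
    P_next = d**2 / (d**2 - 1) * (P_k - 2.0 / (d + 1) * np.outer(Ph, Ph) / hPh)
    return x_next, P_next

# Toy usage: localize the minimizer of f(x) = |x - x_star|^2 (illustrative only).
rng = np.random.default_rng(0)
d = 5
x_star = rng.normal(size=d)
x, P = np.zeros(d), 25.0 * np.eye(d)                 # E_1: a ball large enough to contain x_star
best = x.copy()
for _ in range(200):
    grad = 2 * (x - x_star)                          # gradient of f at the current center
    if np.linalg.norm(grad) < 1e-12:
        break
    if np.linalg.norm(x - x_star) < np.linalg.norm(best - x_star):
        best = x.copy()
    x, P = ellipsoid_cut(x, P, grad)
print("distance of best center to optimum:", np.linalg.norm(best - x_star))
```

Each cut shrinks the ellipsoid volume by a constant factor, which is the same mechanism the online variants below exploit.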
2.2. The Classical Ellipsoid Method for Online Learning (CELLIP)

In this section, we focus on binary classification problems and assume that there exists a $\gamma$-margin classifier $u \in \mathbb{R}^d$ that classifies any instance $(x, y)$ with margin $\gamma$, i.e., $y\, u^\top x \ge \gamma$, where $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$. For the convenience of discussion, we assume $|u|_2 = 1$ and $|x|_2 \le 1$ for any instance. The extension of the ellipsoid method to the inseparable case and to multiple-label learning will be discussed later.

To exploit the ellipsoid method, we convert an online learning problem into a feasibility problem, namely how to efficiently find a solution that is close to the $\gamma$-margin classifier $u$ given the sequentially received training examples. In particular, at each trial $t$, we consider constructing the set $A_t$ defined as follows:

$$A_t = \{z \in \mathbb{R}^d \,|\, y_i x_i^\top z \ge a\gamma,\ i = 1, \ldots, t\} \quad (2)$$

According to the above definition, $A_t$ includes all the classifiers $z$ that classify the training examples received in the first $t$ trials with margin $a\gamma$. Here $0 \le a \le 1$ is a predefined constant. The following lemma shows an important property of $A_t$.

Lemma 1. Let $B = \{z \,|\, |z - u|_2 \le \gamma(1-a)\}$ denote the ball centered at $u$ with radius $r = \gamma(1-a)$, where $u \in \mathbb{R}^d$ is a $\gamma$-margin classifier for all labeled instances. We have $A_t \supseteq B$.

Proof. First, we have $u \in A_t$ because $y_i x_i^\top u \ge \gamma \ge a\gamma$, $i = 1, \ldots, t$. Hence, to show $A_t \supseteq B$, it is sufficient to show that the distance between $u$ and each hyper-plane $y_i x_i^\top z = a\gamma$ is lower bounded by $\gamma(1-a)$, which can be verified easily.

Lemma 1 indicates that if there exists a $\gamma$-margin classifier $u$, the volume of $A_t$, denoted by $\mathrm{vol}(A_t)$, is lower bounded by $\mathrm{vol}(B)$, which becomes the key to the proof of the mistake bound. To efficiently represent $A_t$, we construct an ellipsoid

$$E_t = \{z \in \mathbb{R}^d \,|\, (z - w_t)^\top P_t^{-1} (z - w_t) \le 1\} \quad (3)$$

such that $E_t \supseteq A_t$. Since $E_t \supseteq A_t \supseteq B$, our goal is to efficiently reduce $\mathrm{vol}(E_t)$. Below we describe how to efficiently update the ellipsoid $E_t$ given a misclassified example.

Let $x_t \in \mathbb{R}^d$ be an example that is misclassified by $w_t$, i.e., $y_t w_t^\top x_t \le 0$, where $y_t \in \{-1, +1\}$ is the binary class label assigned to $x_t$. Let $C_t = \{z \in \mathbb{R}^d \,|\, y_t x_t^\top z \ge a\gamma\}$ denote the half space generated by the misclassified example. Evidently, we have $u \in C_t \cap E_t$ since $y_t u^\top x_t \ge \gamma$. For the convenience of discussion, we rewrite the set $C_t$ as follows:

$$C_t = \{z \in \mathbb{R}^d \,|\, \alpha_t - g_t^\top (z - w_t) \le 0\} \quad (4)$$

where $\alpha_t$ and $g_t$ are defined as

$$\alpha_t = \frac{a\gamma - y_t w_t^\top x_t}{\sqrt{x_t^\top P_t x_t}}, \qquad g_t = \frac{y_t x_t}{\sqrt{x_t^\top P_t x_t}} \quad (5)$$

Note that $\alpha_t \ge 0$ since $y_t w_t^\top x_t \le 0$, and that $g_t^\top P_t g_t = 1$. The following theorem gives a family of updating equations for $w_t$ and $P_t$ that ensures $E_{t+1} \supseteq E_t \cap C_t$.

Theorem 1. Given a misclassified instance $(x_t, y_t)$, the following updating equations for $w_{t+1}$ and $P_{t+1}$ guarantee that the resulting new ellipsoid $E_{t+1}$ covers the intersection $E_t \cap C_t$:

$$w_{t+1} = w_t + (\alpha_t + \tau)\, P_t g_t \quad (6)$$
$$P_{t+1} = \mu^2 P_t + \left([1 - \alpha_t - \tau]^2 - \mu^2\right) P_t g_t g_t^\top P_t \quad (7)$$

provided the parameters $\tau \ge 0$ and $\mu > 0$ satisfy the constraint

$$\frac{1 - \alpha_t^2}{\mu^2} + \frac{\tau^2}{(1 - \alpha_t - \tau)^2} \le 1 \quad (8)$$

The proof can be found in Appendix A of the supplementary materials. The following corollary shows the volume reduction after the update.

Corollary 2. Using the updating equations in (6) and (7), we have

$$\mathrm{vol}(E_{t+1}) = \mu^{d-1} (1 - \alpha_t - \tau)\, \mathrm{vol}(E_t) \quad (9)$$

For the convenience of discussion, we choose $\tau = 0$ and $\mu^2 = 1 - \alpha_t^2$. The corresponding updating equations for $w_t$ and $P_t$ become

$$w_{t+1} = w_t + \alpha_t P_t g_t \quad (10)$$
$$P_{t+1} = (1 - \alpha_t^2)\, P_t - 2\alpha_t (1 - \alpha_t)\, P_t g_t g_t^\top P_t \quad (11)$$

The volume reduction under the above updating equations is

$$\frac{\mathrm{vol}(E_{t+1})}{\mathrm{vol}(E_t)} = (1 - \alpha_t^2)^{(d-1)/2} (1 - \alpha_t) \quad (12)$$

Algorithm 1 summarizes the classical ellipsoid method for online learning. Note that in Algorithm 1 we initialize $P_1 = (1 + \gamma(1-a)) I_d$ to ensure $B = \{z \,|\, |z - u|_2 \le \gamma(1-a)\} \subseteq E_1$. We refer to the method as the Classical Ellipsoid Method for Online Learning, or CELLIP for short.

Algorithm 1 The classical ellipsoid method (CELLIP) for online learning
1: INPUT:
   - $\gamma \ge 0$: the desired classification margin
   - $a \in [0, 1]$: a tradeoff parameter
2: INITIALIZE: $w_1 = 0$ and $P_1 = (1 + \gamma(1-a)) I_d$
3: for $t = 1, 2, \ldots, T$ do
4:   receive an instance $x_t$
5:   predict its class label: $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
6:   receive the correct class label $y_t$
7:   if $y_t \ne \hat{y}_t$ then
8:     compute $w_{t+1}$ and $P_{t+1}$ using (10) and (11)
9:   else
10:    $w_{t+1} \leftarrow w_t$ and $P_{t+1} \leftarrow P_t$
11:  end if
12: end for

The following theorem gives the mistake bound for CELLIP.

Theorem 3. Let $D = \{(x_i, y_i),\ i = 1, 2, \ldots, T\}$ be the set of training examples. Assume all the examples are normalized, i.e., $|x_i|_2 \le 1$, and that there exists a classifier $u \in \mathbb{R}^d$ with $|u|_2 = 1$ that classifies all the training examples in $D$ with a margin $0 \le \gamma \le 1$, i.e., $y_i u^\top x_i \ge \gamma$ for any $(x_i, y_i)$ in $D$. Then the number of mistakes made by the classical ellipsoid method when learning from $D$ (Algorithm 1), denoted by $M$, is upper bounded by

$$M \le \frac{2\log(1-a) + 2\log\gamma - \log(1 + \gamma(1-a))}{\log\left(1 - \frac{a^2\gamma^2}{(1 + \gamma(1-a))^2}\right)} \quad (13)$$

The proof of the above theorem can be found in Appendix B of the supplementary materials.
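The following is a minimal sketch of one CELLIP round as we read Algorithm 1 with the updates (5), (10), and (11); the function name, the separable toy stream, and the parameter values are our own choices for illustration, not the authors' code:

```python
import numpy as np

def cellip_update(w, P, x, y, gamma, a):
    """One CELLIP round (Algorithm 1): predict, and on a mistake apply
    Eqs. (5), (10), (11) so the new ellipsoid still covers the consistent classifiers."""
    if y * (w @ x) > 0:                    # correct prediction: no update
        return w, P
    s = np.sqrt(x @ P @ x)                 # sqrt(x_t^T P_t x_t)
    alpha = (a * gamma - y * (w @ x)) / s  # Eq. (5)
    g = y * x / s
    Pg = P @ g
    w_next = w + alpha * Pg                                                    # Eq. (10)
    P_next = (1 - alpha**2) * P - 2 * alpha * (1 - alpha) * np.outer(Pg, Pg)   # Eq. (11)
    return w_next, P_next

# Usage on a separable toy stream (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
d, gamma, a = 10, 0.2, 0.5
u = rng.normal(size=d); u /= np.linalg.norm(u)
w, P = np.zeros(d), (1 + gamma * (1 - a)) * np.eye(d)   # P_1 as in Algorithm 1
mistakes = 0
for _ in range(2000):
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    if abs(u @ x) < gamma:                 # keep only examples u separates with margin gamma
        continue
    y = np.sign(u @ x)
    if y * (w @ x) <= 0:
        mistakes += 1
    w, P = cellip_update(w, P, x, y, gamma, a)
print("mistakes:", mistakes)
```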
2.3. Improved Ellipsoid Method for Online Learning

One major problem with the above classical ellipsoid method for online learning is that it is theoretically incapable of handling the inseparable case. In this subsection, we present an improved ellipsoid method for online learning that addresses the inseparable case. Clearly, in the inseparable case we have to drop the idea of casting online learning as a feasibility problem, since no classifier can classify all the instances correctly. Instead, we treat $w_t$ and $P_t$, i.e., the center and the positive definite matrix of the ellipsoid, as a summarization of the information in the received training examples. Since $w_{t+1}$ is a linear combination of the training examples received in the first $t$ trials, it can be viewed as a kind of first-order statistic of the training examples. To understand the relationship between $P_t$ and the received training examples, we derive the updating equation for $P_t^{-1}$ using (11):

$$P_{t+1}^{-1} = \frac{1}{1 - \alpha_t^2}\, P_t^{-1} + \frac{2\alpha_t}{(1 - \alpha_t^2)(1 - \alpha_t)}\, g_t g_t^\top$$

The above expression follows directly from the matrix inversion lemma. Using the above expression, it is not difficult to show that

$$P_{t+1}^{-1} = \beta_0 P_1^{-1} + \sum_{i=1}^t \beta_i\, g_i g_i^\top = \eta_0 P_1^{-1} + \sum_{i=1}^t \eta_i\, x_i x_i^\top \quad (14)$$

where the coefficients $\beta_i$ and $\eta_i$ are functions of $\{\alpha_j\}_{j=i}^t$. The expression in (14) indicates that $P_t^{-1}$ can be viewed as a weighted covariance matrix that stores the second-order information of the training examples.

The above observation motivates the development of an improved ellipsoid method for online learning. We keep the updating equation (10) for $w_t$, and modify the updating equation for $P_t$ as follows:

$$P_{t+1} = \frac{1}{1 - c_t}\left(P_t - c_t\, P_t g_t g_t^\top P_t\right) \quad (15)$$

where $c_t \in [0, 1]$. We set $c_t = c\, b^{t-1}$, where $0 \le c, b \le 1$ are two constants that are set manually. The exponential form for $c_t$ is important for the proof of the mistake bound, as revealed in the supplementary materials. It is not difficult to verify the inductive relationship for $P_t^{-1}$, i.e.,

$$P_{t+1}^{-1} = (1 - c_t)\, P_t^{-1} + c_t\, g_t g_t^\top$$

As indicated above, $P_{t+1}^{-1}$ can be viewed as a mixture of the matrices $P_t^{-1}$ and $g_t g_t^\top$. Given $c_t = c\, b^{t-1}$, it is not difficult to see that $P_{t+1}^{-1}$ is a weighted sum of the matrices $x_i x_i^\top$, where the weight for $x_i x_i^\top$ decays exponentially in $t$. By varying the constants $c$ and $b$, we are able to adjust the "memory" of $P_t$. In particular, the smaller $b$ is, the shorter the memory. The effect of $b$ will be further revealed in our empirical study. Algorithm 2 summarizes the improved ellipsoid method for online learning. We refer to it as the Improved Ellipsoid Method for Online Learning, or IELLIP for short.

Algorithm 2 The improved ellipsoid method (IELLIP) for online learning
INPUT:
  - $\gamma \ge 0$: the desired classification margin
  - $0 \le c, b \le 1$: parameters controlling the memory of online learning
INITIALIZE: $w_1 = 0$ and $P_1 = I_d$
for $t = 1, 2, \ldots, T$ do
  receive an instance $x_t$
  predict its class label: $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
  receive the correct class label $y_t$
  if $y_t \ne \hat{y}_t$ then
    compute $w_{t+1}$ and $P_{t+1}$ using (10) and (15)
  else
    $w_{t+1} \leftarrow w_t$ and $P_{t+1} \leftarrow P_t$
  end if
end for

Before we present the mistake bound for the improved ellipsoid method, like many online learning algorithms, we introduce the following quantity for measuring the progress of online learning:

$$q_t = (u - w_t)^\top P_t^{-1} (u - w_t) \quad (16)$$

where $u$ is some optimal classifier. Note that, compared to the conventional approaches for analyzing online learning algorithms, we introduce $P_t^{-1}$ in (16) for measuring the distance between $u$ and $w_t$. The following lemma shows an important inductive property of $q_t$, which is the key to the proof of the mistake bound.

Lemma 2.
$$q_{t+1} \le (1 - c_t)\, q_t + \lambda_t^2 + c_t (u^\top g_t)^2 - 2\lambda_t\, u^\top g_t \quad (17)$$
where $\lambda_t = \gamma / \sqrt{x_t^\top P_t x_t}$.

It is straightforward to verify the result in Lemma 2. We now state the mistake bound for the improved ellipsoid method for online learning.

Theorem 4. Let $D = \{(x_i, y_i),\ i = 1, 2, \ldots, T\}$ be the set of training examples. Let $u$ be the optimal classifier with norm $|u|_2 = 1$. Assume all the examples are normalized, i.e., $|x_i|_2 \le 1$. If the parameters $c$ and $b$ satisfy the condition $c + b < 1$, then the number of mistakes made by running Algorithm 2 is upper bounded by the following expression

$$M \le \frac{1}{\gamma^2}\,\frac{1-b}{1-b-c} + \frac{2}{\gamma} \sum_{i=1}^T l_i(u) \quad (18)$$

where $l_i(u) = \max(0, \gamma - y_i u^\top x_i)$. The proof of Theorem 4 can be found in Appendix C of the supplementary materials. Note that when $c = 0$, the mistake bound reduces to $1/\gamma^2 + 2\sum_{i=1}^T l_i(u)/\gamma$, a common mistake bound for online learning.
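Analogously, here is a minimal sketch of one IELLIP round under our reading of (10) and (15) with $c_t = c\,b^{t-1}$. We take $a = 1$ in (5), which is an assumption on our part since Algorithm 2 leaves $a$ implicit, and the noisy toy stream and parameter values are purely illustrative:

```python
import numpy as np

def iellip_update(w, P, x, y, gamma, c_t):
    """One IELLIP round (Algorithm 2): on a mistake, move the center with Eq. (10)
    and mix the shape matrix with Eq. (15); c_t = c * b**(t-1) is passed in."""
    if y * (w @ x) > 0:                      # correct prediction: no update
        return w, P
    s = np.sqrt(x @ P @ x)
    alpha = (gamma - y * (w @ x)) / s        # Eq. (5) with a = 1 (our choice)
    g = y * x / s
    Pg = P @ g
    w_next = w + alpha * Pg                                # Eq. (10)
    P_next = (P - c_t * np.outer(Pg, Pg)) / (1.0 - c_t)    # Eq. (15)
    return w_next, P_next

# Usage on a noisy (inseparable) toy stream; b controls how fast the second-order
# "memory" in P decays.  All values here are illustrative, not from the paper.
rng = np.random.default_rng(2)
d, gamma, c, b = 10, 0.1, 0.5, 0.3
u = rng.normal(size=d); u /= np.linalg.norm(u)
w, P = np.zeros(d), np.eye(d)
mistakes = 0
for t in range(1, 2001):
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    y = np.sign(u @ x) * (1.0 if rng.random() > 0.05 else -1.0)   # 5% label noise
    if y * (w @ x) <= 0:
        mistakes += 1
    w, P = iellip_update(w, P, x, y, gamma, c * b ** (t - 1))
print("mistakes:", mistakes)
```

Because (15) keeps $P_{t+1}$ positive definite for any $c_t \in [0, 1)$ regardless of $\alpha_t$, the update remains well defined even when no separating classifier exists.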
2.4. Ellipsoid Methods for Multiple-Label Online Learning

We now follow the framework of Crammer & Singer (2002) and extend the ellipsoid method to multi-label learning. Let $K$ be the total number of classes. We denote by $w_i \in \mathbb{R}^d$, $i = 1, \ldots, K$, the weight vectors for the $K$ classes. Given an example $x$ assigned to a subset of classes $Y$, we define the classification margin with respect to a classifier $W$ as

$$\gamma(W; x, Y) = \min_{z \in Y} w_z^\top x - \max_{z \notin Y} w_z^\top x$$

We then define the loss function $l(W; x, Y)$ as

$$l(W; x, Y) = \max(0, \gamma - \gamma(W; x, Y))$$

where $\gamma$ is a predefined margin. To extend the ellipsoid method to multi-label learning, we construct the stacked vector $v = (w_1, \ldots, w_K)$. For a misclassified example $(x_i, Y_i)$, i.e., $\gamma(W; x_i, Y_i) \le 0$, we define two class indices $a_i$ and $b_i$ as

$$a_i = \arg\max_{y \notin Y_i} w_y^\top x_i, \qquad b_i = \arg\min_{y \in Y_i} w_y^\top x_i$$

We then construct a vector $z_i \in \mathbb{R}^{K \times d}$ that includes the information from $x_i$ and $Y_i$, i.e.,

$$[z_i]_j = \begin{cases} x_i^k & j = (b_i - 1)d + k \\ -x_i^k & j = (a_i - 1)d + k \\ 0 & \text{otherwise} \end{cases}$$

Similar to the previous discussion, we construct a half space for each misclassified example $z_t$:

$$H_t = \{v \in \mathbb{R}^{K \times d} \,|\, \alpha_t - (v - v_t)^\top g_t \le 0\}$$

where $\alpha_t$ and $g_t$ are identical to the expressions in (5), except that $y_t x_t$ is replaced by $z_t$. Using the definition of the classifier $v$, the misclassified example $z_i$, $\alpha_t$, and $g_t$, we can directly extend the two ellipsoid methods described in Algorithms 1 and 2 to multi-label learning. Similar mistake bounds can be derived for multi-label learning. Since the proof is literally a word-by-word copy of the proof for binary classification, we omit it.
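As a small illustration of this reduction (our own sketch; the helper name `multilabel_z` is hypothetical, and we use 0-based class indices instead of the 1-based ones in the displayed definition of $z_i$), the stacked classifier $v$ scores $z_i$ exactly at the multi-label margin:

```python
import numpy as np

def multilabel_z(W, x, Y):
    """Build the vector z_i of Section 2.4: +x in the block of the worst-scoring
    relevant class b_i, -x in the block of the best-scoring irrelevant class a_i."""
    K, d = W.shape
    scores = W @ x
    b = min(sorted(Y), key=lambda k: scores[k])                      # arg min_{y in Y}  w_y^T x
    a = max((k for k in range(K) if k not in Y), key=lambda k: scores[k])  # arg max_{y not in Y} w_y^T x
    z = np.zeros(K * d)
    z[b * d:(b + 1) * d] = x
    z[a * d:(a + 1) * d] = -x
    return z                        # note: v^T z = w_b^T x - w_a^T x = gamma(W; x, Y)

# Usage: the stacked classifier's score on z equals the multi-label margin.
rng = np.random.default_rng(3)
K, d = 4, 6
W = rng.normal(size=(K, d))
x = rng.normal(size=d)
Y = {0, 2}
v = W.reshape(-1)                   # v = (w_1, ..., w_K)
z = multilabel_z(W, x, Y)
margin = min(W[k] @ x for k in Y) - max(W[k] @ x for k in range(K) if k not in Y)
print(np.allclose(v @ z, margin))   # True
```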
3. Evaluation

We focus on the evaluation of the improved ellipsoid method for online learning. This is because the classical ellipsoid method for online learning is theoretically unable to handle the inseparable case, as pointed out before. This is further confirmed by our empirical study, which showed that the classical ellipsoid method is usually outperformed by the improved version. We thus omit the discussion of the classical ellipsoid method due to space limitations. For IELLIP, we initialize $P_1$ to be an identity matrix at the scale of 0.1; the vector $w$ is randomly initialized around the origin. We set $b = 0.3$ for all experiments except for the experiment devoted to examining the role of $b$ in the proposed online learning algorithm. Note that since only the relative scale between $P_1$ and $c$ matters, by setting the scale of $P_1$ we do not have to set $c$ in the implementation of IELLIP.

3.1. Datasets

The experiments are conducted on the USPS dataset of handwritten digits and three UCI multiclass data-sets (http://archive.ics.uci.edu/ml/data-sets.html). The data information is summarized in Table 1. For the UCI Isolet and Letter datasets, we select 80% of each class to form the training data-set and use the rest as testing data. For the USPS and UCI Shuttle datasets, we adopt the split between training and testing provided in the original data packages.

3.2. Baseline Methods and Evaluation Metrics

To demonstrate the efficiency and efficacy of IELLIP for multi-class learning, we compare it to two baseline algorithms. The first baseline is the Online Passive-Aggressive algorithm (PA) (Crammer et al., 2006). We implement the PA algorithm using the aggressiveness parameter corresponding to the best performance evaluated in (Crammer et al., 2006). As indicated in (Crammer, 2004), PA in general performs better than the generalized Perceptron algorithms because of its aggressiveness (i.e., large margins). The second baseline algorithm is the Margin Infused Relaxed Algorithm (MIRA) (Crammer & Singer, 2003), an online learning algorithm for multiclass large-margin classifiers with good generalization performance. We use in our experiments the implementation of MIRA downloaded from http://www.cis.upenn.edu/~crammer/code-index.html. For a fair comparison, all methods are restricted to linear classifiers. To this end, for MIRA, we set the polynomial degree to 1. The margin parameter was set to 0.1 for all algorithms and for all datasets.

Test error (Crammer & Singer, 2003) is used as the main evaluation metric in our study. It is defined as the number of prediction mistakes made on a given sequence of examples normalized by the length of the sequence. The traditional concept of an "epoch" is adopted as an ordering (random permutation) of all the examples in the training set. For example, in our experiments with three epochs, we cycle through all the training examples three times, with a different random permutation for each epoch, before calling the online learner. We report the results averaged over three random permutations for all four datasets.

3.3. Results of Multiclass Classification

Fig. 1 shows the classification results of the three online learning algorithms for the datasets USPS, UCI Letter, UCI Isolet, and UCI Shuttle: the first row shows the test errors, and the second row shows the numbers of updates; the three columns, from left to right, correspond to the results of the first, second, and third epoch, respectively. First, as indicated in the first row of Fig. 1, we observe that the test error of IELLIP is either comparable to or better than the best performance between PA and MIRA. The second row of Fig. 1 reveals that, in general, a smaller number of updates is required by IELLIP to achieve a test error that is either comparable to or better than those of PA and MIRA. For instance, for the UCI Shuttle dataset, we found that both PA and IELLIP achieve similar test errors across the three epochs, but the number of updates made by IELLIP is significantly smaller than that of PA. One exception is the UCI Letter dataset, for which the numbers of updates made by IELLIP and PA are significantly larger than that of MIRA in the second and third epochs. However, it is also important to note that the test errors of IELLIP and PA are significantly lower than that of MIRA in both epochs. When we compare IELLIP with PA on the UCI Letter dataset, we still observe a noticeable reduction in the number of updates by IELLIP. Therefore, we conclude that the proposed online learning algorithm is more efficient than the two baselines. Furthermore, since the number of updates is closely related to the number of examples used to construct the classifier, the above analysis indicates that the proposed approach tends to favor a sparser solution than PA and MIRA, a desirable property to have.
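For concreteness, the evaluation protocol of Section 3.2 (epochs as fresh random permutations of the training set, test error as mistakes divided by sequence length, and the update count as the sparseness proxy) can be sketched as below; the harness and the plain-perceptron usage are our own illustration, not the code used in the experiments:

```python
import numpy as np

def run_epochs(update_fn, predict_fn, state, X_train, y_train, X_test, y_test,
               n_epochs=3, seed=0):
    """Cycle through the training set n_epochs times, each time in a fresh random
    permutation, updating the online learner; report test error and update count."""
    rng = np.random.default_rng(seed)
    n_updates = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X_train)):
            state, updated = update_fn(state, X_train[i], y_train[i])
            n_updates += int(updated)
    errors = sum(predict_fn(state, x) != y for x, y in zip(X_test, y_test))
    return errors / len(X_test), n_updates

# Usage with a plain perceptron update on a toy binary problem (illustration only).
rng = np.random.default_rng(1)
d = 5
u = rng.normal(size=d)
X = rng.normal(size=(600, d)); y = np.sign(X @ u)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

def perceptron_update(w, x, yi):
    if yi * (w @ x) <= 0:
        return w + yi * x, True
    return w, False

err, ups = run_epochs(perceptron_update, lambda w, x: np.sign(w @ x),
                      np.zeros(d), X_tr, y_tr, X_te, y_te)
print(f"test error = {err:.3f}, updates = {ups}")
```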
3.4. The Role of Parameter b: Tradeoff Between Accuracy and Sparseness

As indicated in the previous analysis, the parameter $b$ controls the "memory" of the proposed algorithm. To examine the role of $b$, we follow (Crammer & Singer, 2003), in which a natural tradeoff between accuracy and sparseness of solutions was revealed for a family of additive online learners. Fig. 2 shows the number of updates vs. the test error of IELLIP for all four datasets at the 3rd epoch when we vary $b$ from 0.1 to 0.3, 0.6, and 0.9. For the convenience of comparison, both the X-axis and the Y-axis, which correspond to the number of updates and the test error respectively, are normalized by the quantities of MIRA. Note that since the sparsity of solutions is closely related to the number of updates, the plots in Fig. 2 essentially reveal the tradeoff between accuracy and sparseness of solutions that is controlled by the parameter $b$. We clearly see an overall trend between accuracy and sparseness across all four datasets. In particular, a larger $b$ usually leads to higher sparseness (i.e., a smaller number of updates) and lower accuracy (i.e., a higher test error). This can be understood as follows: when we keep a longer history of training examples in the matrix $P$ (i.e., a large $b$), the learning algorithm is less likely to be updated, and as a consequence important examples may not be assigned enough weight, which can lead to lower classification accuracy. Hence, by setting $b$ to a modest value (e.g., 0.3), we are able to achieve a balance between accuracy and sparseness of solutions, as revealed by the previous study.

Table 1. Data-sets used in the online learning experiments

  Name          No. of Training Eg.s   No. of Testing Eg.s   No. of Classes   No. of Attributes
  USPS          7,291                  2,007                 10               256
  UCI Letter    15,998                 4,002                 26               16
  UCI Isolet    800                    200                   10               200
  UCI Shuttle   43,500                 14,500                7                9

Figure 1. Experimental results for the datasets USPS, UCI Letter, UCI Isolet, and UCI Shuttle. The first row shows the test errors (%) and the second row shows the numbers of updates, for PA, MIRA, and IELLIP; the three columns correspond to the 1st, 2nd, and 3rd epoch. (Plots not reproduced here.)

4. Conclusion

We present novel methods for online learning (ELLIP) by exploiting the ellipsoid method for convex programming. Unlike conventional approaches to online learning, which maintain only a single classifier, the proposed method is able to capture all the classifiers that are consistent with the training examples via an ellipsoid. In addition, the shape of the ellipsoid, represented by a positive definite matrix, allows us to store more information about the training examples and provides additional control over the online updates. We also present an analysis of the mistake bounds and a generalization to multi-label learning for the ellipsoid methods. Empirical studies demonstrate the effectiveness of the proposed method compared with two state-of-the-art online learners. In the future, we plan to examine other variants of ellipsoid methods for online learning.
Figure 2. Experimental results for IELLIP using different $b$. The X-axis and Y-axis represent the number of updates and the test error, each normalized by the corresponding quantity of MIRA; the four panels correspond to (a) USPS, (b) UCI Letter, (c) UCI Isolet, and (d) UCI Shuttle, with curves for MIRA, PA, and IELLIP with $b = 0.1, 0.3, 0.6, 0.9$. (Plots not reproduced here.)

Acknowledgments

The work was partially supported by NSF (DBI-0640543 and IIS-0643494) and the US Army Research Office (W911NF-08-1-0403). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and ARO.

Appendix A: Proof of Theorem 1

We define $v = P_t^{-1/2}(z - w_t)$, and a unit ball and a half space for $v$ as $E = \{v \,|\, |v|_2 \le 1\}$ and $\tilde{C}_t = \{v \,|\, \alpha_t - g_t^\top P_t^{1/2} v \le 0\}$. We can then rewrite $E_t$ and $C_t$ as $E_t = \{z = w_t + P_t^{1/2} v \,|\, v \in E\}$ and $C_t = \{z = w_t + P_t^{1/2} v \,|\, v \in \tilde{C}_t\}$. We thus have

$$\mathrm{vol}(E_t \cap C_t) = |P_t|^{1/2}\, \mathrm{vol}(E \cap \tilde{C}_t)$$

Figure 3 shows an example of the intersection between the unit ball $E$ and the half space $\alpha_t - g_t^\top P_t^{1/2} v \le 0$. Note that $P_t^{1/2} g_t$ is a unit vector because $[P_t^{1/2} g_t]^\top P_t^{1/2} g_t = g_t^\top P_t g_t = 1$. Using the symmetry argument, the new ellipsoid in the transformed space, denoted by $\tilde{E}_{t+1} = \{v \,|\, (v - v_0)^\top Q^{-1} (v - v_0) \le 1\}$, should have its center $v_0$ moved along the direction of $P_t^{1/2} g_t$. We denote by $\tau$ the distance between the center of $\tilde{E}_{t+1}$ and the hyper-plane bounding $\tilde{C}_t$.
As shown in Figure 3, the center $v_0$ can be written as

$$v_0 = (\alpha_t + \tau)\, P_t^{1/2} g_t \quad (19)$$

Furthermore, based on the argument of symmetry, the matrix $Q$ of the ellipsoid $\tilde{E}_{t+1}$ should be isometric along all directions except for $P_t^{1/2} g_t$, and can therefore be written as

$$Q = \mu^2 I + \left((1 - \alpha_t - \tau)^2 - \mu^2\right) P_t^{1/2} g_t g_t^\top P_t^{1/2} \quad (20)$$

where $1 - \alpha_t - \tau$ is the length of the axle along $P_t^{1/2} g_t$ and $\mu > 0$ is the length of the other axles. Using the transform $z = w_t + P_t^{1/2} v$, we can express $E_{t+1}$ in terms of both $v_0$ and $Q$, which leads to the updating equations in Theorem 1. To ensure $E_{t+1} \supseteq E_t \cap C_t$, we enforce the point $e$ in Figure 3, i.e., an intersection point between $E$ and the hyper-plane bounding $\tilde{C}_t$, to lie on the surface of the new ellipsoid $\tilde{E}_{t+1}$. Note that if we use the center of $\tilde{E}_{t+1}$ as the origin and its axles as bases, the coordinates of the point $e$ become $(\tau, \sqrt{(1-\alpha_t^2)/(d-1)}, \ldots, \sqrt{(1-\alpha_t^2)/(d-1)})$. Since $e \in \tilde{E}_{t+1}$, we have

$$\frac{\tau^2}{(1 - \alpha_t - \tau)^2} + (d-1)\,\frac{(1-\alpha_t^2)/(d-1)}{\mu^2} \le 1,$$

which is exactly the constraint (8).

Figure 3. Illustration of updating ellipsoids. (Figure not reproduced here.)

Appendix B: Proof of Theorem 3

We first show properties of $P_t$ and $\alpha_t$ that are useful for the proof of the mistake bound.

Lemma 3. We have the following properties for $P_t$:

$$g_t^\top P_t g_t = 1 \quad (21)$$
$$x_t^\top P_t x_t \le \prod_{i=1}^{t-1}(1 - \alpha_i^2)\; x_t^\top P_1 x_t \quad (22)$$
$$P_{t+1}^{-1} = \frac{1}{1 - \alpha_t^2}\, P_t^{-1} + \frac{2\alpha_t}{(1 - \alpha_t^2)(1 - \alpha_t)}\, g_t g_t^\top \quad (23)$$

Proof. The property in (21) can be easily verified by using the expressions for $g_t$ and $P_t$. The property in (22) follows from the fact that

$$P_t \preceq (1 - \alpha_{t-1}^2)\, P_{t-1} \preceq \prod_{i=1}^{t-1}(1 - \alpha_i^2)\, P_1$$

The property in (23) follows from the fact that

$$\left(P_t - \frac{2\alpha_t(1-\alpha_t)}{1-\alpha_t^2}\, P_t g_t g_t^\top P_t\right)^{-1} = P_t^{-1/2}\left(I + \frac{2\alpha_t}{1-\alpha_t}\, P_t^{1/2} g_t g_t^\top P_t^{1/2}\right) P_t^{-1/2} = P_t^{-1} + \frac{2\alpha_t}{1-\alpha_t}\, g_t g_t^\top$$

Lemma 4. We have the following properties for $\alpha_t$:

$$\frac{a\gamma}{\sqrt{\lambda_{\max}(P_1)}}\, \prod_{i=1}^{t-1}(1 - \alpha_i^2)^{-1/2} \le \alpha_t \le 1 \quad (24)$$

Proof. Since there exists a classifier $u$ that classifies any labeled example with margin $\gamma$, we have $u \in E_t \cap \{z \,|\, \alpha_t - g_t^\top (z - w_t) \le 0\}$. Therefore,

$$\alpha_t \le g_t^\top (u - w_t) = (P_t^{1/2} g_t)^\top (P_t^{-1/2}(u - w_t)) \le |P_t^{1/2} g_t|\, |P_t^{-1/2}(u - w_t)| \le 1,$$

which proves the upper bound for $\alpha_t$. The lower bound follows from the fact that

$$\alpha_t = \frac{a\gamma - y_t w_t^\top x_t}{\sqrt{x_t^\top P_t x_t}} \ge \frac{a\gamma}{\sqrt{x_t^\top P_t x_t}},$$

the property of $P_t$ in (22), and the fact that $\max_{|x|_2 \le 1} x^\top P_1 x \le \lambda_{\max}(P_1)$.

Using the results in the above lemmas, we now prove the mistake bound stated in Theorem 3. After receiving $T$ misclassified examples, the volume of the ellipsoid, denoted by $\mathrm{vol}(E_T)$, is reduced to

$$\mathrm{vol}(E_T) = \mathrm{vol}(E_1) \prod_{t=1}^{T-1}(1 - \alpha_t^2)^{(d-1)/2}(1 - \alpha_t) \le |P_1|^{1/2}\left(1 - \frac{a\gamma}{\lambda_{\max}(P_1)}\right)^{d(T-1)/2} = (1 + \gamma(1-a))^{d/2}\left(1 - \frac{a\gamma}{1 + \gamma(1-a)}\right)^{d(T-1)/2}$$

(up to the common volume of the $d$-dimensional unit ball, which cancels in the next step). Since $E_T \supseteq B$, we have

$$(1 + \gamma(1-a))^{d/2}\left(1 - \frac{a\gamma}{1 + \gamma(1-a)}\right)^{d(T-1)/2} \ge (\gamma(1-a))^d$$

Thus, $T$ is upper bounded by

$$T \le \frac{2\log(1-a) + 2\log\gamma - \log(1 + \gamma(1-a))}{\log\left(1 - \frac{a\gamma}{1 + \gamma(1-a)}\right)} + 1$$
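As a quick numerical sanity check of the volume factor in (9) and (12) used above (our own snippet, with randomly generated $P_t$ and $x_t$), the determinant ratio of the shape matrix updated by (11) equals $(1-\alpha_t^2)^{d-1}(1-\alpha_t)^2$, the square of the factor in (12):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
A = rng.normal(size=(d, d))
P = A @ A.T + d * np.eye(d)                                 # a random positive definite P_t
x = rng.normal(size=d); x /= np.linalg.norm(x)
y, w, gamma, a = 1.0, np.zeros(d), 0.3, 0.5

s = np.sqrt(x @ P @ x)
alpha = (a * gamma - y * (w @ x)) / s                       # Eq. (5); here w = 0, so alpha = a*gamma/s
g = y * x / s
Pg = P @ g
P_next = (1 - alpha**2) * P - 2 * alpha * (1 - alpha) * np.outer(Pg, Pg)   # Eq. (11)

det_ratio = np.linalg.det(P_next) / np.linalg.det(P)
predicted = (1 - alpha**2) ** (d - 1) * (1 - alpha) ** 2    # square of Eq. (12), since vol ~ sqrt(det)
print(np.isclose(det_ratio, predicted))                     # True
```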
Appendix C: Proof of Theorem 4

We first simplify the inequality in Lemma 2. Since $y_t u^\top x_t \ge \gamma - l_t(u)$, we have $u^\top g_t \ge (\gamma - l_t(u))/\sqrt{x_t^\top P_t x_t}$, and therefore

$$q_{t+1} \le (1 - c_t)\, q_t + \lambda_t^2 + c_t (u^\top g_t)^2 - 2\lambda_t\, u^\top g_t$$
$$\le (1 - c_t)\, q_t + \lambda_t^2 + c_t (u^\top g_t)^2 - \frac{2\gamma(\gamma - l_t(u))}{x_t^\top P_t x_t}$$
$$= (1 - c_t)\, q_t - \frac{\gamma^2 - c_t |u^\top x_t|^2}{x_t^\top P_t x_t} + \frac{2\gamma\, l_t(u)}{x_t^\top P_t x_t}$$
$$\le (1 - c_t)\, q_t - \frac{\gamma^2 (1 - c_t)}{x_t^\top P_t x_t} + \frac{2\gamma\, l_t(u)}{x_t^\top P_t x_t}$$
$$\le (1 - c_t)\, q_t - \gamma^2 \prod_{i=1}^t (1 - c_i) + 2\, l_t(u)$$

where the last step follows from the facts that $P_t \preceq \prod_{i=1}^{t-1}(1 - c_i)^{-1} P_1$ and $x_t^\top P_1 x_t \le 1$, so that $x_t^\top P_t x_t \le \prod_{i=1}^{t-1}(1 - c_i)^{-1}$. We then put the inequalities of all iterations together, which gives

$$q_{t+1} \le \prod_{i=1}^t (1 - c_i) - \gamma^2\, t \prod_{i=1}^t (1 - c_i) + 2 \sum_{i=1}^t l_i(u) \prod_{j=i+1}^t (1 - c_j)$$

Since $q_{t+1} \ge 0$, we obtain

$$\gamma^2\, t \le 1 + 2 \sum_{i=1}^t l_i(u) \prod_{j=1}^i (1 - c_j)^{-1} \le \frac{1-b}{1-b-c}\left(1 + 2 \sum_{i=1}^t l_i(u)\right)$$

where the last step uses $\prod_{j=1}^i (1 - c_j) \ge 1 - \sum_{j=1}^i c_j \ge 1 - \frac{c}{1-b} = \frac{1-b-c}{1-b}$ for $c_j = c\, b^{j-1}$. Hence the number of misclassified examples after training with $T$ examples, denoted by $M$, is upper bounded by

$$M \le \frac{1}{\gamma^2}\,\frac{1-b}{1-b-c} + \frac{2}{\gamma} \sum_{i=1}^T l_i(u)$$

References

Amit, Y., Shalev-Shwartz, S., & Singer, Y. (2007). Online classification for complex problems using simultaneous projections. In Advances in Neural Information Processing Systems (pp. 17-24).

Crammer, K. (2004). Online learning of complex categorial problems. Doctoral dissertation, Hebrew University of Jerusalem.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. J. Mach. Learn. Res., 7, 551-585.

Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2, 265-292.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. J. Mach. Learn. Res., 3, 951-991.

Fink, M., Shalev-Shwartz, S., Singer, Y., & Ullman, S. (2006). Online multiclass learning by interclass hypothesis sharing. Proc. Int. Conf. on Mach. Learn. (pp. 313-320).

Gentile, C. (2002). A new approximate maximal margin classification algorithm. J. Mach. Learn. Res., 2, 213-242.

Gentile, C., & Warmuth, M. (1998). Linear hinge loss and average margin. In Advances in Neural Information Processing Systems (pp. 225-231).

Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. Mach. Learn., 43, 173-210.

Herbster, M., Pontil, M., & Wainer, L. (2005). Online learning over graphs. Proc. Int. Conf. on Mach. Learn. (pp. 305-312).

Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132, 1-64.

Kivinen, J., Smola, A. J., & Williamson, R. C. (2002). Online learning with kernels. IEEE Transactions on Signal Processing, 52, 2165-2176.

Li, Y., & Long, P. M. (2002). The relaxed online maximum margin algorithm. Mach. Learn., 46, 361-387.

Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation, U. C. Santa Cruz.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-407.

Shalev-Shwartz, S., & Singer, Y. (2006). Online learning meets optimization in the dual. Proc. of the 19th Annual Conf. on Learning Theory (pp. 423-437).

Shor, N. (1977). Cut-off method with space extension in convex programming problems. Cybernetics, 12, 94-94.