Learning Prediction Suffix Trees with Winnow

Nikos Karampatziakis (nk@cs.cornell.edu)
Dexter Kozen (kozen@cs.cornell.edu)
Department of Computer Science, Cornell University, Ithaca, NY 14853 USA

Abstract

Prediction suffix trees (PSTs) are a popular tool for modeling sequences and have been successfully applied in many domains, such as compression and language modeling. In this work we adapt the well-studied Winnow algorithm to the task of learning PSTs. The proposed algorithm automatically grows the tree, so that it provably remains competitive with any fixed PST determined in hindsight. At the same time we prove that the depth of the tree grows only logarithmically with the number of mistakes made by the algorithm. Finally, we empirically demonstrate its effectiveness in two different tasks.

1. Introduction

Prediction suffix trees are a well-studied and compact model for problems such as temporal classification and probabilistic modeling of sequences (Buhlmann & Wyner, 1999; Helmbold & Schapire, 1997; Pereira & Singer, 1999; Ron et al., 1996; Willems et al., 1995). Different variants of PSTs are also called context tree weighting (Willems et al., 1995) and variable length Markov models (Buhlmann & Wyner, 1999). PSTs operate in a setting similar to online learning. The model observes symbols from a sequence, one at a time, and makes a prediction about the next symbol based on the symbols it has observed so far. PSTs typically use only a few recently observed symbols, which are called the context of the prediction. In this sense they make a Markovian assumption but, unlike most other Markov models, the number of symbols that are used to predict the next symbol depends on the specific context in which the prediction is made.

In this work, we show how to learn a PST by adapting the Winnow algorithm. Even though our theoretical results are somewhat similar to (Dekel et al., 2004), where the perceptron algorithm is used for the same task, we empirically show that the proposed algorithm consistently obtains lower error rates and grows smaller trees than the one in (Dekel et al., 2004).

Our motivating application is monitoring processes in a computer system. Each process produces a sequence of system calls which request different services from the operating system. Our task is to model this sequence and maintain a compact profile of the application. In this setting, we expect a multiplicative update algorithm to be more appropriate than the algorithm in (Dekel et al., 2004) for two reasons. First, complex applications are likely to exhibit different behaviors at various points during their execution; a multiplicative algorithm can quickly modify its model to reflect this. Second, multiplicative algorithms cope better with many irrelevant attributes, and this will turn out to be the case when we formalize our problem.

The rest of the paper is organized as follows. In section 2 we review the balanced version of Winnow, which forms the basis of our algorithm. Our main results are presented in section 3. In section 4 we present some empirical results and in section 5 we discuss related approaches. Section 6 states our conclusions.

2. Background

We start by describing Balanced Winnow (Littlestone, 1989). One way to express it is shown in Algorithm 1.
Algorithm 1  Balanced Winnow
1: $\theta_1 \leftarrow 0$
2: for $t = 1, 2, \ldots, T$ do
3:   $w_{t,i} \leftarrow e^{\theta_{t,i}} / \sum_{j=1}^d e^{\theta_{t,j}}$
4:   $\hat{y}_t \leftarrow \langle w_t, x_t\rangle$
5:   if $y_t \hat{y}_t \le 0$ then
6:     $\theta_{t+1} \leftarrow \theta_t + \eta y_t x_t$
7:   else
8:     $\theta_{t+1} \leftarrow \theta_t$
9:   end if
10: end for

At each round $t$, the algorithm updates a vector of parameters $\theta_t \in \mathbb{R}^d$ and then uses it to construct its weight vector $w_t$. A prediction is then obtained as the inner product of $w_t$ and the features $x_t$. If the algorithm makes a mistake, it updates its parameter vector by adding to it a scaled version of the quantity $y_t x_t$, $y_t \in \{-1,+1\}$. This type of update is similar to the perceptron (Rosenblatt, 1958); the perceptron, however, would use $\theta_t$ directly instead of $w_t$ for prediction. Selecting a value for the parameter $\eta$, known as the learning rate, is not trivial, and we will always set it at the very end so that it optimizes the quantities of interest.

The main characteristic of Balanced Winnow is that the features have the form $x_t = [1,-1] \otimes x_t^+$, where $x_t^+$ is the original vector of features and $u \otimes v$ is the Kronecker product of $u$ and $v$. This form of $x_t$ will later allow us to sparsify Balanced Winnow.

2.1. Mistake Bound for Balanced Winnow

In this section we briefly discuss a bound on the number of mistakes made by Balanced Winnow. Throughout the paper we will assume, or construct our features such that, $\|x_t\|_\infty \le 1$ for all $t$, to make the bounds simpler. In general, if the labels $y_t$ are chosen uniformly at random or even adversarially, no algorithm can achieve any meaningful bound. Therefore, we assume that there exists a vector $u$ such that for all $t$, $y_t\langle u, x_t\rangle \ge \gamma$ for some $\gamma > 0$.[1] The magnitude of $\gamma$ quantifies how hard the problem is. Furthermore, since $w_t$ is always inside the $d$-dimensional probability simplex, we also assume the same is true for $u$.

[1] This assumption can be relaxed. See also theorem 2.

The general strategy for proving bounds for online learning algorithms is to define some measure of progress $\Phi(w_t)$ towards $u$ and show that with each mistake the algorithm makes, $w_t$ comes closer to $u$. A natural measure of progress is the relative entropy between $w_t$ and $u$:
$$\Phi(w_t) = D(u\|w_t) = \sum_{i=1}^d u_i \log\frac{u_i}{w_{t,i}},$$
which is always nonnegative. Then we can show the following.

Theorem 1. Let $\{(x_t, y_t)\}$, $t = 1, \ldots, T$, be a sequence of input-output pairs with $x_t \in \mathbb{R}^d$ and $y_t \in \{-1,1\}$. Let $u$ be a vector such that $u_i \ge 0$ for all $i$, $\sum_i u_i = 1$, and $y_t\langle u, x_t\rangle \ge \gamma$ for some $\gamma > 0$. Then Balanced Winnow will make at most $\frac{2\log d}{\gamma^2}$ mistakes.

Proof. We present a sketch here because this has already been proved elsewhere in more general form (e.g. (Shalev-Shwartz, 2007)), and the proof of our main result in theorem 2 has a similar structure. The proof proceeds by upper bounding the initial potential and lower bounding $\Phi(w_t) - \Phi(w_{t+1})$ using lemma 4 and the fact that when a mistake is made $y_t\langle w_t, x_t\rangle \le 0$. Combining the upper and lower bounds leads to
$$M\left(\eta\gamma - \log\cosh(\eta)\right) \le \log d,$$
where $M$ is the number of mistakes made. This can be turned into a bound on $M$ using lemma 5. The optimal value of $\eta$ is $\gamma$, and this yields the bound in the theorem.
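To make the update above concrete, here is a minimal NumPy sketch of Algorithm 1. The streaming interface (`examples` as pairs of original features and labels), the default learning rate, and the explicit feature doubling are choices made for this sketch, not part of the paper.

```python
import numpy as np

def balanced_winnow(examples, eta=0.5):
    """Minimal sketch of Algorithm 1 (Balanced Winnow).

    examples: iterable of (x_plus, y) pairs, where x_plus is a 1-D float array
    with entries in [-1, 1] and y is a label in {-1, +1}.
    Returns the number of mistakes made on the stream.
    """
    theta = None
    mistakes = 0
    for x_plus, y in examples:
        x = np.concatenate([x_plus, -x_plus])   # x_t = [1, -1] Kronecker x_t^+
        if theta is None:
            theta = np.zeros_like(x)            # line 1: theta_1 <- 0
        w = np.exp(theta)
        w /= w.sum()                            # line 3: normalized weights
        y_hat = float(np.dot(w, x))             # line 4: prediction
        if y * y_hat <= 0:                      # line 5: mistake
            theta = theta + eta * y * x         # line 6: update in parameter space
            mistakes += 1
    return mistakes
```

Because line 6 adds to the exponent of each weight, the induced change on $w_t$ is multiplicative, which is what gives Winnow its robustness to many irrelevant attributes.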
3. Learning Prediction Suffix Trees

PSTs are popular models for sequential prediction. A PST is a tree data structure that can compactly store a set $S$ of strings and all their suffixes. Each $s \in S$ corresponds to a unique node in the tree. In this node we keep a model for predicting the next symbol when $s$ is a suffix of the current sequence. A PST makes a prediction using a weighted sum of the models in the nodes that match the suffixes of the current sequence. Section 3.2 gives a more detailed description.

The advantage of PSTs compared to other Markov models is that the number of symbols used to predict the next symbol depends on the specific context in which the prediction is made. But how much context should be used for each prediction? When learning a PST one is faced with the following dilemma. If we consider contexts with more symbols than necessary, we must estimate more parameters and we are likely to hurt the generalization performance of the algorithm. On the other hand, underestimating the right amount of context will typically lead to many mistakes.

Recently, (Dekel et al., 2004) proposed an algorithm that learns a PST of appropriate size by balancing the above tradeoff. The algorithm in (Dekel et al., 2004) uses the perceptron algorithm to learn a PST. The basic idea is that declining to grow the tree is like introducing a noise term in the perceptron update. We use the same idea here for Balanced Winnow, but many details are different, especially how much noise is tolerated before the tree is grown. Before we discuss how Balanced Winnow can be adapted to work for PSTs, we digress to discuss how to modify Balanced Winnow so that it effectively ignores some of its inputs.

3.1. Sparse Balanced Winnow

Algorithm 1 assigns a nonzero weight to every feature, even if some of these weights may be exponentially small. Here we explain how Balanced Winnow can be modified so that its predictions ignore the values of some attributes. Then we can say that the algorithm has assigned zero weight to those attributes and our hypothesis is sparse. This may at first seem hard, given that the elements of the weight vector are by construction always positive and that the proof of theorem 1 hinges on the use of relative entropy, which becomes infinite if one of the entries of the weight vector is zero. However, if one considers the form of $x_t = [1,-1] \otimes x_t^+$ and notices that
$$\langle w_t, x_t\rangle = \langle w_t^+, x_t^+\rangle - \langle w_t^-, x_t^+\rangle = \langle w_t^+ - w_t^-, x_t^+\rangle,$$
where $w_t^+$ corresponds to the first half of the weight vector $w_t$ and $w_t^-$ corresponds to its second half, then if $w_{t,i}^+ = w_{t,i}^-$ any decision of the form $\langle w_t, x_t\rangle$ will effectively ignore attribute $x_{t,i}^+$. We can get better insight into how the algorithm operates with inputs of the form $[1,-1] \otimes x_t^+$ if we rewrite the parameters $\theta_t$ in the same manner as the weight vector, i.e. $\theta_t = [\theta_t^+, \theta_t^-]$. The following lemma shows the relationship between the two vectors $\theta_t^+$ and $\theta_t^-$.

Lemma 1. In Balanced Winnow, for all $t$, $\theta_t^- = -\theta_t^+$.

Proof. We proceed by induction. For $t = 1$ we have $\theta_1 = 0$, so $\theta_1^+ = \theta_1^- = 0$ and the claim is true. If at time $t+1$ there was no mistake, the claim remains true. If there was a mistake, then the parameters were updated as $\theta_{t+1}^+ = \theta_t^+ + \eta y_t x_t^+$ and $\theta_{t+1}^- = \theta_t^- + \eta y_t(-x_t^+)$. Using the induction hypothesis, the latter update is $\theta_{t+1}^- = \theta_t^- - \eta y_t x_t^+ = -\theta_t^+ - \eta y_t x_t^+ = -\theta_{t+1}^+$.

By lemma 1, the decision at time $t$ can be written as
$$\hat{y}_t = \langle w_t^+ - w_t^-, x_t^+\rangle = \frac{\sum_{i=1}^d x_{t,i}^+ \sinh(\theta_{t,i}^+)}{\sum_{j=1}^d \cosh(\theta_{t,j}^+)},$$
where we have used the definitions of the hyperbolic sine and cosine.[2] For the purposes of decisions, the quantity $\sum_{j=1}^d \cosh(\theta_{t,j}^+)$ is a positive constant, so it will not affect the sign of $\hat{y}_t$. Let $g_{t,i} = \sinh(\theta_{t,i}^+)$. The attribute $x_{t,i}^+$ is ignored when $g_{t,i}$ is zero, and since the only root of the hyperbolic sine is at zero, $g_{t,i} = 0$ if and only if $\theta_{t,i}^+ = 0$.

[2] $\sinh(x) = \frac{e^x - e^{-x}}{2}$ and $\cosh(x) = \frac{e^x + e^{-x}}{2}$.

The strategy for deriving a multiplicative update algorithm whose hypothesis is sparse is to make sure that some of the parameters remain at zero, by introducing some noise that cancels the update for these specific parameters. When analyzing the algorithm, we will have to show that the progress made with each mistake overshadows the effect of the noise. The algorithm will be phrased in terms of the normalized weights to maintain the correspondence with Balanced Winnow. In a practical implementation, one should update the values of the parameters $\theta_t$ and use the unnormalized weights $g_{t,i}$ to make predictions.
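The following snippet illustrates the identity above numerically: with $\theta = [\theta^+, -\theta^+]$ as in Lemma 1, the normalized prediction $\langle w^+ - w^-, x^+\rangle$ and the unnormalized score $\sum_i \sinh(\theta_i^+)x_i^+$ differ only by the positive factor $\sum_j \cosh(\theta_j^+)$, so they always have the same sign and both ignore attributes with $\theta_i^+ = 0$. The dimensions and random values are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_plus = rng.normal(size=6)
theta_plus[[1, 4]] = 0.0                 # attributes the hypothesis should ignore
x_plus = rng.uniform(-1.0, 1.0, size=6)

# Normalized Balanced Winnow prediction with theta = [theta_plus, -theta_plus].
theta = np.concatenate([theta_plus, -theta_plus])
w = np.exp(theta) / np.exp(theta).sum()
pred_normalized = np.dot(w[:6] - w[6:], x_plus)

# Unnormalized prediction using g_i = sinh(theta_plus_i).
pred_sparse = np.dot(np.sinh(theta_plus), x_plus)

# The two scores differ only by the positive factor sum_j cosh(theta_plus_j).
assert np.isclose(pred_normalized / pred_sparse, 1.0 / np.cosh(theta_plus).sum())
assert np.sign(pred_normalized) == np.sign(pred_sparse)
```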
3.2. Winnow for Prediction Suffix Trees

Our formal setup is as follows. We wish to predict the items of a sequence $y_1, y_2, \ldots, y_T$, $y_i \in \Sigma$, one at a time. For now, we assume that $\Sigma = \{-1, 1\}$; we discuss extensions to larger alphabets in section 4. Let $|y|$ be the length of a sequence $y$ and let $y_i^j$ denote the subsequence $y_i, \ldots, y_j$.

The prediction for $y_t$ will be based on a suffix of $y_1^{t-1}$, typically one that contains only the last few symbols of $y_1^{t-1}$. We start with the preconception that symbols near $t$ should be more important for the prediction task than symbols away from $t$ (Helmbold & Schapire, 1997; Willems et al., 1995). This can be encoded in the features, which will be of the form $x_t = [1,-1] \otimes x_t^+$ where
$$x_{t,s}^+ = \begin{cases} (1-\epsilon)^{|s|} & \text{if } s = y_{t-i}^{t-1} \text{ for some } i = 1, \ldots, t-1 \\ 0 & \text{otherwise.} \end{cases}$$
Therefore $x_t$ will conceptually have at most $T(T-1)$ dimensions, twice as many as the number of distinct substrings of $y_1^T$. Here and below the vector $x_t^+$ is indexed by strings: the notation $x_{t,s}^+$ simply means $x_{t,\sigma(s)}^+$, where $\sigma(s)$ is the position of $s$ in the lexicographic order of $Y_T$, the set of all substrings of $y_1^T$. Another useful function is the inverse of $\sigma(\cdot)$: $s(i)$ is the string at position $i$ in the aforementioned lexicographic order. Furthermore, we extend $s(\cdot)$ to be a periodic function (with period $|Y_T|$). This allows us to handle indices from the extended features $x_t$, so that we can write, for example, $|x_{t,i}| = x_{t,s(i)}^+$.

As before, the algorithm will have to learn a vector of parameters $\theta_t = [\theta_t^+, \theta_t^-]$ while keeping its support, the strings for which the corresponding entries are nonzero, small. The decisions of the algorithm will be
$$\hat{y}_t = \sum_i g_{t,i}\, x_{t,i}^+,$$
where $g_{t,i} = \sinh(\theta_{t,i}^+)$ and $i$ ranges over the support of $\theta_t$.

The set $A$ that contains the support of $\theta$, every suffix of every element in the support, and the empty string can be viewed as a tree. This tree is constructed in the following way. Every element of $A$ is a node in the tree. The root corresponds to the empty string. Let $y \in \Sigma$ and $u \in \Sigma^*$. If $v = yu \in A$, then $u \in A$ by the definition of $A$, and $v$ is a direct child of $u$ in the tree, with the link from $u$ to $v$ labeled with $y$. In this view, $\theta^+$, and consequently $g$, assigns weights to the nodes of the tree, and the predictions of the online algorithm amount to taking a weighted sum of the values in the nodes of a path in the tree. This path starts from the root and follows the links that correspond to the symbols $y_{t-1}, y_{t-2}, \ldots$, until we reach a leaf or the next child does not exist. Figure 1 shows a PST.

[Figure 1: tree diagram omitted.]
Figure 1. A PST whose support consists of the strings $(-1)$, $(-1,-1,+1)$, $(+1,-1,+1)$ and all their suffixes. If $\epsilon = 1/4$ and $y_{t-2}^t = -1,-1,+1$ then the prediction will be $\hat{y}_{t+1} = \mathrm{sign}\!\left(\frac{3}{4}\sinh(-2) + \frac{9}{16}\sinh(-\frac{1}{2}) + \frac{27}{64}\sinh(3)\right) = +1$. The algorithm is able to overcome the prior belief that a longer suffix is less relevant by assigning its node a large value.
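The example in the caption of Figure 1 can be reproduced with a few lines of code. The sketch below keeps the $g$-values of the three nodes on the matched path in a dictionary keyed by suffix (written oldest symbol first) and walks down that path; the dictionary contents are just the node values used in the caption, and the function name is an illustrative choice.

```python
import math

# g-values (sinh of the node parameters) for the nodes on the matched path of Figure 1.
g = {
    (+1,):         math.sinh(-2.0),
    (-1, +1):      math.sinh(-0.5),
    (-1, -1, +1):  math.sinh(3.0),
}

def predict_next(history, g, eps=0.25):
    """Weighted sum over the suffixes of `history` that exist in the tree."""
    score = 0.0
    for length in range(1, len(history) + 1):
        suffix = tuple(history[-length:])
        if suffix not in g:            # the path stops at the first missing node
            break
        score += (1 - eps) ** length * g[suffix]
    return +1 if score >= 0 else -1

# sign(3/4*sinh(-2) + 9/16*sinh(-1/2) + 27/64*sinh(3)) = +1
print(predict_next([-1, -1, +1], g))   # prints 1
```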
Balanced Winnow starts with $\theta_1 = 0$; therefore initially the support of $\theta$ is the empty string and the corresponding tree has only one node. As learning progresses, each mistake the algorithm makes will cause some of the parameters to be updated to nonzero values. Hence we expect the support of $\theta$ to increase. In fact we will require that the support of $\theta_t$ is a subset of the support of $\theta_{t+1}$. Thus, the algorithm can be thought of as growing the tree that corresponds to $A$.[3]

[3] Even if an update causes a parameter to become zero after it has taken a nonzero value, we still keep the corresponding node in the tree.

If we just apply Balanced Winnow to the task of learning such a PST, we will need memory that is $O(T^2)$, because every distinct substring of $y_1^T$ will have an associated parameter. Hence, our strategy will be to maintain an adaptive bound $d_t$ on the length of the suffixes that we consider, which is the same as the depth up to which we grow the tree. The proposed algorithm modifies Balanced Winnow so that the parameter update is
$$\theta_{t+1,i} = \begin{cases} \theta_{t,i} + \eta y_t x_{t,i} & \text{if } |s(i)| \le d_t \\ \theta_{t,i} & \text{otherwise.} \end{cases}$$
This can be equivalently written as $\theta_{t+1} = \theta_t + \eta(y_t x_t + n_t)$, where $n_t$ is a noise vector that cancels some of the components of the update:
$$n_{t,i} = \begin{cases} -y_t x_{t,i} & \text{if } |s(i)| > d_t \\ 0 & \text{otherwise.} \end{cases}$$
Note that $\|n_t\|_\infty \le (1-\epsilon)^{(d_t+1)}$, a fact that we will use later in the derivation of the mistake bound. Let $h_t$ be the length of the maximal downward path from the root using the symbols $y_{t-1}, y_{t-2}, \ldots$. Clearly, if at any point our way of setting $d_t$ suggests that it should be less than $h_t$, there is no reason to let this happen, because we have already grown the tree beyond this point; letting $d_t < h_t$ would only make the norm of $n_t$ larger without any computational benefit. Therefore $d_t \ge h_t$, and then it is easy to prove the following.

Lemma 2. For all $t$ and $i$, either $\theta_{t,i} = 0$ or $n_{t,i} = 0$.

Proof. There are two cases: either $|s(i)| \le d_t$ or $|s(i)| > d_t$. If $|s(i)| \le d_t$ then $n_{t,i} = 0$ by definition. Now suppose $|s(i)| > d_t$ and $n_{t,i} \ne 0$; the latter requires $x_{t,i} \ne 0$, i.e. $s(i)$ is a suffix of $y_1^{t-1}$. If, in addition, $\theta_{t,i} \ne 0$, then this coordinate was updated at some earlier round $t' < t$, so $d_{t'} \ge |s(i)|$ and the node $s(i)$ was added to the tree at that round. Since the tree never shrinks, $s(i)$ is still in the tree, so $h_t \ge |s(i)| > d_t$, which contradicts $d_t \ge h_t$. Therefore $\theta_{t,i} = 0$ whenever $n_{t,i} \ne 0$.

Furthermore, the sum of the norms of the noise vectors will turn up in the proof of the mistake bound, therefore we would like to keep it bounded by a function of the number of mistakes. Let $J_t$ be the subset of rounds $1, \ldots, t$ in which a mistake was made and let $M_t = |J_t|$. Also let $P_t = \sum_{i \in J_t} (1-\epsilon)^{(d_i+1)}$ and $M_0 = P_0 = 0$. We will require that $d_t$ is the smallest integer such that[4]
$$P_t \le M_t^{2/3},$$
subject to $d_t \ge h_t$. We can then prove the following.

[4] It would also be possible to design the algorithm using the invariant $P_t \le \sqrt{M_t}$ as in (Dekel et al., 2004). Our choice of a greater upper bound on the noise seems to allow for superior performance in practice (cf. section 4).

Lemma 3. Setting $d_t = \max\{h_t, b(P_{t-1})\}$ ensures that for all $t$, $P_t \le M_t^{2/3}$, where
$$b(p) = \left\lceil \log_{1-\epsilon}\!\left( \sqrt[3]{p^3 + 2p^{3/2} + 1} - p \right) - 1 \right\rceil. \qquad (1)$$

Proof. We proceed by induction. For $t = 1$ the claim clearly holds. If there is no mistake on round $t$ then $P_t = P_{t-1}$ and $M_t = M_{t-1}$, so the claim holds. If there is a mistake then $P_t = P_{t-1} + (1-\epsilon)^{(d_t+1)}$ and $M_t = M_{t-1} + 1$, and it suffices to show that $P_t^3 \le P_{t-1}^3 + 2P_{t-1}^{3/2} + 1$, since by the induction hypothesis $P_{t-1}^3 + 2P_{t-1}^{3/2} + 1 \le (M_{t-1}+1)^2$. So we need
$$(P_{t-1} + z)^3 \le P_{t-1}^3 + 2P_{t-1}^{3/2} + 1,$$
where $z = (1-\epsilon)^{(d_t+1)} > 0$. The above can be written as
$$z^3 + 3P_{t-1}z^2 + 3P_{t-1}^2 z - 2P_{t-1}^{3/2} - 1 \le 0$$
and is valid when $z$ is at most the only real root of the cubic on the left-hand side:
$$z \le \sqrt[3]{P_{t-1}^3 + 2P_{t-1}^{3/2} + 1} - P_{t-1}.$$
Since $z = (1-\epsilon)^{(d_t+1)}$, we finish the proof by taking logarithms with base $1-\epsilon < 1$:
$$d_t \ge \log_{1-\epsilon}\!\left( \sqrt[3]{P_{t-1}^3 + 2P_{t-1}^{3/2} + 1} - P_{t-1} \right) - 1.$$
The statement of the lemma simply enforces that $d_t$ is an integer and that $d_t \ge h_t$.

Algorithm 2  Balanced Winnow for PSTs
1: $P_0 \leftarrow 0$
2: $A_1 \leftarrow$ { the empty string }
3: $\theta_1 \leftarrow 0$
4: for $t = 1, 2, \ldots, T$ do
5:   $h_t \leftarrow \max\{j : y_{t-j}^{t-1} \in A_t\}$
6:   $d_t \leftarrow \max\{h_t, b(P_{t-1})\}$  {Lemma 3}
7:   Compute $x_t$ from $y_1^{t-1}$
8:   $w_{t,i} \leftarrow e^{\theta_{t,i}} / \sum_j e^{\theta_{t,j}}$
9:   $\hat{y}_t \leftarrow \langle w_t, x_t\rangle$
10:  if $y_t\hat{y}_t \le 0$ then
11:    $P_t \leftarrow P_{t-1} + (1-\epsilon)^{(d_t+1)}$
12:    $A_{t+1} \leftarrow A_t \cup \{y_{t-i}^{t-1} : 1 \le i \le d_t\}$
13:    $\theta_{t+1,i} \leftarrow \theta_{t,i} + \eta y_t x_{t,i}$ if $|s(i)| \le d_t$; otherwise $\theta_{t+1,i} \leftarrow \theta_{t,i}$
14:  else
15:    $P_t \leftarrow P_{t-1}$
16:    $A_{t+1} \leftarrow A_t$
17:    $\theta_{t+1} \leftarrow \theta_t$
18:  end if
19: end for

We can now state how everything is put together in Algorithm 2. The set of suffixes $A_t$ can be implemented as a tree, as was described above. The algorithm starts with an empty tree and in each round it predicts the next symbol using the parameters in the tree. If a mistake is made and $d_t > h_t$, we grow the tree in line 12. Finally, we update the parameter values for the nodes in the tree and repeat.
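For concreteness, here is a compact Python sketch of Algorithm 2 for the binary alphabet. It keeps the $\theta^+$ values in a dictionary keyed by suffix tuples (this doubles as the tree $A_t$), computes $b(p)$ from equation (1), and applies the pruned update of line 13. The streaming interface and default parameter values are choices made for this sketch.

```python
import math

def b(p, eps):
    """Smallest depth that keeps P_t <= M_t^(2/3); equation (1)."""
    root = (p ** 3 + 2 * p ** 1.5 + 1) ** (1.0 / 3.0) - p
    return math.ceil(math.log(root, 1 - eps) - 1)

def pst_winnow(sequence, eta=0.1, eps=1 - 2 ** (-1.0 / 3.0)):
    """Sketch of Algorithm 2 on a list of symbols in {-1, +1}."""
    theta = {(): 0.0}          # suffix tuple -> theta^+ value; () is the root
    P, mistakes = 0.0, 0
    for t, y in enumerate(sequence):
        history = sequence[:t]
        # h_t: length of the longest suffix of the history already in the tree.
        h = 0
        while h < len(history) and tuple(history[len(history) - h - 1:]) in theta:
            h += 1
        d = max(h, b(P, eps))                              # line 6 (Lemma 3)
        # Prediction: sum of sinh(theta) over matched suffixes, weighted by (1-eps)^|s|.
        score = sum((1 - eps) ** k * math.sinh(theta[tuple(history[len(history) - k:])])
                    for k in range(1, h + 1))
        if y * score <= 0:                                 # line 10: mistake
            mistakes += 1
            P += (1 - eps) ** (d + 1)                      # line 11
            for k in range(1, min(d, len(history)) + 1):
                suffix = tuple(history[len(history) - k:])
                theta.setdefault(suffix, 0.0)              # line 12: grow the tree
                theta[suffix] += eta * y * (1 - eps) ** k  # line 13: pruned update
    return mistakes, theta
```

Run on a short sequence, e.g. `pst_winnow([1, -1, 1, -1, 1, -1, 1, -1])`, the returned dictionary contains exactly the suffixes the algorithm decided to keep.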
In the following theorem we show how the number of mistakes of this algorithm compares against the performance of the optimal tree in hindsight. This is measured by the cumulative $\gamma$-hinge loss of the optimal vector $u$. The $\gamma$-hinge loss that $u$ attains at round $t$ is $L_t = \max\{0, \gamma - y_t\langle u, x_t\rangle\}$. Before we state our result, we prove two useful lemmas.

Lemma 4. For all $x \in [-1,1]$ and $\eta > 0$ the following holds: $e^{\eta x} \le \cosh(\eta) + \sinh(\eta)\,x$.

Proof. First note that $e^{-\eta} = \cosh(\eta) - \sinh(\eta)$ and $e^{\eta} = \cosh(\eta) + \sinh(\eta)$. The lemma follows from the convexity of $e^x$.

Lemma 5. For all $\eta > 0$, $\log\cosh(\eta) \le \frac{\eta^2}{2}$.

Proof. The bound follows by noticing that it is equivalent to $\int_0^\eta\!\!\int_0^x \left(1 - \tanh^2(u)\right) du\, dx \le \int_0^\eta\!\!\int_0^x 1\, du\, dx$.
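A quick numerical check of the two inequalities on a grid (an illustration, not a proof):

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 41)
for eta in np.linspace(0.01, 3.0, 50):
    # Lemma 4: exp(eta * x) <= cosh(eta) + sinh(eta) * x for x in [-1, 1]
    assert np.all(np.exp(eta * xs) <= np.cosh(eta) + np.sinh(eta) * xs + 1e-12)
    # Lemma 5: log cosh(eta) <= eta^2 / 2
    assert np.log(np.cosh(eta)) <= eta ** 2 / 2 + 1e-12
print("Lemmas 4 and 5 hold on the grid.")
```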
Our main result is the following theorem.

Theorem 2. Let $y_1, y_2, \ldots, y_T$ be a sequence of symbols from $\Sigma = \{-1,+1\}$. Let $u$ be a vector such that $u_i \ge 0$ for all $i$, $\sum_i u_i = 1$, and $\sum_t L_t = L$. Then Algorithm 2 will make at most
$$\max\left\{ \frac{2L}{\gamma} + \frac{8\log T}{\gamma^2},\ \frac{64}{\gamma^3} \right\}$$
mistakes.

Proof. We will use the relative entropy of $u$ and $w$ as our measure of progress. Recall that we are working with vectors of dimension $d \le T(T-1)$ and let
$$\Phi(w_t) = D(u\|w_t) = \sum_{i=1}^d u_i \log\frac{u_i}{w_{t,i}}.$$
Let $\Delta_t = \Phi(w_t) - \Phi(w_{t+1})$. The telescoping sum $\sum_{t=1}^T \Delta_t = \Phi(w_1) - \Phi(w_{T+1})$ is not greater than $\Phi(w_1)$ because of the nonnegativity of $\Phi(w_{T+1})$. Furthermore,
$$\Phi(w_1) = \sum_{i=1}^d u_i\log u_i + \sum_{i=1}^d u_i\log d \le \log d \le 2\log T,$$
where we have used the non-negativity of the entropy of $u$ and the fact that $w_1$ is the uniform distribution. Putting all these together, we have the upper bound
$$\sum_{t=1}^T \Delta_t \le 2\log T. \qquad (2)$$
If on round $t$ there was no mistake, then $\Delta_t = 0$. Now assume that a mistake was made on round $t$. To lower bound $\Delta_t$, we first write $w_{t+1}$ in terms of $w_t$:
$$w_{t+1,i} = \frac{e^{\theta_{t+1,i}}}{\sum_j e^{\theta_{t+1,j}}} = \frac{w_{t,i}\, e^{\eta(y_t x_{t,i} + n_{t,i})}}{Z_t}, \qquad (3)$$
where $Z_t$ is the normalization constant
$$Z_t = \sum_{j=1}^d w_{t,j}\, e^{\eta(y_t x_{t,j} + n_{t,j})}.$$
Replacing the above in $\Delta_t = \sum_{i=1}^d u_i\log\frac{w_{t+1,i}}{w_{t,i}}$, which is just another way to define $\Delta_t$, leads to
$$\Delta_t = \sum_{i=1}^d u_i\log\frac{w_{t,i}\, e^{\eta(y_t x_{t,i}+n_{t,i})}}{Z_t\, w_{t,i}} = \eta y_t\langle u, x_t\rangle + \eta\langle u, n_t\rangle - \log Z_t. \qquad (4)$$
We bound the three terms from below. We use lemma 4 to bound the exponential in (3) by a linear function:
$$Z_t = \sum_{i=1}^d w_{t,i}\, e^{\eta(y_t x_{t,i} + n_{t,i})} \le \sum_{i=1}^d w_{t,i}\left(\sinh(\eta)(y_t x_{t,i} + n_{t,i}) + \cosh(\eta)\right) = \sinh(\eta)\, y_t\langle w_t, x_t\rangle + \sinh(\eta)\langle w_t, n_t\rangle + \cosh(\eta).$$
We have assumed that the algorithm made a mistake on round $t$; this means that $\sinh(\eta)\, y_t\langle w_t, x_t\rangle \le 0$. Furthermore, in lemma 2 we showed that either $\theta_{t,i} = 0$ or $n_{t,i} = 0$, and we have previously argued that when $\theta_{t,i}$ is zero, a decision of the form $\langle w_t, v\rangle$ effectively ignores coordinate $i$; therefore $\langle w_t, n_t\rangle = 0$. These observations lead to
$$Z_t \le \cosh(\eta).$$
Furthermore, from Hölder's inequality we have that
$$\langle u, n_t\rangle \ge -\|u\|_1\|n_t\|_\infty = -\|n_t\|_\infty.$$
Moreover, $y_t\langle u, x_t\rangle \ge \gamma - L_t$ by the definition of $L_t$, so, by combining the three bounds, (4) becomes
$$\Delta_t \ge \eta\left(\gamma - L_t - \|n_t\|_\infty\right) - \log\cosh(\eta). \qquad (5)$$
Summing up the individual lower bounds for all $t$ and using the definitions of $M_t$, $P_t$ and $L$, we get
$$\sum_{t=1}^T \Delta_t \ge M_T\left(\eta\gamma - \log\cosh(\eta)\right) - \eta P_T - \eta L.$$
Substituting the invariant $P_t \le M_t^{2/3}$ and combining this with the upper bound (2) leads to
$$M_T\left(\eta\gamma - \log\cosh(\eta)\right) - \eta M_T^{2/3} - \eta L \le 2\log T.$$
Let $z = M_T^{1/3}$. The inequality
$$\left(\eta\gamma - \log\cosh(\eta)\right)z^3 - \eta z^2 - \eta L - 2\log T \le 0$$
is valid when $z$ is less than the only real root[5] of the polynomial on the left-hand side. Let $\rho$ be this root; then the above inequality is equivalent to $z \le \rho$. Even though there is an exact analytical expression for $\rho$, much intuition can be gained from a simple upper bound (Marden, 1966, theorem (30,2)): for any $\beta \ge 1$,
$$\rho \le \max\left\{ \left(\frac{\beta(\eta L + 2\log T)}{\eta\gamma - \log\cosh(\eta)}\right)^{1/3},\ \frac{\beta\eta}{(\beta-1)\left(\eta\gamma - \log\cosh(\eta)\right)} \right\}. \qquad (6)$$
Using this bound we get that
$$M_T \le \max\left\{ \frac{2L}{\gamma} + \frac{8\log T}{\gamma^2},\ \frac{64}{\gamma^3} \right\}$$
by setting $\eta = \gamma/2$, $\beta = 3/2$, and using lemma 5.

[5] This is true assuming $\eta L + 2\log T > 0$, which is safe.

We further discuss this bound in the light of the theoretical and empirical results in the next two sections.

3.3. Growth Bound

We also bound how quickly Algorithm 2 grows the tree. In fact, we can show that the depth of the tree grows logarithmically with the number of mistakes.

Theorem 3. Let $A_1, A_2, \ldots$ be the sequence of trees generated by Algorithm 2 with $\epsilon = 1 - 2^{-1/3}$. Then, for all $T \ge 2$, the depth of $A_{T+1}$ is upper bounded by
$$\log_2 M_{T-1} + 3\log_2\frac{5}{2}.$$

Proof. The depth of $A_{T+1}$ is the maximum of the $d_t$ values, given by lemma 3, over all rounds $t$. However, if at round $t$ there was no mistake, or $d_t \le h_t$, then the tree is not grown. So it suffices to show that, for all rounds $t$ in which a mistake was made, the quantity $b(P_{t-1})$ in equation (1) is not greater than the bound. That expression is upper bounded by $\log_{1-\epsilon}\!\left(\sqrt[3]{P_{t-1}^3 + 2P_{t-1}^{3/2} + 1} - P_{t-1}\right)$. Now we switch to base 2, because we want to relate the bound on the depth of the tree to the actual size of the tree; we substitute $\epsilon = 1 - 2^{-1/3}$ and get
$$3\log_2\frac{1}{\sqrt[3]{P_{t-1}^3 + 2P_{t-1}^{3/2} + 1} - P_{t-1}}.$$
It is easy to show that the expression inside the logarithm can be upper bounded by $\frac{3\sqrt{P_{t-1}} + 2}{2}$, because such an inequality is equivalent to showing that the polynomial $36y^5 + 48y^4 + 7y^3 + 30y^2 + 36y$ is nonnegative for $y \ge 0$, where $y = \sqrt{P_{t-1}}$. Therefore the above quantity is upper bounded by
$$3\log_2\frac{3P_{t-1}^{1/2} + 2}{2} \le 3\log_2\frac{3M_{t-1}^{1/3} + 2}{2},$$
where we used the invariant $P_t \le M_t^{2/3}$. $M_t$ is a non-decreasing sequence and $M_{T-1} \ge 1$, since $T \ge 2$ and the first prediction is always a zero, which counts as a mistake. So we upper bound the above quantity by
$$3\log_2\frac{3M_{T-1}^{1/3} + 2}{2} \le \log_2 M_{T-1} + 3\log_2\frac{5}{2},$$
where we have used the inequality $\log_2(3\mu + 2) \le \log_2(5\mu)$, which is true for $\mu \ge 1$.
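The simulation below exercises the worst case for tree growth (a mistake on every round, with $h_t$ never forcing a larger depth) and compares the deepest level ever requested with the bound of Theorem 3 as reconstructed above. It is only an illustration of the statement, not part of the proof.

```python
import math

eps = 1 - 2 ** (-1.0 / 3.0)

def b(p):
    root = (p ** 3 + 2 * p ** 1.5 + 1) ** (1.0 / 3.0) - p
    return math.ceil(math.log(root, 1 - eps) - 1)

P, depth = 0.0, 0
for t in range(1, 1001):                   # a mistake on every round, so M_t = t
    d = max(0, b(P))                       # h_t = 0 in this worst-case simulation
    depth = max(depth, d)
    P += (1 - eps) ** (d + 1)
    if t >= 2:
        bound = math.log2(t - 1) + 3 * math.log2(5 / 2)
        assert depth <= bound, (t, depth, bound)
print("depth after 1000 mistakes:", depth)
```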
Theorem 3 says that the amount of memory needed by the tree grows only linearly with the number of mistakes. The respective growth bound given in (Dekel et al., 2004) also scales linearly, but with a larger constant. This is reflected in the experiments, too. However, the comparison of their mistake bound and ours is complex, and in general multiplicative and additive update algorithms have different merits (Kivinen & Warmuth, 1997). Our case is even more intricate because $x_t$ conceptually has twice as many dimensions as that in (Dekel et al., 2004) and the feature values are different. Therefore, the optimal weight vector in hindsight and the value of the margin will be different for the two algorithms. The latter is likely to be larger in our case because our feature values are larger. In this view, our mistake bound's dependence on $\frac{1}{\gamma^3}$ may not be important, since $\gamma$ may be large. Notice that for the multiplicative algorithm we can afford to push the features as close to 1 as we wish without changing the mistake bound, which depends only on $\|x_t\|_\infty \le 1$. The algorithm in (Dekel et al., 2004), however, depends on $\|x_t\|_2 \le 1$, and its features cannot be made larger without affecting its mistake or growth bound. In theorem 3 we set $\epsilon$ so that the features are as large as possible while keeping the PST size linear in the number of mistakes. Finally, our bound also depends on $\log T$, and even though this dependence also exists in a lower bound (Kivinen & Warmuth, 1997), that lower bound assumes that an adversary selects the feature values, which is not true in this case. Nevertheless, as the next section shows, the actual number of mistakes in practice may differ markedly from these worst-case bounds.

4. Experiments

We have conducted some experiments to demonstrate the effectiveness of the proposed algorithm. We are interested in two quantities: the online error rate and the size of the PST. Since the data from our motivating application are not publicly available, we first describe two reproducible experiments. We used the text of James Joyce's Ulysses to define two problems: predicting the next bit and predicting the next alphabetic character in the text (non-alphabetic characters were discarded in this case). The first two columns of Table 1 show the results of these experiments. For both tasks, our Winnow variant makes fewer mistakes and grows a smaller tree than the perceptron of (Dekel et al., 2004). In all experiments, $\eta$ was set to 0.1 and $\epsilon$ was set as in theorem 3. The baseline error rates are 50% and 88%, respectively.

Table 1. Summary of experimental results.

Dataset       Update   % Error   PST Size
Ulysses Bit   Perc.    24.32     675K
              Winn.    20.49     270K
Ulysses A-Z   Perc.    67.58     13.2M
              Winn.    65.58     10.3M
Firefox       Perc.    14.86     21081
              Winn.    13.88     12662
Excel         Perc.    22.68     24402
              Winn.    20.59     15338
Outlook       Perc.     5.1      41239
              Winn.     4.43     25679

Predicting the next letter, as well as the problems from our motivating application, are multiclass problems. To handle them, we adapt ideas from (Crammer & Singer, 2002) and maintain weights $w^{(1)}, \ldots, w^{(k)}$, one for each class. The decision at time $t$ is $\hat{y}_t = \arg\max_i \langle w^{(i)}, x_t\rangle$, and if $\hat{y}_t \ne y_t$ we update $w^{(\hat{y}_t)}$ and $w^{(y_t)}$ by updating their parameters: $\theta^{(\hat{y}_t)}_{t+1,i} = \theta^{(\hat{y}_t)}_{t,i} - \eta x_{t,i}$ and $\theta^{(y_t)}_{t+1,i} = \theta^{(y_t)}_{t,i} + \eta x_{t,i}$ (if $|s(i)| \le d_t$, as usual). A problem with this formulation is that we need the normalization constant $Z_t^{(j)}$ for each class $j$. Strictly speaking, $Z_t^{(j)}$ cannot be computed without a priori knowledge of the length $T$ of the sequence, because we conceptually have a weight for each substring of the whole sequence. However, the approximation
$$\tilde{Z}_t^{(j)} = \sum_{i \in A_t} e^{\theta^{(j)}_{t,i}}$$
works very well in practice, can be quickly computed from $\tilde{Z}_{t-1}^{(j)}$, and was used in all our experiments.
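A sketch of this multiclass rule is given below. It stores one dictionary of $\theta^+$ values per class and maintains the running approximation of each normalizer incrementally. Interpreting $\tilde{Z}^{(j)}$ as a sum over both halves of the $[\theta^+, \theta^-]$ representation (so each tree node contributes $2\cosh(\theta^+)$) is an assumption of this sketch, as are the class interface and names; the matched suffixes and $d_t$ would come from the same tree-growing machinery as in Algorithm 2.

```python
import math

class MulticlassPSTWinnow:
    """One parameter dictionary (suffix -> theta^+) per class, with running
    approximate normalizers maintained incrementally."""

    def __init__(self, classes, eta=0.1, eps=1 - 2 ** (-1.0 / 3.0)):
        self.eta, self.eps = eta, eps
        self.theta = {c: {} for c in classes}
        self.z = {c: 2.0 for c in classes}   # root node with theta = 0 contributes 2*cosh(0)

    def _adjust(self, c, sfx, delta):
        th = self.theta[c]
        old = th.get(sfx)
        if old is not None:
            self.z[c] -= 2 * math.cosh(old)   # remove the old contribution
        else:
            old = 0.0                          # a new tree node starts at zero
        th[sfx] = old + delta
        self.z[c] += 2 * math.cosh(old + delta)

    def scores(self, suffixes):
        """suffixes: matched suffix tuples of the current history, shortest first."""
        out = {}
        for c, th in self.theta.items():
            num = sum((1 - self.eps) ** len(s) * 2 * math.sinh(th[s])
                      for s in suffixes if s in th)
            out[c] = num / self.z[c]
        return out

    def predict(self, suffixes):
        sc = self.scores(suffixes)
        return max(sc, key=sc.get)

    def update(self, suffixes, y_true, y_pred, d_t):
        if y_true == y_pred:
            return
        for s in suffixes:
            if len(s) > d_t:                       # prune as in Algorithm 2
                continue
            x = (1 - self.eps) ** len(s)
            self._adjust(y_pred, s, -self.eta * x)   # demote the predicted class
            self._adjust(y_true, s, +self.eta * x)   # promote the correct class
```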
We now turn to experiments from our motivating application. The task is to predict the next system call a program will execute, based on the previous system calls. Our data consist of three applications, Firefox and Microsoft's Excel and Outlook, for each of which we have 40 sequences of system calls. Our monitoring application was recording 23 different system calls. The last three columns in Table 1 summarize the results of all experiments. For brevity, we report for each application the average online prediction error and tree size over the 40 sequences. However, Table 1 understates the difference between the two algorithms on this problem, since the multiplicative algorithm makes fewer mistakes and grows smaller trees than the additive one for all three applications and all tested sequences. All paired-sample two-sided t-tests showed that the differences in the reported quantities are significant ($p < 10^{-5}$ for all tests). However, there exist sequences that can force Winnow to make more mistakes than the perceptron: in two out of the 120 sequences, Winnow was initially making more mistakes and started outperforming the perceptron only after the first half of the sequence.

Finally, it is interesting to note that if instead of $P_t \le M_t^{2/3}$ we had enforced $P_t \le M_t^{1/2}$, we would have derived a mistake bound that always scales with $\frac{1}{\gamma^2}$ and a similar growth bound for $\epsilon = 1 - 2^{-1/2}$. Surprisingly, this variant has empirically no clear advantage over the additive algorithm. Our analysis allows selecting a smaller $\epsilon$, i.e. larger features, which in turn allow the margin to be larger, and this improves the mistake bound. At the same time, larger features mean that the norms of the noise vectors will be larger. Hence we need to decide how much noise we should tolerate. Too much noise will hurt the mistake bound, and too little will cause the tree to grow very fast. A good balance can be achieved by requiring the size of the tree to scale linearly with the number of mistakes.

5. Related Work

Apart from (Dekel et al., 2004), with which we compared in the experiments, there are many other methods for learning PSTs, based on a wide range of principles such as PAC learning (Ron et al., 1996), structural risk minimization (Kearns & Mansour, 1998), and online Bayesian mixtures (Willems et al., 1995) and their generalizations (Helmbold & Schapire, 1997; Pereira & Singer, 1999). All these methods assume that we have a bound on the maximum depth of the tree. For many applications this is not known, and the algorithm should be allowed to estimate a good value of this parameter from the data.

(Kivinen & Warmuth, 1997) contains many results and insights about multiplicative algorithms, including variants that are competitive with any vector $u$ such that $\|u\|_1 \le U$. Extending our algorithm to be competitive with any vector in this ball is also possible. Additionally, many authors, starting with (Blum, 1997), have noticed that Winnow's weight vector can be sparsified by zeroing the entries which have small weights compared to the largest weight in the hypothesis. This procedure is effective in practice but comes with no theoretical guarantees. In the case of learning PSTs, however, the tree implicitly defines a partial order on the costs of setting the parameters to nonzero values. This allows us to have a sparse hypothesis and still be able to characterize the number of mistakes.

6. Conclusions

We presented a modification of Balanced Winnow for learning PSTs in order to predict the next item in a sequence. The algorithm presented here does not rely on any assumptions about an underlying PST that generates the data. Our algorithm comes with theoretical guarantees about the number of mistakes it will make relative to the best PST determined in hindsight and about the amount of memory it will use to store its hypothesis. In all our experiments we found that it makes fewer mistakes and uses less memory than an additive algorithm with similar guarantees (Dekel et al., 2004).

Acknowledgments

We thank Robert Kleinberg for many valuable discussions, Carla Marceau and ATC-NY for providing the data, and the anonymous reviewers for their helpful comments. This work was supported by the Air Force Office of Sponsored Research under contract FA955007-C-0127, Application Monitors for Not-Yet-Trusted Software. The Small Business Technology Transfer effort was carried out jointly by Cornell and ATC-NY.
References

Blum, A. (1997). Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5–23.

Buhlmann, P., & Wyner, A. (1999). Variable length Markov chains. Annals of Statistics, 27, 480–513.

Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2004). The power of selective memory: Self-bounded learning of prediction suffix trees. Neural Information Processing Systems, 17, 345–352.

Helmbold, D., & Schapire, R. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.

Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. International Conference on Machine Learning, 269–277.

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63.

Littlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation, Santa Cruz, CA, USA.

Marden, M. (1966). Geometry of polynomials. AMS.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36, 183–199.

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

Shalev-Shwartz, S. (2007). Online learning: Theory, algorithms, and applications. Doctoral dissertation, The Hebrew University of Jerusalem.

Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41, 653–664.