A Stochastic Memoizer for Sequence Data

Frank Wood fwood@gatsby.ucl.ac.uk
Cédric Archambeau c.archambeau@cs.ucl.ac.uk
Jan Gasthaus j.gasthaus@gatsby.ucl.ac.uk
Lancelot James lancelot@ust.hk
Yee Whye Teh ywteh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK
Centre for Computational Statistics and Machine Learning, University College London, Gower Street, London, WC1E 6BT, UK
Department of Information and Systems Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

Abstract

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.

1. Introduction

A Markov assumption is often made when modeling sequence data. This assumption stipulates that, conditioned on the present value of the sequence, the past and the future are independent.
Making this assumption allows one to fully characterize a sequential process in terms of a set of conditional distributions that describe the dependence of future values on a finite history (or context) of values. The length of this context is called the order of the Markov model. The literature provides ample evidence that making such an assumption is often reasonable in practice: even data that is clearly not Markov in nature (for instance, natural language) is often described well enough by Markov models for such models to be of significant practical utility. Increasing the order of the Markov model often improves application performance. Unfortunately, it is often difficult to increase the order in practice because doing so requires either vastly greater amounts of training data or significantly more complicated smoothing procedures.

In this work we propose a non-Markov model for stationary discrete sequence data. The model is non-Markov in the sense that the next value in a sequence is modelled as being conditionally dependent on all previous values in the sequence. It is immediately clear that such a model must have a very large number of latent variables. To constrain the learning of these latent variables, we employ a hierarchical Bayesian prior based on Pitman-Yor processes which promotes sharing of statistical strength between subsequent symbol predictive distributions for equivalent contexts of different lengths (Teh, 2006). We find that we can analytically marginalize out most latent variables, leaving a number that is linear in the size of the input sequence. We demonstrate that inference in the resulting collapsed model is tractable and efficient.

Posterior inference in the model can be understood as stochastically "memoizing" (Michie, 1968) context/observation pairs. While memoization refers to deterministic caching of function outputs given inputs, what we mean by stochastic memoization is exactly that used by Goodman et al. (2008): calling a function multiple times with the same arguments may return an instance from a set of previous return values, but may also return a new value. We call our contribution a stochastic memoizer for sequence data (sequence memoizer (SM) for short) because, given a context (the argument), it will either return a symbol that was already generated in that full context, a symbol that was returned given a context that is shorter by one symbol, or, at the recursion base, potentially something entirely novel.

The stochastic memoizer for sequence data consists of a model and efficient algorithms for model construction and inference. In the next section we formalize what we mean by a non-Markov model and define the prior we use in the sequence memoizer. In Section 3 we explain how a posterior sampler for the sequence memoizer given a finite sequence of observations can be constructed and represented in linear space and time. In Section 4 we explain the marginalization operations necessary to achieve such a representation. In Section 5 we discuss sequence memoizer inference, particularly the novel steps necessary to perform predictive inference in contexts that do not occur in the training data. Finally, in Section 6 we use the sequence memoizer for language modelling and demonstrate promising empirical results.

2. Non-Markov Model

Consider a sequence of discrete random variables x_{1:T} = (x_1 x_2 \cdots x_T) of arbitrary length T, each taking values in a symbol set \Sigma. The joint distribution over the sequence is

    P(x_{1:T}) = \prod_{i=1}^{T} P(x_i | x_{1:i-1}),    (1)

where each factor on the right hand side is the predictive probability of x_i given a context consisting of all preceding variables x_{1:i-1}. When one makes an n-th order Markov approximation to (1), it is assumed that only the values taken by at most the preceding n variables matter for predicting the value of the next variable in the sequence, i.e. P(x_i | x_{1:i-1}) = P(x_i | x_{i-n:i-1}) for all i. If the context is not truncated to some fixed length, we say the model is non-Markovian.

When learning such a model from data, a vector of predictive probabilities for the next symbol given each possible context must be learned. Let s \in \Sigma^* be a finite sequence of symbols (of arbitrary length). Let G_{[s]}(v) be the probability of the following variable taking value v given the context s. Denote by G_{[s]} the vector of probabilities (parameters) with one element for each v \in \Sigma. Estimating parameters that generalize well to unseen contexts given a single training sequence might seem a priori unreasonable. For example, if our training sequence were x_{1:T} = s, it is easy to see that there is only a single observation x_i = s_i in the context x_{1:i-1} = s_{1:i-1} for every prefix s_{1:i-1}. In most cases this single observation clearly will not be sufficient to estimate a whole parameter vector G_{[s_{1:i-1}]} that generalizes in any reasonable way.
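To make this sparsity concrete, the following minimal sketch (not from the paper; the helper name, the toy string, and the output formatting are our own) gathers next-symbol counts for every context of a single training sequence, either with unbounded contexts or under an n-th order Markov truncation as in (1). With unbounded contexts, every maximal context is a distinct prefix of the training string and is therefore observed exactly once.

```python
from collections import defaultdict


def context_counts(x, n=None):
    """Collect next-symbol counts for each context in a single sequence.

    If n is None, the full (unbounded) context x[:i] is used for x[i];
    otherwise contexts are truncated to the last n symbols, i.e. an
    n-th order Markov assumption.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(x):
        context = x[:i] if n is None else x[max(0, i - n):i]
        counts[context][symbol] += 1
    return counts


if __name__ == "__main__":
    x = "oacac"                        # the example string used in Figure 1
    full = context_counts(x)           # unbounded contexts
    bigram = context_counts(x, n=1)    # 1st-order Markov contexts
    # Every unbounded context is a distinct prefix of x, so each maximal
    # context is observed exactly once; counts alone cannot estimate a
    # whole predictive vector G[s] for such contexts.
    print({c: dict(v) for c, v in full.items()})
    print({c: dict(v) for c, v in bigram.items()})
```

Running this on the example string shows exactly one count per full-length context, while the first-order truncation pools observations but discards most of the history.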
In the following we describe a prior that hierarchically ties together the vector of predictive probabilities in a particular context to vectors of probabilities in related, shorter contexts. By doing this we are able to use observations that occur in very long contexts to recursively inform the estimation of the predictive probabilities for related shorter contexts, and vice versa. The way we do this is to place a hierarchical Bayesian prior over the set of probability vectors {G_{[s]}}_{s \in \Sigma^*}. On the root node we place a Pitman-Yor prior (Pitman & Yor, 1997; Ishwaran & James, 2001) on the probability vector G_{[]} corresponding to the empty context []:

    G_{[]} | d_0, c_0, H \sim \mathrm{PY}(d_0, c_0, H),    (2)

where d_0 is the discount parameter, c_0 the concentration parameter and H the base distribution.[1] For simplicity we take H to be the uniform distribution over the (assumed) finite symbol set \Sigma.

At the first level, the random measures {G_{[s]}}_{s \in \Sigma} are conditionally independent given G_{[]}, with distributions given by Pitman-Yor processes with discount parameter d_1, concentration parameter c_1 and base distribution G_{[]}:

    G_{[s]} | d_1, c_1, G_{[]} \sim \mathrm{PY}(d_1, c_1, G_{[]}).    (3)

The hierarchy is defined recursively for any number of levels. For each non-empty finite sequence of symbols s, we have

    G_{[s]} | d_{|s|}, c_{|s|}, G_{[s']} \sim \mathrm{PY}(d_{|s|}, c_{|s|}, G_{[s']}),    (4)

where [s] = [s_1 s'] for some symbol s_1 \in \Sigma, that is, s' is s with the first contextual symbol removed, and |r| is the length of a string r. The resulting graphical model can be infinitely deep and is tree-structured, with a random probability vector on each node. The number of branches descending from each node is given by the number of elements in \Sigma.

[1] In the statistics literature the discount parameter is typically denoted by \alpha and the concentration parameter by \theta. In the machine learning literature \alpha is often used to denote the concentration parameter instead. We use different symbols here to avoid confusion.

Figure 1. (a) Prefix trie and (b) corresponding prefix tree for the string oacac. Note that (a) and (b) correspond to the suffix trie and the suffix tree of cacao. (c) Chinese restaurant franchise sampler representation of the subtree highlighted in (b).

The hierarchical Pitman-Yor process (HPYP) with finite depth has been applied to language models (Teh, 2006), producing state-of-the-art results. It has also been applied to unsupervised image segmentation (Sudderth & Jordan, 2009). Defining an HPYP of unbounded depth is straightforward given the recursive nature of the HPYP formulation; one contribution of this paper is to make inference in such a model tractable and efficient.

A well-known special case of the HPYP is the hierarchical Dirichlet process (Teh et al., 2006), which arises from setting d_n = 0 for all n \ge 0. Here, we will use a less well-known special case in which c_n = 0 for all n \ge 0. In this parameter setting the Pitman-Yor process specializes to a normalized stable process (Perman, 1990). We use this particular prior because, as we shall see, it makes it possible to construct representations of the posterior of this model in time and space linear in the length of a training observation sequence. The trade-off between this particular parameterization of the Pitman-Yor process and one in which non-zero concentrations are allowed is studied in Section 6 and shown to be inconsequential in the language modelling domain. This is largely due to the fact that the discount parameter and the concentration parameter both add mass to the base distribution in the Pitman-Yor process. This notwithstanding, the potential detriment of using a less expressive prior is often outweighed when gains in computational efficiency mean that more data can be modelled, albeit with a slightly less expressive prior.
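The sketch below is not the paper's inference algorithm; it only illustrates, for assumed seating counts, the standard Chinese-restaurant-franchise predictive rule for a hierarchical Pitman-Yor model (Teh, 2006) with the back-off structure of (2)-(4). The per-context customer and table counts, the dictionary layout, and the function name are our own placeholders; an actual sampler would maintain these counts. Setting all concentrations to zero gives the c_n = 0 (normalized stable) special case used in this paper.

```python
from collections import defaultdict


def predictive_prob(symbol, context, restaurants, discounts, concentrations,
                    num_symbols):
    """Chinese-restaurant-franchise style predictive probability.

    `restaurants` maps a context string to a pair (customers, tables),
    each a dict from symbol to count, as maintained by some seating-based
    inference procedure (assumed given here). The recursion mirrors the
    hierarchy in (2)-(4): a context backs off to the context obtained by
    dropping its earliest symbol, bottoming out in the uniform base
    distribution H over `num_symbols` symbols.
    """
    if context is None:                          # parent of the empty context
        return 1.0 / num_symbols                 # uniform base distribution H
    parent = context[1:] if context else None    # drop the earliest symbol
    d = discounts[len(context)]                  # d_{|s|}
    theta = concentrations[len(context)]         # c_{|s|}; 0 gives the
                                                 # normalized stable case
    customers, tables = restaurants.get(context, ({}, {}))
    c_total = sum(customers.values())
    t_total = sum(tables.values())
    if c_total == 0:                             # nothing observed here,
        return predictive_prob(symbol, parent, restaurants, discounts,
                               concentrations, num_symbols)
    c_v = customers.get(symbol, 0)
    t_v = tables.get(symbol, 0)
    own = max(c_v - d * t_v, 0.0) / (c_total + theta)
    backoff_mass = (theta + d * t_total) / (c_total + theta)
    return own + backoff_mass * predictive_prob(symbol, parent, restaurants,
                                                discounts, concentrations,
                                                num_symbols)


if __name__ == "__main__":
    discounts = defaultdict(lambda: 0.5)     # d_n = 0.5 at every depth
    concentrations = defaultdict(float)      # c_n = 0 for all n
    # A single toy restaurant at the empty context; deeper contexts are
    # empty, so queries in long contexts fall straight through to it.
    restaurants = {"": ({"a": 2, "c": 2, "o": 1}, {"a": 1, "c": 1, "o": 1})}
    print(predictive_prob("a", "oac", restaurants, discounts,
                          concentrations, num_symbols=3))   # 0.4
```

With zero concentration, the discount alone determines how much mass each context passes down to its shorter parent context, which is one way to see why discount and concentration play overlapping roles in adding mass to the base distribution.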
We are ultimately interested in the predictive distribution for a continuation of the original sequence (or a new sequence of observations y_{1:\infty}), conditioned on having already observed x_{1:T}. Inference in the sequence memoizer as described is computationally intractable because it contains an infinite number of latent variables {G_{[s]}}_{s \in \Sigma^*}. In this section we describe two steps that can be taken to reduce the number of these variables such that inference becomes feasible (and efficient). First, consider a single, finite training sequence s consisting of T symbols. The only variables that will have observations associated with them are the ones that correspond to contexts that are prefixes of s, i.e. {G_{[s_{1:i}]}}_{0 \le i < T}.
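As a toy illustration of which latent variables a training sequence actually touches (the helper names and the example string are ours, not the paper's construction), the sketch below enumerates the observation-bearing contexts, the prefixes of s, and then closes them under the back-off operation of (4) by repeatedly dropping the earliest symbol. Each context in the closure corresponds to a node of a context trie like the one in Figure 1(a).

```python
def observation_contexts(x):
    """Contexts in which a symbol of x is observed: the prefixes x[:i]
    for 0 <= i < len(x)."""
    return [x[:i] for i in range(len(x))]


def required_contexts(x):
    """Close the observation contexts under the back-off (parent)
    operation of the hierarchy: repeatedly drop the earliest symbol.
    Each resulting context is a node of a trie like Figure 1(a)."""
    needed = set()
    for context in observation_contexts(x):
        while True:
            needed.add(context)
            if not context:
                break
            context = context[1:]     # back off to the shorter context
    return needed


if __name__ == "__main__":
    x = "oacac"
    obs = observation_contexts(x)      # ['', 'o', 'oa', 'oac', 'oaca']
    nodes = required_contexts(x)
    print(len(obs), len(nodes))        # 5 observation contexts, 10 contexts
                                       # in the back-off closure
```

For a length-T sequence there are T observation-bearing contexts, but the back-off closure can contain on the order of T^2 distinct contexts; representing every such node explicitly is what the paper's coagulation-based marginalization steps avoid, yielding a representation linear in the length of the training sequence.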