A Stochastic Memoizer for Sequence Data

Frank Wood fwood@gatsby.ucl.ac.uk
Cédric Archambeau c.archambeau@cs.ucl.ac.uk
Jan Gasthaus j.gasthaus@gatsby.ucl.ac.uk
Lancelot James lancelot@ust.hk
Yee Whye Teh ywteh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK
Centre for Computational Statistics and Machine Learning, University College London, Gower Street, London, WC1E 6BT, UK
Department of Information and Systems Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

Abstract

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.

1. Introduction

A Markov assumption is often made when modeling sequence data. This assumption stipulates that, conditioned on the present value of the sequence, the past and the future are independent.
Making this assumption allows one to fully characterize a sequential process in terms of a set of conditional distributions that describe the dependence of future values on a finite history (or context) of values. The length of this context is called the order of the Markov model. The literature provides ample evidence that making such an assumption is often reasonable in practice: even data that is clearly not Markov in nature (for instance, natural language) is often described well enough by Markov models for such models to be of significant practical utility. Increasing the order of the Markov model often improves application performance. Unfortunately, it is often difficult to increase the order in practice because doing so requires either vastly greater amounts of training data or significantly more complicated smoothing procedures.

In this work we propose a non-Markov model for stationary discrete sequence data. The model is non-Markov in the sense that the next value in a sequence is modelled as being conditionally dependent on all previous values in the sequence. It is immediately clear that such a model must have a very large number of latent variables. To constrain the learning of these latent variables, we employ a hierarchical Bayesian prior based on Pitman-Yor processes which promotes sharing of statistical strength between subsequent symbol predictive distributions for equivalent contexts of different lengths (Teh, 2006). We find that we can analytically marginalize out most latent variables, leaving a number that is linear in the size of the input sequence. We demonstrate that inference in the resulting collapsed model is tractable and efficient.

Posterior inference in the model can be understood as stochastically "memoizing" (Michie, 1968) context/observation pairs. While memoization refers to deterministic caching of function outputs given inputs, what we mean by stochastic memoization is exactly that used by Goodman et al. (2008): calling a function multiple times with the same arguments may return an instance from a set of previous return values, but may also return a new value. We call our contribution a stochastic memoizer for sequence data (sequence memoizer (SM) for short) because, given a context (the argument), it will either return a symbol that was already generated in that full context, a symbol that was returned given a context that is shorter by one symbol, or, at the recursion base, potentially something entirely novel.

The stochastic memoizer for sequence data consists of a model and efficient algorithms for model construction and inference. In the next section we formalize what we mean by a non-Markov model and define the prior we use in the sequence memoizer. In Section 3 we explain how a posterior sampler for the sequence memoizer given a finite sequence of observations can be constructed and represented in linear space and time. In Section 4 we explain the marginalization operations necessary to achieve such a representation. In Section 5 we discuss sequence memoizer inference, particularly the novel steps necessary to perform predictive inference in contexts that do not occur in the training data. Finally, in Section 6 we use the sequence memoizer for language modelling and demonstrate promising empirical results.

2. Non-Markov Model

Consider a sequence of discrete random variables x_{1:T} = (x_1 x_2 \cdots x_T) of arbitrary length T, each taking values in a symbol set \Sigma. The joint distribution over the sequence is

    P(x_{1:T}) = \prod_{i=1}^{T} P(x_i | x_{1:i-1}),    (1)

where each factor on the right hand side is the predictive probability of x_i given a context consisting of all preceding variables x_{1:i-1}. When one makes an n-th order Markov approximation to (1), it is assumed that only the values taken by at most the preceding n variables matter for predicting the value of the next variable in the sequence, i.e. P(x_i | x_{1:i-1}) = P(x_i | x_{i-n:i-1}) for all i. If the context is not truncated to some fixed length, we say the model is non-Markovian.

When learning such a model from data, a vector of predictive probabilities for the next symbol given each possible context must be learned. Let s \in \Sigma^* be a finite sequence of symbols (of arbitrary length). Let G_{[s]}(v) be the probability of the following variable taking value v given the context s. Denote by G_{[s]} the vector of probabilities (parameters) with one element for each v \in \Sigma. Estimating parameters that generalize well to unseen contexts given a single training sequence might seem a priori unreasonable. For example, if our training sequence were x_{1:T} = s, it is easy to see that there is only a single observation x_i = s_i in the context x_{1:i-1} = s_{1:i-1} for every prefix s_{1:i-1}. In most cases this single observation clearly will not be sufficient to estimate a whole parameter vector G_{[s_{1:i-1}]} that generalizes in any reasonable way.
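To make this sparsity concrete, the following minimal sketch (not from the paper; the helper name, the toy string, and the output formatting are our own) gathers next-symbol counts for every context of a single training sequence, either with unbounded contexts or under an n-th order Markov truncation as in (1). With unbounded contexts, every maximal context is a distinct prefix of the training string and is therefore observed exactly once.

```python
from collections import defaultdict


def context_counts(x, n=None):
    """Collect next-symbol counts for each context in a single sequence.

    If n is None, the full (unbounded) context x[:i] is used for x[i];
    otherwise contexts are truncated to the last n symbols, i.e. an
    n-th order Markov assumption.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(x):
        context = x[:i] if n is None else x[max(0, i - n):i]
        counts[context][symbol] += 1
    return counts


if __name__ == "__main__":
    x = "oacac"                        # the example string used in Figure 1
    full = context_counts(x)           # unbounded contexts
    bigram = context_counts(x, n=1)    # 1st-order Markov contexts
    # Every unbounded context is a distinct prefix of x, so each maximal
    # context is observed exactly once; counts alone cannot estimate a
    # whole predictive vector G[s] for such contexts.
    print({c: dict(v) for c, v in full.items()})
    print({c: dict(v) for c, v in bigram.items()})
```

Running this on the example string shows exactly one count per full-length context, while the first-order truncation pools observations but discards most of the history.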
In the following we describe a prior that hierarchically ties together the vector of predictive probabilities in a particular context to vectors of probabilities in related, shorter contexts. By doing this we are able to use observations that occur in very long contexts to recursively inform the estimation of the predictive probabilities for related shorter contexts, and vice versa. The way we do this is to place a hierarchical Bayesian prior over the set of probability vectors {G_{[s]}}_{s \in \Sigma^*}. On the root node we place a Pitman-Yor prior (Pitman & Yor, 1997; Ishwaran & James, 2001) on the probability vector G_{[]} corresponding to the empty context []:

    G_{[]} | d_0, c_0, H \sim \mathrm{PY}(d_0, c_0, H),    (2)

where d_0 is the discount parameter, c_0 the concentration parameter and H the base distribution.[1] For simplicity we take H to be the uniform distribution over the (assumed) finite symbol set \Sigma.

At the first level, the random measures {G_{[s]}}_{s \in \Sigma} are conditionally independent given G_{[]}, with distributions given by Pitman-Yor processes with discount parameter d_1, concentration parameter c_1 and base distribution G_{[]}:

    G_{[s]} | d_1, c_1, G_{[]} \sim \mathrm{PY}(d_1, c_1, G_{[]}).    (3)

The hierarchy is defined recursively for any number of levels. For each non-empty finite sequence of symbols s, we have

    G_{[s]} | d_{|s|}, c_{|s|}, G_{[s']} \sim \mathrm{PY}(d_{|s|}, c_{|s|}, G_{[s']}),    (4)

where [s] = [s_1 s'] for some symbol s_1 \in \Sigma, that is, s' is s with the first contextual symbol removed, and |r| is the length of a string r. The resulting graphical model can be infinitely deep and is tree-structured, with a random probability vector on each node. The number of branches descending from each node is given by the number of elements in \Sigma.

[1] In the statistics literature the discount parameter is typically denoted by \alpha and the concentration parameter by \theta. In the machine learning literature \alpha is often used to denote the concentration parameter instead. We use different symbols here to avoid confusion.

Figure 1. (a) Prefix trie and (b) corresponding prefix tree for the string oacac. Note that (a) and (b) correspond to the suffix trie and the suffix tree of cacao. (c) Chinese restaurant franchise sampler representation of the subtree highlighted in (b).

The hierarchical Pitman-Yor process (HPYP) with finite depth has been applied to language models (Teh, 2006), producing state-of-the-art results. It has also been applied to unsupervised image segmentation (Sudderth & Jordan, 2009). Defining an HPYP of unbounded depth is straightforward given the recursive nature of the HPYP formulation; one contribution of this paper is to make inference in such a model tractable and efficient.

A well-known special case of the HPYP is the hierarchical Dirichlet process (Teh et al., 2006), which arises from setting d_n = 0 for all n \ge 0. Here, we will use a less well-known special case in which c_n = 0 for all n \ge 0. In this parameter setting the Pitman-Yor process specializes to a normalized stable process (Perman, 1990). We use this particular prior because, as we shall see, it makes it possible to construct representations of the posterior of this model in time and space linear in the length of a training observation sequence. The trade-off between this particular parameterization of the Pitman-Yor process and one in which non-zero concentrations are allowed is studied in Section 6 and shown to be inconsequential in the language modelling domain. This is largely due to the fact that the discount parameter and the concentration parameter both add mass to the base distribution in the Pitman-Yor process. This notwithstanding, the potential detriment of using a less expressive prior is often outweighed when gains in computational efficiency mean that more data can be modelled, albeit with a slightly less expressive prior.
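The sketch below is not the paper's inference algorithm; it only illustrates, for assumed seating counts, the standard Chinese-restaurant-franchise predictive rule for a hierarchical Pitman-Yor model (Teh, 2006) with the back-off structure of (2)-(4). The per-context customer and table counts, the dictionary layout, and the function name are our own placeholders; an actual sampler would maintain these counts. Setting all concentrations to zero gives the c_n = 0 (normalized stable) special case used in this paper.

```python
from collections import defaultdict


def predictive_prob(symbol, context, restaurants, discounts, concentrations,
                    num_symbols):
    """Chinese-restaurant-franchise style predictive probability.

    `restaurants` maps a context string to a pair (customers, tables),
    each a dict from symbol to count, as maintained by some seating-based
    inference procedure (assumed given here). The recursion mirrors the
    hierarchy in (2)-(4): a context backs off to the context obtained by
    dropping its earliest symbol, bottoming out in the uniform base
    distribution H over `num_symbols` symbols.
    """
    if context is None:                          # parent of the empty context
        return 1.0 / num_symbols                 # uniform base distribution H
    parent = context[1:] if context else None    # drop the earliest symbol
    d = discounts[len(context)]                  # d_{|s|}
    theta = concentrations[len(context)]         # c_{|s|}; 0 gives the
                                                 # normalized stable case
    customers, tables = restaurants.get(context, ({}, {}))
    c_total = sum(customers.values())
    t_total = sum(tables.values())
    if c_total == 0:                             # nothing observed here,
        return predictive_prob(symbol, parent, restaurants, discounts,
                               concentrations, num_symbols)
    c_v = customers.get(symbol, 0)
    t_v = tables.get(symbol, 0)
    own = max(c_v - d * t_v, 0.0) / (c_total + theta)
    backoff_mass = (theta + d * t_total) / (c_total + theta)
    return own + backoff_mass * predictive_prob(symbol, parent, restaurants,
                                                discounts, concentrations,
                                                num_symbols)


if __name__ == "__main__":
    discounts = defaultdict(lambda: 0.5)     # d_n = 0.5 at every depth
    concentrations = defaultdict(float)      # c_n = 0 for all n
    # A single toy restaurant at the empty context; deeper contexts are
    # empty, so queries in long contexts fall straight through to it.
    restaurants = {"": ({"a": 2, "c": 2, "o": 1}, {"a": 1, "c": 1, "o": 1})}
    print(predictive_prob("a", "oac", restaurants, discounts,
                          concentrations, num_symbols=3))   # 0.4
```

With zero concentration, the discount alone determines how much mass each context passes down to its shorter parent context, which is one way to see why discount and concentration play overlapping roles in adding mass to the base distribution.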
We are ultimately interested in the predictive distribution for a continuation of the original sequence (or a new sequence of observations y_{1:\infty}), conditioned on having already observed x_{1:T}. Inference in the sequence memoizer as described is computationally intractable because it contains an infinite number of latent variables {G_{[s]}}_{s \in \Sigma^*}. In this section we describe two steps that can be taken to reduce the number of these variables such that inference becomes feasible (and efficient). First, consider a single, finite training sequence s consisting of T symbols. The only variables that will have observations associated with them are the ones that correspond to contexts that are prefixes of s, i.e. {G_{[s_{1:i}]}}_{0 \le i < T}.
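As a toy illustration of which latent variables a training sequence actually touches (the helper names and the example string are ours, not the paper's construction), the sketch below enumerates the observation-bearing contexts, the prefixes of s, and then closes them under the back-off operation of (4) by repeatedly dropping the earliest symbol. Each context in the closure corresponds to a node of a context trie like the one in Figure 1(a).

```python
def observation_contexts(x):
    """Contexts in which a symbol of x is observed: the prefixes x[:i]
    for 0 <= i < len(x)."""
    return [x[:i] for i in range(len(x))]


def required_contexts(x):
    """Close the observation contexts under the back-off (parent)
    operation of the hierarchy: repeatedly drop the earliest symbol.
    Each resulting context is a node of a trie like Figure 1(a)."""
    needed = set()
    for context in observation_contexts(x):
        while True:
            needed.add(context)
            if not context:
                break
            context = context[1:]     # back off to the shorter context
    return needed


if __name__ == "__main__":
    x = "oacac"
    obs = observation_contexts(x)      # ['', 'o', 'oa', 'oac', 'oaca']
    nodes = required_contexts(x)
    print(len(obs), len(nodes))        # 5 observation contexts, 10 contexts
                                       # in the back-off closure
```

For a length-T sequence there are T observation-bearing contexts, but the back-off closure can contain on the order of T^2 distinct contexts; representing every such node explicitly is what the paper's coagulation-based marginalization steps avoid, yielding a representation linear in the length of the training sequence.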