Sequential Bayesian Prediction in the Presence of Changepoints

Roman Garnett, Michael A. Osborne, Stephen J. Roberts
{rgarnett, mosb, sjrob}@robots.ox.ac.uk
Department of Engineering Science, University of Oxford, Oxford, UK OX1 3PJ

Abstract

We introduce a new sequential algorithm for making robust predictions in the presence of changepoints. Unlike previous approaches, which focus on the problem of detecting and locating changepoints, our algorithm focuses on the problem of making predictions even when such changes might be present. We introduce nonstationary covariance functions to be used in Gaussian process prediction that model such changes, then proceed to demonstrate how to effectively manage the hyperparameters associated with those covariance functions. By using Bayesian quadrature, we can integrate out the hyperparameters, allowing us to calculate the marginal predictive distribution. Furthermore, if desired, the posterior distribution over putative changepoint locations can be calculated as a natural byproduct of our prediction algorithm.

1. Introduction

We consider the problem of performing time-series prediction in the face of abrupt changes to the properties of the variable of interest. For example, a data stream might undergo a sudden shift in its mean, variance, or characteristic input scale; a periodic signal might have a change in period, amplitude, or phase; or a signal might undergo a change so drastic that its behavior after a particular point in time is completely independent of what happened before. A robust prediction algorithm must be able to make accurate predictions even under such unfavorable conditions.

The problem of detecting and locating abrupt changes in data sequences has been studied under the name changepoint detection for decades. A large number of methods have been proposed for this problem; see (Basseville & Nikiforov, 1993; Brodsky & Darkhovsky, 1993; Csorgo & Horvath, 1997; Chen & Gupta, 2000) and the references therein for more information. Relatively few algorithms perform prediction simultaneously with changepoint detection, although sequential Bayesian methods do exist for this problem (Chernoff & Zacks, 1964; Adams & MacKay, 2007). However, these methods, and most methods for changepoint detection in general, make the assumption that the data stream can be segmented into disjoint sequences, such that in each segment the data represent i.i.d. observations from an associated probability distribution. The problem of changepoints in dependent processes has received less attention. Both Bayesian (Carlin et al., 1992; Ray & Tsay, 2002) and non-Bayesian (Müller, 1992; Horváth & Kokoszka, 1997) solutions do exist, although they focus on retrospective changepoint detection alone; their simple dependent models are not employed for the purposes of prediction. Sequential and dependent changepoint detection has been performed (Fearnhead & Liu, 2007) only for a limited set of changepoint models.

We introduce a fully Bayesian framework for performing sequential time-series prediction in the presence of drastic changes in the characteristics of the data. We introduce classes of nonstationary covariance functions to be used in Gaussian process inference for modelling functions with changepoints. In this context, the position of a particular changepoint becomes a hyperparameter of the model.
We then proceed as usual: to make predictions, we estimate the full marginal predictive distribution. If the locations of changepoints in the data are of interest, we estimate the full posterior distribution of the related hyperparameters given the data. The result is a robust time-series prediction algorithm that makes well-informed predictions even in the presence of sudden changes in the data. If desired, the algorithm additionally performs changepoint detection as a natural byproduct of the prediction process.

The remainder of this paper is arranged as follows. In the next section, we briefly introduce Gaussian processes and discuss the marginalization of hyperparameters using Bayesian Monte Carlo numerical integration. A similar technique is presented to produce posterior distributions and their means for any hyperparameters of interest. Next we introduce a class of nonstationary covariance functions to model functions with changepoints. In Section 5 we provide a brief expository example of our algorithm. Finally, we provide results demonstrating the ability of our model to make robust predictions and locate changepoints effectively.

2. Gaussian Process Prediction

Gaussian processes (GPs) offer a powerful method to perform Bayesian inference about functions (Rasmussen & Williams, 2006). A GP is defined as a distribution over the functions X → R such that the distribution over the possible function values on any finite set F ⊂ X is multivariate Gaussian. The prior distribution over the values of a function y(x) is completely specified by a mean vector µ and covariance matrix K:

p(y | µ, K, I) ≜ N(y; µ, K) ≜ (1 / √(det 2πK)) exp( −(1/2) (y − µ)^T K^{-1} (y − µ) ),

where I, the context, includes prior knowledge of both the mean and covariance functions, which generate µ and K respectively. The prior mean function is chosen as appropriate for the problem at hand (often a constant), and the covariance function is chosen to reflect any prior knowledge about the structure of the function of interest, for example periodicity or differentiability. A large number of covariance functions exist, and appropriate covariance functions can be constructed for a wide variety of problems (Rasmussen & Williams, 2006). For this reason, GPs are ideally suited for both linear and nonlinear time-series prediction problems with complex behavior.

We take y to be a potentially dependent dynamic process, such that X contains a time dimension. Note that our approach considers functions of continuous time; we have no need to discretize our observations into time steps. Our GP distribution is specified by various hyperparameters θ_e, e = 1, ..., E, collectively denoted as θ ≜ {θ_e : e = 1, ..., E}. θ includes the mean function µ, as well as parameters required by the covariance function: input and output scales, amplitudes, periods, etc., as needed.

Define I_d as the conjunction of I and the observations available to us within the window, (x_d, y_d). Taking both I_d and θ as given, we are able to analytically derive our predictive equations for the vector of function values y_⋆ at inputs x_⋆:

p(y_⋆ | x_⋆, θ, I_d) = N( y_⋆; m_θ(y_⋆ | I_d), C_θ(y_⋆ | I_d) ),   (1)

where we have:

m_θ(y_⋆ | I_d) = µ_θ(x_⋆) + K_θ(x_⋆, x_d) K_θ(x_d, x_d)^{-1} (y_d − µ_θ(x_d)),
C_θ(y_⋆ | I_d) = K_θ(x_⋆, x_⋆) − K_θ(x_⋆, x_d) K_θ(x_d, x_d)^{-1} K_θ(x_d, x_⋆).

We use the sequential formulation of a GP given by (Osborne et al., 2008) to perform sequential prediction using a moving window. After each new observation, we use rank-one updates to the covariance matrix to efficiently update our predictions in light of the new information received. We efficiently remove the trailing edge of the window using a similar rank-one "downdate." The computational savings made by these choices mean our algorithm can feasibly be run on-line.
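As a concrete illustration of the predictive equations (1), the sketch below computes the GP posterior mean and covariance in Python/NumPy, assuming a constant prior mean and the squared exponential covariance introduced in Section 4. The function names, the jitter term, and the direct Cholesky factorization are our own illustrative choices; the sequential implementation described above instead maintains the factorization through rank-one updates and downdates as observations enter and leave the moving window.

```python
import numpy as np

def k_se(x1, x2, sigma, lam):
    """Squared exponential covariance: sigma^2 exp(-0.5 ((x1 - x2) / lam)^2)."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / lam)**2)

def gp_predict(x_d, y_d, x_star, mu=0.0, sigma=1.0, lam=1.0, jitter=1e-8):
    """Predictive mean and covariance of equation (1) given window data (x_d, y_d)."""
    K_dd = k_se(x_d, x_d, sigma, lam) + jitter * np.eye(len(x_d))
    K_sd = k_se(x_star, x_d, sigma, lam)
    K_ss = k_se(x_star, x_star, sigma, lam)
    L = np.linalg.cholesky(K_dd)                           # K_dd = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_d - mu))
    V = np.linalg.solve(L, K_sd.T)
    m = mu + K_sd @ alpha                                  # predictive mean
    C = K_ss - V.T @ V                                     # predictive covariance
    return m, C

# Example: one-step-ahead prediction from a short moving window.
x_d = np.array([0.0, 0.5, 1.0, 1.5])
y_d = np.sin(x_d)
m, C = gp_predict(x_d, y_d, np.array([2.0]), mu=0.0, sigma=1.0, lam=1.0)
```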
3. Marginalization

Of course, we can rarely be certain about θ a priori. For each hyperparameter we take an independent Gaussian prior distribution (or, if our hyperparameter is restricted to the positive reals, we instead assign a Gaussian distribution to its log) such that

p(θ | I) ≜ ∏_{e=1}^{E} N(θ_e; ν_e, λ_e²).

These hyperparameters must hence be marginalized as

p(y_⋆ | x_⋆, I_d) = ∫ p(y_⋆ | x_⋆, θ, I_d) p(y_d | x_d, θ, I) p(θ | I) dθ / ∫ p(y_d | x_d, θ, I) p(θ | I) dθ.

Although these required integrals are non-analytic, we can efficiently approximate them by use of Bayesian Monte Carlo (BMC) techniques (Rasmussen & Ghahramani, 2003). Following (Osborne et al., 2008), we take a grid of hyperparameter samples {θ_s : s = 1, ..., S} ≜ ×_{e=1}^{E} η_e, where η_e is a column vector of samples for the eth hyperparameter and × is the Cartesian product. We thus have a different mean m_s(y_⋆ | I_d), covariance C_s(y_⋆ | I_d) and likelihood l_s ≜ p(y_d | x_d, θ_s, I) for each. BMC supplies these samples to a GP to perform inference about our integrand for other values of the hyperparameters. In particular, we assign a Gaussian covariance function for this GP:

K(θ, θ′) ≜ ∏_{e=1}^{E} K_e(θ_e, θ′_e),   K_e(θ_e, θ′_e) ≜ N(θ_e; θ′_e, w_e²).

We define

[N_e(η_e, η_e)]_{ij} ≜ N( [ [η_e]_i ; [η_e]_j ] ; [ ν_e ; ν_e ], [ λ_e² + w_e², λ_e² ; λ_e², λ_e² + w_e² ] ),

M ≜ ⊗_{e=1}^{E} K_e(η_e, η_e)^{-1} N_e(η_e, η_e) K_e(η_e, η_e)^{-1},

ρ ≜ M l / ( 1_{S,1}^T M l ),

where 1_{S,1} is a column vector containing only ones, of dimension equal to that of l ≜ {l_s : s = 1, ..., S}, and ⊗ is the Kronecker product. Using these, BMC leads us to

p(y_⋆ | x_⋆, I_d) ≈ ∑_{s=1}^{S} ρ_s N( y_⋆; m_s(y_⋆ | I_d), C_s(y_⋆ | I_d) ).   (2)

BMC can also estimate the posterior distribution for a hyperparameter θ_f by marginalizing over all other hyperparameters θ_{-f}:

p(θ_f | I_d) = ∫ p(y_d | x_d, θ, I) p(θ | I) dθ_{-f} / ∫ p(y_d | x_d, θ, I) p(θ | I) dθ.

With the definitions

[K_{e,f}(θ_f, η_e)]_i ≜ N( [η_e]_i; ν_e, λ_e² + w_e² ),   e ≠ f,
[K_{e,f}(θ_f, η_e)]_i ≜ N( θ_f; ν_f, λ_f² ) N( [η_e]_i; θ_f, w_f² ),   e = f,

m_f^T(θ_f) ≜ ⊗_{e=1}^{E} K_{e,f}(θ_f, η_e)^T K_e(η_e, η_e)^{-1},

n^T ≜ ⊗_{e=1}^{E} N(η_e; ν_e, λ_e² + w_e²)^T K_e(η_e, η_e)^{-1},

we arrive at

p(θ_f | I_d) ≈ m_f^T(θ_f) l / ( n^T l ).   (3)

Joint posteriors for sets of hyperparameters are also readily obtained in a similar manner. Making the definitions

[K̄_{e,f}(η_e)]_i ≜ N( [η_e]_i; ν_e, λ_e² + w_e² ),   e ≠ f,
[K̄_{e,f}(η_e)]_i ≜ ( (λ_f² [η_e]_i + w_f² ν_f) / (λ_f² + w_f²) ) N( [η_e]_i; ν_f, λ_f² + w_f² ),   e = f,

m̄_f^T ≜ ⊗_{e=1}^{E} K̄_{e,f}(η_e)^T K_e(η_e, η_e)^{-1},

the posterior mean is given by

∫ θ_f p(θ_f | I_d) dθ_f ≈ m̄_f^T l / ( n^T l ).   (4)
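To make the quadrature concrete, the sketch below computes the weights ρ of equation (2) for the single-hyperparameter case (E = 1), following the definitions of K_e, N_e, M, and ρ above, with samples η, likelihoods l_s, prior parameters ν and λ, and kernel width w. The function name, the stand-in likelihood values, and the added jitter are our own illustrative choices; this is a minimal sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bmc_weights(eta, l, nu, lam, w):
    """BMC quadrature weights rho for a single hyperparameter (E = 1).

    eta : (S,) hyperparameter samples (the grid eta_e)
    l   : (S,) likelihoods l_s = p(y_d | x_d, theta_s, I)
    nu, lam : mean and scale of the Gaussian prior N(theta; nu, lam^2)
    w   : width of the Gaussian covariance placed over the likelihood
    """
    S = len(eta)
    # K_e(eta, eta): Gaussian covariance between hyperparameter samples.
    K = norm.pdf(eta[:, None], loc=eta[None, :], scale=w)
    # N_e(eta, eta): joint-Gaussian term from integrating the product of two
    # kernels against the prior; covariance [[lam^2 + w^2, lam^2], [lam^2, lam^2 + w^2]].
    cov = np.array([[lam**2 + w**2, lam**2],
                    [lam**2, lam**2 + w**2]])
    N = np.array([[multivariate_normal.pdf([eta[i], eta[j]], mean=[nu, nu], cov=cov)
                   for j in range(S)] for i in range(S)])
    K_inv = np.linalg.inv(K + 1e-10 * np.eye(S))           # jitter for stability
    M = K_inv @ N @ K_inv
    rho = M @ l
    return rho / rho.sum()                                 # divide by 1^T M l

# Example with stand-in likelihood values (for illustration only):
eta = np.linspace(-2.0, 2.0, 7)
l = np.exp(-0.5 * (eta - 0.3)**2)
rho = bmc_weights(eta, l, nu=0.0, lam=1.0, w=0.5)
```

Each weight ρ_s then scales the corresponding component N(y_⋆; m_s(y_⋆ | I_d), C_s(y_⋆ | I_d)) in the Gaussian mixture of equation (2).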
4. Covariance Functions for Prediction in the Presence of Changepoints

We now describe how to construct appropriate covariance functions for functions that experience sudden changes in their characteristics. This section is meant to be expository; the covariance functions we describe are intended as examples rather than an exhaustive list of possibilities. To ease exposition, we assume the input variable of interest x is entirely temporal. If additional features are available, they may be readily incorporated into the derived covariances (Rasmussen & Williams, 2006).

We consider the family of isotropic stationary covariance functions of the form

K(x1, x2; {σ, λ}) ≜ σ² κ( |x1 − x2| / λ ),   (5)

where κ is an appropriately chosen function. The parameters σ and λ represent respectively the characteristic output and input scales of the process. An example isotropic covariance function is the squared exponential covariance, given by

K_SE(x1, x2; {σ, λ}) ≜ σ² exp( −(1/2) ( |x1 − x2| / λ )² ).   (6)

Many other covariances of the form (5) exist to model functions with a wide range of properties, including the rational quadratic, exponential, and Matérn family of covariance functions. Many choices for κ are also available; for example, to model periodic functions, we can use the covariance

K_PE(x1, x2; {σ, λ}) ≜ σ² exp( −(1/2) sin²( π |x1 − x2| / λ ) ),

in which case the output scale serves as the amplitude, and the input scale serves as the period.

We demonstrate how to construct appropriate covariance functions for three types of changepoints: a sudden change in the input scale, a sudden change in the output scale, and a drastic change rendering values after the changepoint independent of the function values before. The last is the simplest, and we consider it first.

Figure 1. Example covariance functions for the modelling of data with changepoints. Each panel (drastic changepoint; changepoint in input scale; changepoint in output scale; changepoint in input and output scales) plots K(0, x) against x, with the squared exponential covariance K_SE shown for comparison.

4.1. A drastic change in covariance

Suppose a function of interest is well-behaved except for a drastic change at the point x_c, which separates the function into two regions with associated covariance functions K1(·, ·; θ1) before x_c and K2(·, ·; θ2) after, where θ1 and θ2 represent the values of any hyperparameters associated with K1 and K2, respectively.

4.2. A sudden change in input scale

Suppose a function of interest is well-behaved except for a drastic change in the input scale at time x_c, which separates the function into two regions with different degrees of long-term dependence. Let λ1 and λ2 represent the input scale of the function before and after the changepoint at x_c, respectively. Suppose we wish to model the function with an isotropic covariance function K of the form (5) that would be appropriate except for the change in input scale. We may model the function using the covariance function K_B defined by

K_B(x1, x2; {σ, λ1, λ2, x_c}) ≜
  K(x1, x2; {σ, λ1}),   x1, x2 < x_c;
  K(x1, x2; {σ, λ2}),   x1, x2 ≥ x_c;
  σ² κ( |x_c − x1| / λ1 + |x_c − x2| / λ2 ),   otherwise.   (8)

If the change is so drastic that the observations before x_c are completely uninformative about the observations after the changepoint; that is, if p( y_{≥x_c} | I