Variational Inference for Diffusion Processes

Cédric Archambeau, University College London, c.archambeau@cs.ucl.ac.uk
Yuan Shen, Aston University, y.shen2@aston.ac.uk
Manfred Opper, Technical University Berlin, opperm@cs.tu-berlin.de
John Shawe-Taylor, University College London, jst@cs.ucl.ac.uk
Dan Cornford, Aston University, d.cornford@aston.ac.uk

Abstract

Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy.

1 Introduction

Continuous-time diffusion processes, described by stochastic differential equations (SDEs), arise naturally in a range of applications from environmental modelling to mathematical finance [13]. In statistics, the problem of Bayesian inference for both the state and the parameters of partially observed, non-linear diffusion processes has been tackled using Markov chain Monte Carlo (MCMC) approaches based on data augmentation [17, 11], Monte Carlo exact simulation methods [6], or Langevin / hybrid Monte Carlo methods [1, 3]. Within the signal processing community, solutions to the so-called Zakai equation [12] based on particle filters [8], a variety of extensions to the Kalman filter/smoother [2, 5], and mean field analysis of the SDE together with moment closure methods [10] have also been proposed.

In this work we develop a novel variational approach to the problem of approximate inference in continuous-time diffusion processes, including a marginal likelihood (evidence) based inference technique for the forcing parameters. In general, joint parameter and state inference using naive methods is complicated by dependencies between the state and the system noise parameters. We work in continuous time, computing distributions over sample paths^1, and discretise only in our posterior approximation, which has advantages over methods based on discretising the SDE directly [3]. The approximate inference approach we describe is more computationally efficient than competing Monte Carlo algorithms and could be further improved in speed by defining a variety of sub-optimal approximations. The approximation is also more accurate than existing Kalman smoothing methods applied to non-linear systems [4]. Ultimately, we are motivated by the critical requirement to estimate parameters within large environmental models, where at present only a small number of Kalman filter/smoother based estimation algorithms have been attempted [2], and there have been no likelihood-based attempts to estimate the system noise forcing parameters.

^1 A sample path is a continuous-time realisation of a stochastic process in a certain time interval. Hence, a sample path is an infinite-dimensional object.

In Sections 2 and 3, we introduce the formalism for a variational treatment of partially observed diffusion processes with measurement noise and we provide the tools to estimate the optimal variational posterior process [4].
Section 4 deals with the estimation of the drift and the system noise parameters, as well as the estimation of the optimal initial conditions. Finally, the approach is validated on a bi-stable nonlinear system in Section 5. In this context, we also discuss how to construct an estimate of the posterior distribution over the parameters based on the variational free energy.

2 Diffusion processes with measurement error

Consider the continuous-time continuous-state stochastic process X = {X_t, t_0 ≤ t ≤ t_f}. We assume this process is a d-dimensional diffusion process. Its time evolution is described by the following SDE (to be interpreted as an Ito stochastic integral):

    dX_t = f(t, X_t) dt + Σ^{1/2} dW_t,    dW_t ~ N(0, dt I).    (1)

The nonlinear vector function f defines the deterministic drift and the positive semi-definite matrix Σ ∈ R^{d×d} is the system noise covariance. The diffusion is modelled by a d-dimensional Wiener process W = {W_t, t_0 ≤ t ≤ t_f} (see e.g. [13] for a formal definition). Eq. (1) defines a process with additive system noise. This might seem restrictive at first sight. However, it can be shown [13, 17, 6] that a range of state-dependent stochastic forcings can be transformed into this form.

It is further assumed that only a small number of discrete-time latent states are observed and that the observations are subject to measurement error. We denote the set of observations at the discrete times {t_n}_{n=1}^N by Y = {y_n}_{n=1}^N and the corresponding latent states by {x_n}_{n=1}^N, with x_n = X_{t=t_n}. For simplicity, the measurement noise is modelled by a zero-mean multivariate Gaussian density, with covariance matrix R ∈ R^{d×d}.

3 Approximate inference for diffusion processes

Our approximate inference scheme builds on [4] and is based on a variational inference approach (see for example [7]). The aim is to minimise the variational free energy, which is defined as follows:

    F(q, Σ) = - ⟨ ln [ p(Y, X | θ, Σ) / q(X | Σ) ] ⟩_q,    X = {X_t, t_0 ≤ t ≤ t_f},    (2)

where q(X | Σ) is an approximate posterior process over sample paths in the interval [t_0, t_f] and θ are the parameters, excluding the stochastic forcing covariance matrix Σ. Hence, this quantity is an upper bound on the negative log-marginal likelihood:

    - ln p(Y | θ, Σ) = F(q, Σ) - KL[ q(X | Σ) || p(X | Y, θ, Σ) ] ≤ F(q, Σ).    (3)

As noted in Appendix A, this bound is finite if the approximate process is another diffusion process with a system noise covariance chosen to be identical to that of the prior process induced by (1).

The standard approach for learning the parameters in the presence of latent variables is to use an EM-type algorithm [9]. However, since the variational distribution is restricted to have the same system noise covariance (see Appendix A) as the true posterior, the EM algorithm would leave this covariance completely unchanged in the M step and cannot be used for learning this crucial parameter. Therefore, we adopt a different approach, which is based on a conjugate gradient method.

3.1 Optimal approximate posterior process

We consider an approximate time-varying linear process with the same diffusion term, that is the same system noise covariance:

    dX_t = g(t, X_t) dt + Σ^{1/2} dW_t,    dW_t ~ N(0, dt I),    (4)

where g(t, x) = -A(t) x + b(t), with A(t) ∈ R^{d×d} and b(t) ∈ R^d. In other words, the approximate posterior process q(X | Σ) is restricted to be a Gaussian process [4]. The Gaussian marginal at time t is defined as follows:

    q(X_t | Σ) = N(X_t | m(t), S(t)),    t_0 ≤ t ≤ t_f,    (5)

where m(t) ∈ R^d and S(t) ∈ R^{d×d} are respectively the marginal mean and the marginal covariance at time t.
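To make the Gaussian marginal (5) of the linear process (4) concrete, the following one-dimensional sketch simulates (4) with constant coefficients by Euler-Maruyama and compares the empirical marginal at the final time with the Ornstein-Uhlenbeck closed form; the coefficient values and all variable names are illustrative assumptions, not part of the method.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, sigma2 = 2.0, 1.0, 0.25          # g(t, x) = -a*x + b, system noise variance
    m0, s0, dt, T = -1.0, 0.3, 0.001, 2.0
    n_paths, n_steps = 20_000, int(T / dt)

    x = rng.normal(m0, np.sqrt(s0), size=n_paths)   # X_{t_0} ~ N(m0, s0)
    for _ in range(n_steps):                        # Euler-Maruyama steps for Eq. (4)
        x += (-a * x + b) * dt + np.sqrt(sigma2 * dt) * rng.normal(size=n_paths)

    # Closed-form marginal mean and variance at time T, to compare with Eq. (5).
    m_T = b / a + (m0 - b / a) * np.exp(-a * T)
    s_T = sigma2 / (2 * a) + (s0 - sigma2 / (2 * a)) * np.exp(-2 * a * T)
    print(x.mean(), m_T)   # should agree up to Monte Carlo error
    print(x.var(), s_T)

The same forward-in-time reasoning, with time-varying A(t) and b(t), leads to the ordinary differential equations (9)-(10) below.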
In the rest of the paper, we denote q(X_t | Σ) by the shorthand notation q_t. For fixed parameters and assuming that there is no observation at the initial time t_0, the optimal approximate posterior process q(X | Σ) is the one minimising the variational free energy, which is given by (see Appendix A)

    F(q, Σ) = ∫_{t_0}^{t_f} E_sde(t) dt + ∫_{t_0}^{t_f} Σ_n E_obs(t) δ(t - t_n) dt + KL[q_0 || p_0].    (6)

The function δ(t) is Dirac's delta function. The energy functions E_sde(t) and E_obs(t) are defined as follows:

    E_sde(t) = (1/2) ⟨ (f(t, X_t) - g(t, X_t))ᵀ Σ⁻¹ (f(t, X_t) - g(t, X_t)) ⟩_{q_t},    (7)

    E_obs(t) = (1/2) ⟨ (Y_t - X_t)ᵀ R⁻¹ (Y_t - X_t) ⟩_{q_t} + (d/2) ln 2π + (1/2) ln |R|,    (8)

where {Y_t, t_0 ≤ t ≤ t_f} is the underlying continuous-time observable process.

3.2 Smoothing algorithm

The variational parameters to optimise in order to find the optimal Gaussian process approximation are A(t), b(t), m(t) and S(t). For a linear SDE with additive system noise, it can be shown that the time evolution of the means and the covariances is described by a set of ordinary differential equations [13, 4]:

    ṁ(t) = -A(t) m(t) + b(t),    (9)
    Ṡ(t) = -A(t) S(t) - S(t) Aᵀ(t) + Σ,    (10)

where the dot denotes the time derivative. These equations provide us with consistency constraints for the marginal means and the marginal covariances along sample paths. To enforce these constraints we formulate the Lagrangian

    L_{λ,Ψ} = F(q, Σ) - ∫_{t_0}^{t_f} λᵀ(t) [ ṁ(t) + A(t) m(t) - b(t) ] dt
              - ∫_{t_0}^{t_f} tr{ Ψ(t) [ Ṡ(t) + 2 A(t) S(t) - Σ ] } dt,    (11)

where λ(t) ∈ R^d and Ψ(t) ∈ R^{d×d} are time-dependent Lagrange multipliers, with Ψ(t) symmetric. First, taking the functional derivatives of L_{λ,Ψ} with respect to A(t) and b(t) results in the following gradient functions:

    ∇_A L_{λ,Ψ}(t) = ∇_A E_sde(t) - λ(t) mᵀ(t) - 2 Ψ(t) S(t),    (12)
    ∇_b L_{λ,Ψ}(t) = ∇_b E_sde(t) + λ(t).    (13)

The gradients ∇_A E_sde(t) and ∇_b E_sde(t) are derived in Appendix B. Secondly, taking the functional derivatives of L_{λ,Ψ} with respect to m(t) and S(t), setting them to zero and rearranging leads to a set of ordinary differential equations, which describe the time evolution of the Lagrange multipliers, along with jump conditions when there are observations:

    λ̇(t) = -∇_m E_sde(t) + Aᵀ(t) λ(t),    λ_n⁺ = λ_n⁻ - ∇_m E_obs(t)|_{t=t_n},    (14)
    Ψ̇(t) = -∇_S E_sde(t) + 2 Ψ(t) A(t),    Ψ_n⁺ = Ψ_n⁻ - ∇_S E_obs(t)|_{t=t_n}.    (15)

The optimal variational functions can be computed by means of a gradient descent technique, such as the conjugate gradient method (see e.g. [16]). The explicit gradients with respect to A(t) and b(t) are given by (12) and (13). Since m(t), S(t), λ(t) and Ψ(t) are dependent on these parameters, one also needs to take the corresponding implicit derivatives into account. However, these implicit gradients vanish if the consistency constraints for the means (9) and the covariances (10), as well as the ones for the Lagrange multipliers (14-15), are satisfied. One way to achieve this is to perform a forward propagation of the means and the covariances, followed by a backward propagation of the Lagrange multipliers, and then to take a gradient step. The resulting algorithm for computing the optimal posterior q(X | Σ) over sample paths is detailed in Algorithm 1; a minimal numerical sketch of the forward and backward sweeps is given first.
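The following one-dimensional Python sketch of the forward and backward sweeps is purely illustrative: it assumes a linear "true" drift f(x) = -gamma*x so that the Gaussian expectations in E_sde(t) are available in a few lines, and all function and variable names are our own choices rather than part of the original formulation.

    import numpy as np

    def sweeps(A, b, m0, s0, y, obs_k, dt, sigma2, gamma, r2):
        """One forward/backward sweep for a 1-D problem.

        A, b       : arrays of length K with the variational functions on the grid
        y, obs_k   : observations and a dict {grid index: observation index}
        sigma2, r2 : system noise variance and measurement noise variance
        gamma      : parameter of the assumed linear true drift f(x) = -gamma*x
        """
        K = len(A)
        m, s = np.empty(K + 1), np.empty(K + 1)
        m[0], s[0] = m0, s0
        # Forward propagation of the marginal means and variances, Eqs. (9)-(10).
        for k in range(K):
            m[k + 1] = m[k] - (A[k] * m[k] - b[k]) * dt
            s[k + 1] = s[k] - (2.0 * A[k] * s[k] - sigma2) * dt
        # Backward propagation of the Lagrange multipliers, Eqs. (14)-(15),
        # with jump conditions at the observation times; lambda(t_f) = psi(t_f) = 0.
        lam, psi = np.zeros(K + 1), np.zeros(K + 1)
        for k in range(K, 0, -1):
            c = A[k - 1] - gamma                       # f - g = c*x - b for this drift
            dE_dm = c * (c * m[k] - b[k - 1]) / sigma2  # grad of <E_sde> w.r.t. m
            dE_ds = c * c / (2.0 * sigma2)              # grad of <E_sde> w.r.t. s
            lam[k - 1] = lam[k] + (dE_dm - A[k - 1] * lam[k]) * dt
            psi[k - 1] = psi[k] + (dE_ds - 2.0 * psi[k] * A[k - 1]) * dt
            if (k - 1) in obs_k:
                n = obs_k[k - 1]
                lam[k - 1] += (m[k - 1] - y[n]) / r2    # grad of <E_obs> w.r.t. m
                psi[k - 1] += 0.5 / r2                  # grad of <E_obs> w.r.t. s
        return m, s, lam, psi

A gradient step on {A_k, b_k} using (12) and (13) would follow each pair of sweeps; for nonlinear drifts the closed-form derivatives of ⟨E_sde⟩ above are replaced by the corresponding Gaussian expectations (Section 5 and Appendix B give the required identities).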
Algorithm 1 Compute the optimal q(X | Σ).
1: input(m_0, S_0, θ, Σ, t_0, t_f, Δt, ...)
2: K ← (t_f - t_0)/Δt
3: initialise {A_k, b_k}_{k≥0}
4: repeat
5:   for k = 0 to K - 1 do
6:     m_{k+1} ← m_k - (A_k m_k - b_k) Δt
7:     S_{k+1} ← S_k - (A_k S_k + S_k A_kᵀ - Σ) Δt
8:   end for {forward propagation}
9:   for k = K to 1 do
10:    λ_{k-1} ← λ_k + (∇_m E_sde|_{t=t_k} - A_kᵀ λ_k) Δt
11:    Ψ_{k-1} ← Ψ_k + (∇_S E_sde|_{t=t_k} - 2 Ψ_k A_k) Δt
12:    if observation at t_{k-1} then
13:      λ_{k-1} ← λ_{k-1} + ∇_m E_obs|_{t=t_{k-1}}
14:      Ψ_{k-1} ← Ψ_{k-1} + ∇_S E_obs|_{t=t_{k-1}}
15:    end if {jumps}
16:  end for {backward sweep (adjoint operation)}
17:  update {A_k, b_k}_{k≥0} using the gradient functions (12) and (13)
18: until minimum of L_{λ,Ψ} is attained {optimisation loop}
19: return {A_k, b_k, m_k, S_k, λ_k, Ψ_k}_{k≥0}

4 Parameter estimation

The parameters to optimise include the parameters of the prior over the initial state, the drift function parameters and the system noise covariance. The estimation of the parameters related to the observable process is not discussed in this work, although it is a straightforward extension. The smoothing algorithm described in the previous section computes the optimal posterior process by providing us with the stationary solution functions A(t) and b(t). Therefore, when subsequently optimising the parameters we only need to compute their explicit derivatives; the implicit ones vanish since ∇_A L_{λ,Ψ} = 0 and ∇_b L_{λ,Ψ} = 0. Before computing the gradients, we integrate (11) by parts to make the boundary conditions explicit. This leads to

    L_{λ,Ψ} = F(q, Σ) - ∫_{t_0}^{t_f} [ λᵀ(t)(A(t) m(t) - b(t)) - λ̇ᵀ(t) m(t) ] dt - λ_fᵀ m_f + λ_0ᵀ m_0
              - ∫_{t_0}^{t_f} tr{ Ψ(t)(2 A(t) S(t) - Σ) - Ψ̇(t) S(t) } dt - tr{Ψ_f S_f} + tr{Ψ_0 S_0}.    (16)

At the final time t_f, there are no consistency constraints, that is λ_f and Ψ_f are both equal to zero.

4.1 Initial state

The initial variational posterior q(x_0) is chosen equal to N(x_0 | m_0, S_0) to ensure that the approximate process is a Gaussian one. Taking the derivatives of (16) with respect to m_0 and S_0 results in the following expressions:

    ∇_{m_0} L_{λ,Ψ} = λ_0 + τ_0⁻¹ (m_0 - µ_0),    ∇_{S_0} L_{λ,Ψ} = Ψ_0 + (1/(2τ_0)) I - (1/2) S_0⁻¹,    (17)

where the prior p(x_0) is assumed to be an isotropic Gaussian density with mean µ_0. Its variance τ_0 is taken sufficiently large to give a broad prior.

4.2 Drift

The gradients for the drift function parameters θ_f only depend on the total energy associated with the SDE. Their general expression is given by

    ∇_{θ_f} L_{λ,Ψ} = ∫_{t_0}^{t_f} ∇_{θ_f} E_sde(t) dt,    (18)

where ∇_{θ_f} E_sde(t) = ⟨ (f(t, X_t) - g(t, X_t))ᵀ Σ⁻¹ ∇_{θ_f} f(t, X_t) ⟩_{q_t}. Note that the observations do play a role in this gradient, as they enter through g(t, X_t) and the expectation with respect to q(X_t | Σ); a simple Monte Carlo sketch of this computation is given later in this section.

4.3 System noise

Estimating the system noise covariance Σ (or volatility) is essential as the system noise, together with the drift function, determines the dynamics. In general, this parameter is difficult to estimate using an MCMC approach because the efficiency is strongly dependent on the discrete approximation of the SDE and most methods break down when the time step Δt gets too small [11, 6]. For example, in a Bayesian MCMC approach which alternates between sampling paths and parameters, the latent paths imputed between observations must have a system noise parameter which is arbitrarily close to its previous value in order to be accepted by a Metropolis sampler. Hence, the algorithm becomes extremely slow. Note that, for the same reason, a naive EM algorithm within our approach breaks down. However, in our method, we can simply compute approximations to the marginal likelihood and its gradient directly. In the next section, we will compare our results to a direct MCMC estimate of the marginal likelihood, which is a time-consuming method.
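For concreteness, the drift gradient (18) can be approximated on the time grid by simple Monte Carlo over the Gaussian marginals q_t = N(m(t), S(t)). The sketch below does this for a hypothetical scalar drift f(t, x; θ) = -θx (so that ∇_θ f = -x); the drift, the sampling scheme and all names are illustrative assumptions.

    import numpy as np

    def drift_gradient(theta, m, s, A, b, dt, sigma2, n_samples=2000, seed=0):
        """Monte Carlo approximation of (18) for the example drift f(t, x; theta) = -theta*x."""
        rng = np.random.default_rng(seed)
        grad = 0.0
        for k in range(len(A)):
            x = rng.normal(m[k], np.sqrt(s[k]), size=n_samples)  # x ~ q_t = N(m, s)
            f = -theta * x                                        # example drift f(t, x; theta)
            g = -A[k] * x + b[k]                                  # linear posterior drift
            grad += np.mean((f - g) / sigma2 * (-x)) * dt         # <(f - g) Sigma^{-1} df/dtheta>_q dt
        return grad

For polynomial drifts, such as the double-well system of Section 5, the expectation can instead be evaluated in closed form from the Gaussian moments of q_t.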
The gradient of (16) with respect to Σ is given by

    ∇_Σ L_{λ,Ψ} = ∫_{t_0}^{t_f} ∇_Σ E_sde(t) dt + ∫_{t_0}^{t_f} Ψ(t) dt,    (19)

where

    ∇_Σ E_sde(t) = -(1/2) Σ⁻¹ ⟨ (f(t, X_t) - g(t, X_t)) (f(t, X_t) - g(t, X_t))ᵀ ⟩_{q_t} Σ⁻¹.    (20)

5 Experimental validation on a bi-stable system

In order to validate the approach, we consider the one-dimensional double-well system:

    f(t, x) = 4x(θ - x²),    θ > 0,

where f(t, x) is the drift function. This dynamical system is highly nonlinear and its stationary distribution is multi-modal. It has two stable states, one at x = -√θ and one at x = +√θ. The system is driven by the system noise, which makes it occasionally flip from one well to the other. In the experiments, we set the drift parameter θ to 1, the system noise standard deviation σ to 0.5 and the measurement error standard deviation r to 0.2. The time step for the variational approximation is set to Δt = 0.01, which is identical to the time resolution used to generate the original sample path. In this setting, the exit time from one of the wells is 4000 time units [15]. In other words, the transition from one well to the other is highly unlikely in the window of roughly 8 time units that we consider and where a transition occurs.

Figure 1(a) compares the variational solution to the outcomes of a hybrid MCMC simulation of the posterior process using the true parameter values. The hybrid MCMC approach was proposed in [1]. At each step of the sampling process, an entire sample path is generated. In order to keep the acceptance of new paths sufficiently high, the basic MCMC algorithm is combined with ideas from molecular dynamics, such that the MCMC sampler moves towards regions of high probability in the state space. An important drawback of MCMC approaches is that it might be extremely difficult to monitor their convergence and that they may require a very large number of samples before actually converging. In particular, over 100,000 sample paths were necessary to reach convergence in the case of the double-well system. The solution provided by the hybrid MCMC is here considered as the baseline solution. One can observe that the variational solution underestimates the uncertainty (smaller error bars). Nevertheless, the time of the transition is correctly located. Convergence of the smoothing algorithm was achieved in approximately 180 conjugate gradient steps, each one involving a forward and a backward sweep.

[Figure 1: (a) Variational solution (solid) compared to the hybrid MCMC solution (dashed), using the true parameter values. The curves denote the mean paths and the shaded regions are the two-standard-deviation noise tubes. (b) Posterior of the system noise variance (diffusion term). The plain curve and the dashed curve are respectively the approximations of the posterior shape based on the variational free energy and MCMC.]

The optimal parameters and the optimal initial conditions for the variational solution are given by

    σ̂ = 0.72,    θ̂_f = 0.85,    m̂_0 = 0.88,    ŝ_0 = 0.45.    (21)

Convergence of the outer optimisation loop is typically reached after less than 10 conjugate gradient steps. While the estimated value for the drift parameter is within 15% of its true value, the deviation of the system noise estimate is worse. Deviations may be explained by the fact that the number of observations is relatively small.
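As an aside on the computations behind these estimates: for the double-well drift, the Gaussian expectation defining E_sde(t) in (7) only involves low-order moments of q_t = N(m(t), S(t)), so no sampling or quadrature is needed. A minimal sketch (assumed names and arbitrary test values), checked against Monte Carlo:

    import numpy as np

    def e_sde_double_well(theta, a, b, m, s, sigma2):
        """Closed-form E_sde(t) for f(x) = 4*x*(theta - x^2) and g(x) = -a*x + b, q_t = N(m, s)."""
        # f - g = -4*x^3 + (4*theta + a)*x - b =: c3*x^3 + c1*x + c0
        c3, c1, c0 = -4.0, 4.0 * theta + a, -b
        # Non-central Gaussian moments E[x^k] for k = 0..6.
        M = [1.0, m, m**2 + s, m**3 + 3*m*s, m**4 + 6*m**2*s + 3*s**2,
             m**5 + 10*m**3*s + 15*m*s**2, m**6 + 15*m**4*s + 45*m**2*s**2 + 15*s**3]
        sq = (c3**2*M[6] + 2*c3*c1*M[4] + 2*c3*c0*M[3]
              + c1**2*M[2] + 2*c1*c0*M[1] + c0**2*M[0])   # E[(f - g)^2]
        return 0.5 * sq / sigma2

    # Quick Monte Carlo check of the closed form.
    rng = np.random.default_rng(1)
    theta, a, b, m, s, sigma2 = 1.0, 2.0, 0.3, -0.8, 0.2, 0.25
    x = rng.normal(m, np.sqrt(s), size=200_000)
    mc = 0.5 * np.mean((4*x*(theta - x**2) - (-a*x + b))**2) / sigma2
    print(e_sde_double_well(theta, a, b, m, s, sigma2), mc)   # should be close

The gradients ∇_m E_sde, ∇_S E_sde and ∇_Σ E_sde required by the smoothing algorithm and by (20) follow from the same moments.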
A further explanation is that we have chosen a sample path which contains a transition between the two wells within a small time interval and which is thus highly untypical with respect to the prior distribution. This fact was experimentally assessed by estimating the parameters on a sample path without transition, in a time window of the same size. In this case, we obtained estimates roughly within 5% of the true parameter values: σ̂ = 0.46 and θ̂_f = 0.92. Finally, it turns out that our estimate for σ is close to the one obtained from the MCMC approach, as discussed next.

Posterior distribution over the parameters

Interestingly, minimising the free energy F_{σ²} for different values of σ² provides us with much more information than a single point estimate for the parameters [14]. Using a suitable prior over σ², we can approximate the posterior over the system noise variance via

    p(σ² | Y) ∝ e^{-F_{σ²}} p(σ²),    (22)

where we take e^{-F_{σ²}} (at its minimum) as an approximation to the marginal likelihood of the observations p(Y | σ²). To illustrate this point, we assume a non-informative Gamma prior p(σ²) = G(α, β), with α = 10⁻³ and β = 10⁻³. A comparison with preliminary MCMC estimates of p(Y | σ²) for θ = 1 and a set of system noise variances indicates that the shape of our approximation is a reasonable indicator of the shape of the posterior. Figure 1(b) shows that at least the mean and the variance of the density come out fairly well.

6 Conclusion

We have presented a variational approach to the approximate inference of stochastic differential equations from a finite set of noisy observations. So far, we have tested the method on a one-dimensional bi-stable system only. Comparison with a Monte Carlo approach suggests that our method can reproduce the posterior mean fairly well, but underestimates the variance in the region of the transition. Parameter estimates also agree well with the MC predictions. In the future, we will extend our method in various directions. Although our approach is based on a Gaussian approximation of the posterior process, we expect that one can improve on it and obtain non-Gaussian predictions, at least for various marginal posterior distributions, including that of the latent variable X_t at a fixed time t. This should be possible by generalising our method for the computation of a non-Gaussian shaped probability density for the system noise parameter using the free energy. An important extension of our method will be to systems with many degrees of freedom. We hope that the possibility of using simpler suboptimal parametrisations of the approximating Gaussian process will allow us to obtain a tractable inference method that scales well to higher dimensions.

Acknowledgments

This work has been funded by the EPSRC as part of the Variational Inference for Stochastic Dynamic Environmental Models (VISDEM) project (EP/C005848/1).

A The Kullback-Leibler divergence interpreted as a path integral

In this section, we show that the Kullback-Leibler divergence between the posterior process p(X | Y, θ, Σ) and its approximation q(X | Σ) can be interpreted as a path integral over time. It is an average over all possible realisations, called sample paths, of the continuous-time (i.e., infinite-dimensional) random variable described by the SDE in the time interval under consideration. Consider the Euler-Maruyama discrete approximation (see for example [13]) of the SDE (1) and of its linear approximation (4):

    Δx_k = f_k Δt + Σ^{1/2} Δw_k,    (23)
    Δx_k = g_k Δt + Σ^{1/2} Δw_k,    (24)

where Δx_k ≡ x_{k+1} - x_k and Δw_k ~ N(0, Δt I).
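As a quick numerical check on this discretisation, the per-step Kullback-Leibler divergence between the two Gaussian transition kernels implied by (23) and (24) can be computed directly and compared with the closed form used in the derivation below; all values in the sketch are arbitrary illustrative choices.

    import numpy as np

    def gauss_kl(mu0, mu1, C):
        """KL[N(mu0, C) || N(mu1, C)] for a shared covariance C."""
        d = mu0 - mu1
        return 0.5 * d @ np.linalg.solve(C, d)     # trace and log-det terms cancel

    dt = 0.01
    Sigma = np.array([[0.25, 0.05], [0.05, 0.40]])
    x = np.array([0.3, -0.1])
    f = np.array([1.2, -0.7])                      # f(t_k, x_k), illustrative value
    g = np.array([0.8, -0.2])                      # g(t_k, x_k), illustrative value

    kl_step = gauss_kl(x + g * dt, x + f * dt, Sigma * dt)
    closed_form = 0.5 * (f - g) @ np.linalg.solve(Sigma, f - g) * dt
    print(kl_step, closed_form)                    # identical up to rounding

The cancellation of the trace and log-determinant terms in this per-step KL is exactly the cancellation of normalising constants referred to below.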
The vectors f_k and g_k are shorthand notations for f(t_k, x_k) and g(t_k, x_k). Hence, the joint distributions of discrete sample paths {x_k}_{k≥0} for the true process and its approximation follow from the Markov property:

    p(x_0, ..., x_K | Σ) = p(x_0) Π_{k≥0} N(x_{k+1} | x_k + f_k Δt, Σ Δt),    (25)
    q(x_0, ..., x_K | Σ) = q(x_0) Π_{k≥0} N(x_{k+1} | x_k + g_k Δt, Σ Δt),    (26)

where p(x_0) is the prior on the initial state x_0 and q(x_0) is assumed to be Gaussian. Note that we do not restrict the variational posterior to factorise over the latent states. The Kullback-Leibler divergence between the two discretised prior processes is given by

    KL[q || p] = KL[q(x_0) || p(x_0)] - Σ_{k≥0} ∫ q(x_k) ⟨ ln [ p(x_{k+1} | x_k) / q(x_{k+1} | x_k) ] ⟩_{q(x_{k+1}|x_k)} dx_k
               = KL[q(x_0) || p(x_0)] + (1/2) Σ_{k≥0} ⟨ (f_k - g_k)ᵀ Σ⁻¹ (f_k - g_k) ⟩_{q(x_k)} Δt,

where we omitted the conditional dependency on Σ for simplicity. The second term on the right-hand side is a sum in Δt. As a result, taking the limit Δt → 0 leads to a proper Riemann integral, which defines an integral over the average sample path:

    KL[ q(X | Σ) || p(X | θ, Σ) ] = KL[q_0 || p_0] + (1/2) ∫_{t_0}^{t_f} ⟨ (f_t - g_t)ᵀ Σ⁻¹ (f_t - g_t) ⟩_{q_t} dt,    (27)

where X = {X_t, t_0 ≤ t ≤ t_f} denotes the stochastic process in the interval [t_0, t_f]. The distribution q_t = q(X_t | Σ) is the marginal at time t for a given system noise covariance Σ. It is important to realise that the KL between the induced prior process and its approximation is finite because the system noise covariances are chosen to be identical. If this was not the case, the normalising constants of p(x_{k+1} | x_k) and q(x_{k+1} | x_k) would not cancel. This would result in KL → ∞ when Δt → 0.

If we assume that the observations are i.i.d., it also follows that

    F(q, Σ) = - Σ_n ⟨ ln p(y_n | x_n) ⟩_{q(x_n)} + KL[ q(X | Σ) || p(X | θ, Σ) ].

Clearly, minimising this expression with respect to the variational parameters for a given system noise Σ and for a fixed parameter vector θ is equivalent to minimising the KL between the variational posterior q(X | Σ) and the true posterior p(X | Y, θ, Σ), since the normalising constant is independent of the sample paths.

B The gradient functions

The general expressions for the gradients of E_sde(t) with respect to the variational functions are given by

    ∇_A E_sde(t) = Σ⁻¹ ( ⟨∇_x f(t, X_t)⟩_{q_t} + A(t) ) S(t) - ∇_b E_sde(t) mᵀ(t),    (28)
    ∇_b E_sde(t) = -Σ⁻¹ ( ⟨f(t, X_t)⟩_{q_t} + A(t) m(t) - b(t) ),    (29)

where ⟨∇_x f(t, X_t)⟩_{q_t} S(t) = ⟨ f(t, X_t) (X_t - m(t))ᵀ ⟩_{q_t} is invoked in order to obtain (28).

References

[1] F. J. Alexander, G. L. Eyink, and J. M. Restrepo. Accelerated Monte Carlo for optimal estimation of time series. Journal of Statistical Physics, 119:1331-1345, 2005.
[2] J. D. Annan, J. C. Hargreaves, N. R. Edwards, and R. Marsh. Parameter estimation in an intermediate complexity earth system model using an ensemble Kalman filter. Ocean Modelling, 8:135-154, 2005.
[3] A. Apte, M. Hairer, A. Stuart, and J. Voss. Sampling the posterior: An approach to non-Gaussian data assimilation. Physica D, 230:50-64, 2007.
[4] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximation of stochastic differential equations. Journal of Machine Learning Research: Workshop and Conference Proceedings, 1:1-16, 2007.
[5] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515-2540, 2006.
[6] A. Beskos, O. Papaspiliopoulos, G. Roberts, and P. Fearnhead. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society B, 68(3):333-382, 2006.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[8] D. Crisan and T. Lyons. A particle approximation of the solution of the Kushner-Stratonovitch equation. Probability Theory and Related Fields, 115(4):549-578, 1999.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-38, 1977.
[10] G. L. Eyink, J. L. Restrepo, and F. J. Alexander. A mean field approximation in data assimilation for nonlinear dynamics. Physica D, 194:347-368, 2004.
[11] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics and Data Analysis, 2007. Accepted.
[12] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[13] Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer, Berlin, 1999.
[14] H. Lappalainen and J. W. Miskin. Ensemble learning. In M. Girolami, editor, Advances in Independent Component Analysis, pages 76-92. Springer-Verlag, 2000.
[15] R. N. Miller, M. Ghil, and F. Gauthiez. Advanced data assimilation in strongly nonlinear dynamical systems. Journal of the Atmospheric Sciences, 51:1037-1056, 1994.
[16] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2000.
[17] G. Roberts and O. Stramer. On inference for partially observed non-linear diffusion models using the Metropolis-Hastings algorithm. Biometrika, 88:603-621, 2001.