Learning the Linear Dynamical System with ASOS

James Martens
University of Toronto, Ontario, M5S 1A1, Canada
JMARTENS@CS.TORONTO.EDU

Abstract

We develop a new algorithm, based on EM, for learning the Linear Dynamical System model. Called the method of Approximated Second-Order Statistics (ASOS), our approach achieves dramatically superior computational performance over standard EM through its use of approximations, which we justify with both intuitive explanations and rigorous convergence results. In particular, after an inexpensive precomputation phase, the iterations of ASOS can be performed in time independent of the length of the training dataset.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

1.1. The LDS Model

The time-invariant discrete Linear Dynamical System (LDS) is a classical and widely used model of real-valued multivariate time-series data $\{y_t \in \mathbb{R}^{N_y}\}_{t=1}^T$. Hidden states $\{x_t \in \mathbb{R}^{N_x}\}_{t=1}^T$ are generated via the time-evolution matrix $A \in \mathbb{R}^{N_x \times N_x}$ as:

    $x_{t+1} = A x_t + \epsilon_t$    (1)

where $\{\epsilon_t\}_{t=1}^T$ are i.i.d. multivariate normal with mean 0 and covariance matrix $Q$. Observations $y_t$ are generated from $x_t$ via the matrix $C \in \mathbb{R}^{N_y \times N_x}$ according to:

    $y_t = C x_t + \delta_t$    (2)

where $\{\delta_t\}_{t=1}^T$ are also i.i.d. multivariate normal with mean 0 and covariance matrix $R$. The initial state ($x_1$) distribution is multivariate normal with mean $\pi_1$ and covariance matrix $V_1$.

The LDS is arguably the most commonly used time-series model for real-world engineering and financial applications. This is due to its relative simplicity, its mathematically predictable behavior, the existence of many physical systems that are known to be accurately modeled by it, and the fact that exact inference and prediction within the model can be done efficiently.

1.2. Learning the LDS

Learning the parameters of the LDS, sometimes called "system identification", is a well-studied problem. The available algorithms fall into three broad categories: the Prediction Error Method (PEM), Subspace Identification (4SID), and Expectation Maximization (EM).

In the PEM approach (e.g. Ljung, 2002), a 1-step prediction-error objective is minimized via gradient-based optimization methods. Typical implementations either use gradient descent, and thus require many iterations to converge, or use 2nd-order optimization methods, which become impractical for large models. EM, a broadly applied algorithm for maximum likelihood parameter estimation in hidden-variable models, can also be applied to the LDS. The maximum likelihood objective can be seen as a special case of the prediction error objective (Ljung, 2002), but EM takes a different approach from PEM in optimizing this objective, making it more practical for large models. Both PEM and EM are iterative optimization algorithms in which each iteration requires a pass over the entire dataset. Since very many iterations can be required, neither algorithm scales well to very long time-series.

In the 4SID approach (Overschee & Moor, 1991), the LDS equations are re-written as large block matrix formulae, which are used to produce an estimate of the hidden state sequence via matrix projections (this boils down to computing a large singular value decomposition); this estimate is then used to estimate the parameters. The block formulae generate predictions for future data-points using only the $i$ previous ones, where $i$ is a meta-parameter which controls the quality of the solution at the cost of computational performance. By contrast, statistically optimal state estimators (such as those used in the E-step of EM) use the entire time-series, including both past and future observations. 4SID is not an iterative optimization algorithm like PEM or EM, and thus often tends to be faster than these methods while avoiding the problem of bad local minima. However, the solutions it produces, while of high quality, tend not to be locally optimal for any particular objective function (such as the log-likelihood). The approach thus often advocated is to run EM initialized with the solution produced by 4SID, thereby avoiding bad local minima while also achieving statistical optimality (Smith & Robinson, 2000). One disadvantage of 4SID is that standard implementations of it have considerable space requirements that prevent them from scaling nicely with the length of the training time-series.

1.3. Our Contribution

This paper develops a new method for dramatically increasing the efficiency of EM via an approximation scheme which we call the method of Approximated Second-Order Statistics (ASOS). The ASOS scheme approximates the E-step so that it can be computed in time independent of $T$, the length of the training time-series. This allows EM to be practical for time-series of nearly unbounded size, making it much more useful as an optimization tool to be used in conjunction with 4SID, or even as a replacement in cases where excessively long time-series make 4SID infeasible. Since the 4SID and EM algorithms have already been analyzed and compared (Smith & Robinson, 2000; Smith et al., 1999), our goal in this paper will instead be to compare "ASOS-EM" with standard EM and show that the approximations have only a minimal effect on the solution quality while providing a huge computational performance gain, in both a theoretical and practical sense.
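To make the generative process of equations (1) and (2) concrete, the following minimal sketch samples a trajectory from an LDS. The function and variable names here are our own illustrative choices, not anything from the paper:

```python
import numpy as np

def simulate_lds(A, C, Q, R, pi1, V1, T, seed=0):
    """Sample hidden states {x_t} and observations {y_t} from the LDS
    defined by equations (1)-(2)."""
    rng = np.random.default_rng(seed)
    Nx, Ny = A.shape[0], C.shape[0]
    xs = np.empty((T, Nx))
    ys = np.empty((T, Ny))
    x = rng.multivariate_normal(pi1, V1)                          # x_1 ~ N(pi_1, V_1)
    for t in range(T):
        xs[t] = x
        ys[t] = C @ x + rng.multivariate_normal(np.zeros(Ny), R)  # eq. (2)
        x = A @ x + rng.multivariate_normal(np.zeros(Nx), Q)      # eq. (1)
    return xs, ys
```

With a stable $A$ (spectral radius below 1), trajectories drawn this way are stationary up to the transient induced by the initial-state distribution.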
2. EM for LDS learning

The objective function maximized during maximum likelihood learning is the log probability of the observation sequence $y$ given the model parameters $\theta$ (also known as the log likelihood function). This can be obtained from the joint probability by integrating out the hidden states: $\log p(y|\theta) = \log \int_x p(x, y|\theta)$. While the gradient and Hessian of this function are difficult to compute (indeed, implementations of PEM often resort to finite differences), it is relatively simple to compute the log joint probability and its expectation given $y$ for a particular setting of the parameters. The EM algorithm, which was first applied to LDS parameter learning by Shumway & Stoffer (1982), can indirectly optimize the log-likelihood by iteratively maximizing this latter quantity, which we denote $Q_n(\theta)$:

    $Q_n(\theta) \equiv \mathrm{E}_n[\,\log p(x, y|\theta) \,|\, y\,] = \int p(x|y, \theta_n) \log p(x, y|\theta)\, dx$

The EM algorithm iteratively alternates between two phases, called the "E-step" and the "M-step". For a given estimate $\theta_n$ of the parameters, the E-step computes the expectations which appear in the expression for $Q_n(\theta)$, allowing it to be easily evaluated and optimized with respect to $\theta$ in the M-step. The M-step then computes the new parameter estimate as $\theta_{n+1} = \arg\max_\theta Q_n(\theta)$.

2.1. M-step for the LDS

As a consequence of the linear and Gaussian nature of the LDS model, the function $Q_n(\theta)$ can be written in terms of $\theta$ and statistics that are first and second order in $x$ and $y$ (and are highly non-linear in $\theta_n$). With these statistics computed, optimizing $Q_n(\theta)$ with respect to $\theta$ reduces to a straightforward application of matrix calculus, and is similar to linear regression with an unknown covariance. For a full specification and derivation of the M-step see Ghahramani & Hinton (1996). For the sake of brevity we will use the following standard notation for the remainder of this paper (where $y^k$ denotes $y_1, \ldots, y_k$):

    $x_t^k \equiv \mathrm{E}_n[\, x_t \,|\, y^k \,]$
    $V_{t,s}^k \equiv \mathrm{Cov}_n[\, x_t, x_s \,|\, y^k \,]$
    $\tilde{y}_t \equiv y_t - \mathrm{E}_n[\, y_t \,|\, y^{t-1} \,]$
    $S_t \equiv \mathrm{Cov}_n[\, \tilde{y}_t \,|\, y^{t-1} \,]$

Noting that $\mathrm{E}_n[\, x_t x_s^\top \,|\, y^k \,] = x_t^k (x_s^k)^\top + V_{t,s}^k$, and using the notation $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$ (where $v^\top$ denotes the transpose of $v$), the complete list of statistics required to compute the M-step may be written as:

    $(y, x^T)_0$,  $(x^T, x^T)_0 + \sum_{t=1}^{T} V_{t,t}^T$,  $(x^T, x^T)_1 + \sum_{t=1}^{T-1} V_{t+1,t}^T$

These expressions contain both sums of covariances and sums of products of means. For the remainder of this report we will refer to the latter sums as the "M-statistics", which are particular examples of "2nd-order statistics".

2.2. E-Step for the LDS (Kalman recursions)

The Kalman recursions (see Ghahramani & Hinton, 1996) are a set of recursive relations that define a computational procedure for exact inference of the distribution over the hidden states $x$ in the LDS model. The standard approach for computing the M-statistics is to apply this procedure to find $x_t^T$ for each value of $t$ in succession, and then perform the required sums. Doing this has time complexity $O(N_x^3 T)$, which is the reason that the EM algorithm scales poorly to long time-series.

3. The ASOS Approach

The key observation that motivates the ASOS approach is that, for the purposes of the M-step, we don't actually care about the individual moments for each $x_t$, but rather certain sums over products of these (the M-statistics), along with sums over covariance matrices. Thus we can focus on the M-statistics directly and come up with approximations that will make them more efficiently computable, without having to concern ourselves with computing the individual $x_t^T$.

3.1. Overview

To this end we will derive a set of "2nd-order recursions" from the Kalman recursions, which define a recursive scheme to compute various 2nd-order statistics over $y_t$, $x_t^t$, and $x_t^T$, culminating in the computation of the M-statistics.
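For concreteness, the 2nd-order statistic $(a, b)_k$ defined above can be computed directly as a sum of outer products. The helper below (names are illustrative only) does exactly that, with the time-series stored as arrays of row vectors:

```python
import numpy as np

def second_order_stat(a, b, k):
    """(a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t^T, where a and b are
    (T, dim) arrays of row vectors (0-indexed in code)."""
    T = len(a)
    return sum(np.outer(a[t + k], b[t]) for t in range(T - k))
```

Computed naively like this for every lag, each statistic costs $O(T)$; the point of ASOS is to avoid ever forming these sums over the full series during the EM iterations.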
Whereas the Kalman recursions relate 1st-order moments across different time-steps, these 2nd-order recursions will relate 2nd-order statistics across different "time-lags", where by "time-lag" we mean the value of $k$ in $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$, i.e. the difference in the temporal indices as they appear in the sum. The number of distinct time-lags is equal to $T$, the number of time-steps (statistics with time-lag larger than $T$ are equal to the 0 matrix), and so solving the recursions exactly would entail just as much or more computation than simply running the original Kalman recursions.

Fortunately it turns out that, at the cost of introducing some fairly liberal approximations which have favorable statistical and asymptotic properties, we can evaluate the 2nd-order recursions much more efficiently than the Kalman recursions. In particular, by approximating statistics of "large" time-lag by carefully chosen unbiased estimators, we can derive a compact system of linear equations that can be solved very efficiently. The size of this system, and thus the cost of solving it, turns out to be a function of the cut-off time-lag "$k_{\lim}$" at which we choose to approximate the 2nd-order statistics. The resulting algorithm only depends on the values of $(y, y)_k \equiv \sum_{t=1}^{T-k} y_{t+k} y_t^\top$ for $0 \le k \le k_{\lim}$. And while computing these clearly requires time proportional to $T$, they only need to be pre-computed once, before the EM iterations begin.

To realize this approach we need to simplify the Kalman recursions by using a tool from LDS theory known as "steady-state", which we discuss next.

3.2. The steady state assumption

The Kalman recursions, in addition to computing the conditional means $x_t^T$ for each state, also compute the covariance matrices (e.g. $V_{t,t}^T$) between hidden state vectors, along with the filtering and smoothing matrices $K_t$ and $J_t$ (for the precise definitions of these, we defer again to Ghahramani & Hinton, 1996).
Notably, the recursions for these quantities do not involve the actual time-series data. Moreover, a well-known result is that under certain control-theoretic conditions on the model parameters these matrices rapidly approach constant matrices as $t$ grows, a phenomenon known as "steady state" (e.g. Goodwin & Sin, 1984). In particular, as $t$ grows $K_t$ converges to a constant matrix which we denote $K$ (without a subscript). And similarly, $V_{t,t}^T$, $V_{t,t-1}^T$ and $J_t$ converge to $\Sigma_0$, $\Sigma_1$ and $J$ respectively as $\min(t, T - t)$ grows. Computing these steady-state matrices reduces to solving discrete algebraic Riccati equations and simpler Lyapunov equations, for which there are efficient algorithms.

For simplicity we will assume that the steady-state condition applies over the entire time-series. Later we will see how this strong assumption can be replaced with a more realistic one that will approximate the truth to an arbitrary precision. Under steady state the Kalman recursions for the state means can be written compactly as:

    $\hat{x}_t = H \hat{x}_{t-1} + K y_t$
    $x_t^T = J x_{t+1}^T + P \hat{x}_t$

where we have defined $\hat{x}_t \equiv x_t^t$, $H \equiv A - KCA$ and $P \equiv I - JA$. The usefulness of the switch in notation from $x_t^t$ to $\hat{x}_t$ is that it allows us to use our special notation for 2nd-order statistics $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$ with the vector-list $\hat{x}_t$ as an argument, e.g. $a = \hat{x}$.

In addition to simplifying the Kalman recursions, assuming steady state makes it much easier to compute the covariance-matrix sums required by the M-step; we just multiply the corresponding steady-state value by $T$ or $T - 1$, as the case may be. Thus to complete the E-step it remains only to compute the M-statistics.

3.3. Recursions for the 2nd-order statistics

In this section we will give the general approach for deriving the 2nd-order recursions and then provide the complete list.
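As an illustration of how the steady-state quantities can be obtained, the sketch below simply iterates the Riccati update for the predicted state covariance until it stops changing, then forms $K$ and $H = A - KCA$. This is a naive fixed-point iteration, not one of the dedicated Riccati solvers mentioned above, and the function name is our own:

```python
import numpy as np

def steady_state_filter(A, C, Q, R, iters=10000, tol=1e-12):
    """Iterate the Riccati update for the one-step predicted state
    covariance P until convergence; at the fixed point,
    K = P C^T (C P C^T + R)^{-1} is the steady-state filter gain."""
    P = Q.copy()
    for _ in range(iters):
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
        P_new = A @ (P - K @ C @ P) @ A.T + Q
        if np.max(np.abs(P_new - P)) < tol:
            P = P_new
            break
        P = P_new
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    return K, A - K @ C @ A   # gain K and filter matrix H = A - KCA
```

For a stable, observable system the returned $H$ has spectral radius below 1, which is what makes the steady-state filter recursion $\hat{x}_t = H \hat{x}_{t-1} + K y_t$ well-behaved.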
To find the equation that computes the statistic $(a, b)_k$ we right-multiply the Kalman recursion for $a_{t+k}$ (or, if $a = y$, just the trivial equation $y_{t+k} = y_{t+k}$) by the transpose of the one for $b_t$, and then sum both sides from $t = 1$ to $T - k$. As a simple example, suppose we wish to find the recursion for $(\hat{x}, y)_k$. We simply right-multiply the simplified Kalman recursion for $\hat{x}_{t+k}$ by $y_t^\top$, sum both sides over $t$, and then re-write everything using our special notation for 2nd-order statistics:

    $\sum_{t=1}^{T-k} \hat{x}_{t+k} y_t^\top = \sum_{t=1}^{T-k} (H \hat{x}_{t+k-1} + K y_{t+k}) y_t^\top = H \sum_{t=1}^{T-k} \hat{x}_{t+k-1} y_t^\top + K \sum_{t=1}^{T-k} y_{t+k} y_t^\top$

    $\Rightarrow \quad (\hat{x}, y)_k = H((\hat{x}, y)_{k-1} - \hat{x}_T y_{T-k+1}^\top) + K (y, y)_k$

Complicating this idea somewhat is the fact that the Kalman recursion for $\hat{x}_t$ is not defined for the special case $t = 1$, and thus we must add in an additional nuisance term $a_{1+k} b_1^\top$ to compensate. Similarly, we must sometimes subtract an additional term from a statistic before using it in the equation for a statistic of a higher time-lag, since the latter is summed over a smaller range ($1 \ldots T - k$ instead of $1 \ldots T - k + 1$). The complete list of 2nd-order recursions is:

    $(y, \hat{x})_k = (y, \hat{x})_{k+1} H^\top + ((y, y)_k - y_{1+k} y_1^\top) K^\top + y_{1+k} \hat{x}_1^\top$
    $(\hat{x}, y)_k = H((\hat{x}, y)_{k-1} - \hat{x}_T y_{T-k+1}^\top) + K (y, y)_k$
    $(\hat{x}, \hat{x})_k = (\hat{x}, \hat{x})_{k+1} H^\top + ((\hat{x}, y)_k - \hat{x}_{1+k} y_1^\top) K^\top + \hat{x}_{1+k} \hat{x}_1^\top$
    $(\hat{x}, \hat{x})_k = H((\hat{x}, \hat{x})_{k-1} - \hat{x}_T \hat{x}_{T-k+1}^\top) + K (y, \hat{x})_k$
    $(x^T, y)_k = J (x^T, y)_{k+1} + P((\hat{x}, y)_k - \hat{x}_T y_{T-k}^\top) + x_T^T y_{T-k}^\top$
    $(x^T, \hat{x})_k = J (x^T, \hat{x})_{k+1} + P((\hat{x}, \hat{x})_k - \hat{x}_T \hat{x}_{T-k}^\top) + x_T^T \hat{x}_{T-k}^\top$
    $(x^T, x^T)_k = ((x^T, x^T)_{k-1} - x_k^T (x_1^T)^\top) J^\top + (x^T, \hat{x})_k P^\top$
    $(x^T, x^T)_k = J (x^T, x^T)_{k+1} + P((\hat{x}, x^T)_k - \hat{x}_T (x_{T-k}^T)^\top) + x_T^T (x_{T-k}^T)^\top$

3.4. The ASOS Approximations

Examining the 2nd-order recursion for $(y, \hat{x})_k$ we see that it depends on $(y, \hat{x})_{k+1}$. If we had access to the exact value of $(y, \hat{x})_{k+1}$ we could use the recursion to compute $(y, \hat{x})_k$ exactly. But since we don't, we will have to rely on an approximation. In particular we will approximate $(y, \hat{x})_k$ for some sufficiently large value of $k$, which we will denote $k_{\lim}$, and then use the recursion to recursively compute approximate versions of each $(y, \hat{x})_k$ from $k = k_{\lim}$ down to $k = 0$.

There are several reasons why we might expect this to be a reasonable thing to do. Firstly, for large time-lags these statistics express relationships between variables that are far apart in time in the LDS model, and thus likely less important than relationships between variables that are close. Later we will show how this intuition can be made formal by quantifying the approximation error and identifying sufficient conditions under which it is negligible. Another reason that this approximation is appealing is that it is reminiscent of (although not equivalent to) one of the approximations implicitly made by the 4SID algorithm, namely that state vectors at each time-step are estimated via a non-steady-state Kalman filter starting $i$ time-steps in the past and initialized from 0, where $i$ is 4SID's "block-size" meta-parameter. Finally, by using estimators that are unbiased under the model, we expect that the quality of the approximation will become better as the model parameters converge to a setting that fits the data and/or the amount of data increases. In a later section we will give a formal result which quantifies the relative error of the approximations and establishes that it goes to zero as the amount of data grows, under the condition that the data is generated from the model.

The approximation we will use for $(y, \hat{x})_{k_{\lim}+1}$ is $CA((\hat{x}, \hat{x})_{k_{\lim}} - \hat{x}_T \hat{x}_{T-k_{\lim}}^\top)$. This seemingly arbitrary choice is justified by the following result:

Claim 1. If the data is generated from the model's distribution then this approximation is unbiased.

Proof. For any $k > 1$ we have:

    $\mathrm{E}_n[\, y_{t+1} \hat{x}_{t-k}^\top \,|\, y^t \,] = \mathrm{E}_n[\, y_{t+1} \,|\, y^t \,]\, \hat{x}_{t-k}^\top = C x_{t+1}^t \hat{x}_{t-k}^\top = CA \hat{x}_t \hat{x}_{t-k}^\top$

Then taking the expectation of both sides and using the law of iterated expectations we get:

    $\mathrm{E}_n[\, y_{t+1} \hat{x}_{t-k}^\top \,] = \mathrm{E}_n[\, CA \hat{x}_t \hat{x}_{t-k}^\top \,]$

Taking $k = k_{\lim}$ and summing both sides over $t$ we have:

    $\mathrm{E}_n[\, (y, \hat{x})_{k_{\lim}+1} \,] = \mathrm{E}_n[\, CA((\hat{x}, \hat{x})_{k_{\lim}} - \hat{x}_T \hat{x}_{T-k_{\lim}}^\top) \,]$

which is the claim.

In order to completely evaluate the 2nd-order recursions we will also need similar approximations to start the recursions for $(\hat{x}, \hat{x})_k$, $(x^T, y)_k$, $(x^T, \hat{x})_k$ and $(x^T, x^T)_k$ (note that the recursion for $(\hat{x}, y)_k$ can be started from $(\hat{x}, y)_0 = (y, \hat{x})_0^\top$). The following two approximations can be shown to be unbiased using a proof similar to the one given above:

    $(x^T, \hat{x})_{k_{\lim}} \approx (\hat{x}, \hat{x})_{k_{\lim}}$
    $(x^T, y)_{k_{\lim}} \approx (\hat{x}, y)_{k_{\lim}}$

Together with the approximation for $(y, \hat{x})_{k_{\lim}+1}$ we will call these the "ASOS approximations". Unfortunately there are no obvious candidates for unbiased approximations of either $(\hat{x}, \hat{x})_{k_{\lim}}$ or $(x^T, x^T)_{k_{\lim}}$ that could be used to start the corresponding 2nd-order recursions. In the next section we will show how this problem can be circumvented by deriving two additional equations from the Kalman recursions that will sufficiently constrain the solution.
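As a self-contained numerical check of the example recursion for $(\hat{x}, y)_k$, we can take arbitrary matrices $H$ and $K$ and arbitrary data $y_t$ as stand-ins (nothing here is fitted to a model), run the steady-state filter recursion from an arbitrary start, and confirm that the identity holds exactly for $k \ge 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, Nx, Ny = 50, 3, 2
H = 0.5 * rng.standard_normal((Nx, Nx))   # stand-in for A - KCA
K = rng.standard_normal((Nx, Ny))         # stand-in for the steady-state gain

y = rng.standard_normal((T, Ny))
x = np.zeros((T, Nx))                     # x[t] holds the filtered mean at step t (0-indexed)
x[0] = K @ y[0]                           # arbitrary start; the identity holds regardless
for t in range(1, T):
    x[t] = H @ x[t - 1] + K @ y[t]        # steady-state filter recursion

def stat(a, b, k):
    """(a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t^T (0-indexed in code)."""
    return sum(np.outer(a[t + k], b[t]) for t in range(len(a) - k))

k = 4
lhs = stat(x, y, k)
rhs = H @ (stat(x, y, k - 1) - np.outer(x[-1], y[-k])) + K @ stat(y, y, k)
```

The subtracted term np.outer(x[-1], y[-k]) is exactly the correction $\hat{x}_T y_{T-k+1}^\top$ that accounts for $(\hat{x}, y)_{k-1}$ being summed over one extra index.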
Finally, we need to approximate the "first-order statistics" $\hat{x}_t$ and $x_t^T$ for the first and last $k_{\lim} + 1$ time-steps, since these appear as "nuisance terms" in the 2nd-order equations. This can be done easily by running the steady-state Kalman recursions on the first and last "$k_{\mathrm{lag}}$" time-steps, where $k_{\mathrm{lag}}$ is some constant $\ge k_{\lim} + 1$. For the first $k_{\mathrm{lag}}$ time-steps the Kalman recursions can be initialized from $\pi_1$. In our experiments we used $k_{\mathrm{lag}} = 2 k_{\lim}$.

3.5. Solving the approximated system

Solving the 2nd-order recursions subject to the ASOS approximations is a non-trivial task. One key difficulty is that we have no way of starting either the recursions for $(\hat{x}, \hat{x})_k$ or $(y, \hat{x})_k$ without first obtaining some kind of approximation for $(\hat{x}, \hat{x})_{k_{\lim}}$. In this section we will show how this difficulty can be overcome, and derive a complete method for solving the 2nd-order recursions subject to the ASOS approximations. We will assume that the 1st-order nuisance terms have already been approximated.

To overcome the aforementioned difficulty, we can use the fact that there are two 2nd-order recursions for $(\hat{x}, \hat{x})_k$, one which "increments" $k$ and one which "decrements" it, to derive a new equation that keeps $k$ constant, thus relating $(\hat{x}, \hat{x})_k$ to itself. In particular we can plug the fourth 2nd-order recursion, applied to $(\hat{x}, \hat{x})_{k+1}$, into the third recursion and simplify:

    $(\hat{x}, \hat{x})_k = H (\hat{x}, \hat{x})_k H^\top + ((\hat{x}, y)_k - \hat{x}_{1+k} y_1^\top) K^\top - H \hat{x}_T \hat{x}_{T-k}^\top H^\top + K (y, \hat{x})_{k+1} H^\top + \hat{x}_{1+k} \hat{x}_1^\top$

Then using the same basic method we can derive a similar equation for $(x^T, x^T)_k$:

    $(x^T, x^T)_k = J (x^T, x^T)_k J^\top + P ((\hat{x}, x^T)_k - \hat{x}_T (x_{T-k}^T)^\top) - J x_{k+1}^T (x_1^T)^\top J^\top + J (x^T, \hat{x})_{k+1} P^\top + x_T^T (x_{T-k}^T)^\top$

We will call these two equations the "ASOS equations". Our basic strategy will be to exploit the self-referential nature of the first ASOS equation in order to find a solution for $(\hat{x}, \hat{x})_{k_{\lim}}$ (taking $k = k_{\lim}$).
Complicating this idea is the presence of additional unknown matrix quantities in the equation, and so before we can proceed we must find a way to express these in terms of $(\hat{x}, \hat{x})_{k_{\lim}}$. By repeated application of the first and second 2nd-order recursions, followed by an application of the first ASOS approximation, $(\hat{x}, y)_{k_{\lim}}$ can be expressed as:

    $(\hat{x}, y)_{k_{\lim}} = (\hat{x}, y)_{k_{\lim}}^{\circ} + H^{2k_{\lim}+1} (\hat{x}, \hat{x})_{k_{\lim}} A^\top C^\top$

where $(\hat{x}, y)_{k_{\lim}}^{\circ}$ is the value of $(\hat{x}, y)_{k_{\lim}}$ as computed by solving these recursions starting from $(\hat{x}, \hat{x})_{k_{\lim}} = 0$. This formula can be easily verified by following the dependency on $(\hat{x}, \hat{x})_{k_{\lim}}$ through said recursions. Substituting this expression for $(\hat{x}, y)_{k_{\lim}}$ and the first ASOS approximation for $(y, \hat{x})_{k_{\lim}+1}$ into the first ASOS equation, then simplifying, gives:

    $(\hat{x}, \hat{x})_k = A (\hat{x}, \hat{x})_k H^\top + H^{2k+1} (\hat{x}, \hat{x})_k A^\top C^\top K^\top + G$
    $G \equiv -A \hat{x}_T \hat{x}_{T-k}^\top H^\top + ((\hat{x}, y)_k^{\circ} - \hat{x}_{1+k} y_1^\top) K^\top + \hat{x}_{1+k} \hat{x}_1^\top$

where $k = k_{\lim}$ for brevity. Since we can compute $(\hat{x}, y)_{k_{\lim}}^{\circ}$ by just running the recursions under $(\hat{x}, \hat{x})_{k_{\lim}} = 0$, the only unknown quantity in this equation, which we will call the "primary equation", is $(\hat{x}, \hat{x})_{k_{\lim}}$. Moreover, this equation is linear in $(\hat{x}, \hat{x})_{k_{\lim}}$, which gives us some hope that we can solve it. Unfortunately, it is not clear at first glance how to do this efficiently. The equation almost has the form of a Sylvester equation, for which there are well-known efficient algorithms (Bartels & Stewart, 1972), but is slightly more complicated due to the presence of the term $H^{2k_{\lim}+1} (\hat{x}, \hat{x})_{k_{\lim}} A^\top C^\top K^\top$. The good news is that we have developed an iterative algorithm for solving this equation which seems to converge very quickly in practice (for details, see the supplement available at http://www.cs.toronto.edu/~jmartens/ASOS).

With the solution of the primary equation we can use the ASOS approximation for $(y, \hat{x})_{k_{\lim}+1}$ and recursively compute $(y, \hat{x})_k$ for $k = k_{\lim}$ down to 0 using the first 2nd-order recursion. Then, using the fact that $(\hat{x}, y)_0 = (y, \hat{x})_0^\top$, we may recursively compute $(\hat{x}, y)_k$ for $k = 0$ to $k_{\lim}$ using the second 2nd-order recursion. With $(\hat{x}, y)_k$ computed we may then use the third 2nd-order recursion to recursively compute $(\hat{x}, \hat{x})_k$ for $k = k_{\lim}$ down to 0. Having computed $(\hat{x}, y)_k$ also allows us to recursively compute $(x^T, y)_k$ via the fifth 2nd-order recursion, starting the recursion with the ASOS approximation for $(x^T, y)_{k_{\lim}}$. Next, with $(\hat{x}, \hat{x})_k$ computed for $k = 0$ to $k_{\lim}$, we may use the sixth 2nd-order recursion to recursively compute $(x^T, \hat{x})_k$, starting the recursion with the second ASOS approximation (i.e.
the one for $(x^T, \hat{x})_{k_{\lim}}$). Finally, we may compute $(x^T, x^T)_0$ by solving the second ASOS equation (which can be done efficiently since it has the form of a Lyapunov equation) and use the seventh 2nd-order recursion to compute $(x^T, x^T)_1$ from this.

3.6. Relaxing the Steady-state Assumption

To derive the 2nd-order equations in their simple form it is critical that the filtering and smoothing matrices $K$, $H$ and $P$ do not vary with the time-step $t$. Otherwise they can't be factored out of the sums, making it impossible to write the recursions only in terms of 2nd-order statistics and nuisance terms. We know that the LDS rapidly reaches steady state (up to an arbitrary precision) everywhere except for some leading and trailing $i$ time-steps, where $i$ is not a function of $T$ and generally $i \ll T$. Thus we can apply the ASOS method to approximate the statistics over this middle interval, and use the non-steady-state Kalman recursions to (approximately) compute the statistics associated with the first and last $i$ time-steps. $i$ can be determined by monitoring convergence of $K_t$ to $K$ while running the Kalman filter, or just set at some reasonably large fixed value.

4. Error analysis
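The author's iterative solver for the primary equation is described only in the supplement. Purely as an illustration of one way such a linear matrix equation could be attacked (this is our own stand-in, not the paper's algorithm), the sketch below applies naive fixed-point iteration to $X = A X H^\top + H^{2k+1} X A^\top C^\top K^\top + G$, which converges when the associated linear map is a contraction:

```python
import numpy as np

def solve_primary(A, H, C, K, G, k, iters=2000, tol=1e-12):
    """Fixed-point iteration X <- A X H^T + H^(2k+1) X (A^T C^T K^T) + G.
    A sketch only: convergence requires the linear map on the right-hand
    side to have spectral radius below 1."""
    Hp = np.linalg.matrix_power(H, 2 * k + 1)
    M = A.T @ C.T @ K.T                  # right factor of the extra term
    X = G.copy()
    for _ in range(iters):
        X_new = A @ X @ H.T + Hp @ X @ M + G
        if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
            return X_new
        X = X_new
    return X
```

Since $H^{2k+1}$ is typically tiny for the relevant $k = k_{\lim}$ (its spectral radius is below 1), the extra term is a small perturbation of a standard Sylvester-type equation, which is what makes simple iteration plausible here.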
The relationship between klim and the approximation error In this section we will derive a set of formulae which quantify the error in the M-statistics as computed via the 2nd order recursions in terms of the error introducted due to the Learning the Linear Dynamical System with ASOS approximations. This ends up being a linear relationship relationship because the 2nd -order recursions are linear in the 2nd -order statistics. The notable feature of this relationship is that its `strength' decays exponentially as klim grows, thus providing one justification for idea that the quality of the approximation increases with the value of klim . Consider each of the three 2nd -order statistics approximated directly by the ASOS approximations, adding (x , x )klim to this list. We will call these the "Directly Approximated Statistics" or the DAS. The following result helps quantify the error in the M-statistics in terms of error in the DAS. Note that error due to any approximation in the 1st -order nuisance terms will be ignored in this analysis for the purpose of simplicity. We will briefly address this problem at the end of the section. Claim 2. Given a fixed setting of the parameters there exists some 0 < 1 such that for each M-statistic the difference between the true value and the value as approximated by the ASOS procedure can be expressed as a linear function of the approximation error in the DAS whose operator norm is bounded above by cklim 2 klim -1 for some constant c that doesn't depend on klim . Proof. The proof is straightforward. turns out to be the spectral radius of H and J (they are equal). For a detailed proof see the supplement available at http://www.cs.toronto.edu/~jmartens/ASOS. the above claim still holds and our proof can be easily extended. 4.2. Asymptotic Behavior In this section we will characterize the asymptotic behavior of the ASOS approximations as T under the condition that the data is in fact generated from the model. 
While this scenario is artificial, it can nevertheless inform us about how the approximations will behave in practice. In particular, if the model is close to describing the data, and T is sufficiently large, then this characterization should at least describe the real situation approximately. We have already established in section 3.4 that the ASOS approximations are unbiased in this setting. The first objective of this section is to establish a deeper result, that the error in the ASOS approximations converges to 0 in the expected squared · 2 -norm (viewing the matrices as vec1 tors) as long as we scale everything by T . This rescaling nd is a natural thing to do because the 2 -order statistics are all sums over T elements, and thus their expected size grows with T . Then having established this result we will outline the proof of an important consequence, namely that the M-step updates which use the ASOS-approximated Mstatistics will converge in probability to the exact updates as T . Note that since we are scaling all of the equations and 1 statistics by T the effect of the nuisance terms in each equation will go to zero as T and so we can ignore them in the analysis. Let i be the (true) value of the error in the ith ASOS approximation, i.e. the value of the left side minus the right. So for example, 2 = (xT , x )klim - (x , x )klim . Then we have the following claim which characterizes the expected size of each i : Claim 3. For i = 1, 2, 3: lim E [ 1 vec(i ) T 2 2 Note that the above result does not assume anything about the particular approximations being used for the DAS. So unless the approximation error of one of the DAS grows extremely quickly with klim we can conclude that the error in the M-statistics will decay exponentially as klim grows. And since the expected size of any 2nd -order statistic can be bounded, even a naive approximation of 0 for each DAS will ensure that the associated expected error is bounded. 
In practice we have found that can often be significantly less than 1, even when the spectral radius of A is relatively close to 1. However, as the EM algorithm progresses and the model (as determined by the evolving parameters) becomes more "confident", may occasionally grow large enough that klim -1 won't be very close to 0. Fortunately, there is another result which we present in the next section that allows us to bound the error in a manner that doesn't depend on but is instead related to the value of T . Having ignored the issue of approximating the the 1st -order nuisance terms in the above analysis we will now briefly address it. If these terms are approximated by applying the steady-state Kalman recursions to the leading and trailing klim + 1 time-steps, which is the approach we advocate, and if we add xT klim+1 and xT -klim to the DAS list, then T ]=0 Proof. The ASOS approximations were derived by finding unbiased estimators at each time-step and then summing over time. It turns out that the approximation errors also have zero correlation across time which is the critical property required to prove this result. See the supplement for details. 1 Claim 4. The approximation error in T -scaled 2nd -order statistics as estimated by the ASOS procedure converges to 0 in expected squared · 2 -norm as T . Proof. This follows from the fact that the ASOS procedure is just an efficient method solving a large linear system Learning the Linear Dynamical System with ASOS Table 1. Per-iteration computational complexity EM 3 O(Nx T ) SS-EM 2 3 O(Nx T + Nx i) ASOS-EM 3 O(Nx klim ) 6. Relationship to 1st -order approximations and 4SID A natural question to ask is if there is some approximation for the individual mean terms (i.e. x and xT t), that t t when multiplied and summed appropriately, gives the same estimates for the 2nd -order statistics that ASOS does. 
If such an equivalence did exist then the approximated statistic (xT , xT )0 would always be positive definite, which isn't true in general (although it will always be symmetric). Comparisons to 4SID can be made in terms of the approximation being used. As mentioned in our previous discussion of 4SID, the state estimates it (implicitly) computes for each time-step are equivalent to the estimates which would be produced by a non-steady-state Kalman filter that starts i time-steps in the past. The estimates produced by ASOS are of a different character in the sense that when they include information from the future as well as they are derived from both the filtering and smoothing Kalman recursions. Note however that the ASOS approximations require the model parameters to be available (i.e. the estimate produced by the EM iteration) while the 4SID estimates do not require any pre-existing parameter estimate, which is why the algorithm is non-iterative. whose coefficients are not a function of T , and thus the procedure can only "amplify" the errors due to the three ASOS approximations by a constant factor. For a detailed proof, see the supplement. Claim 5. The parameter updates produced by the M-step using the approximated M-statistics will converge to those produced using the true M-statistics as T . Proof. For the covariance parameters R and Q the update 1 formula are linear in the T -scaled M-statistics (for A it's 1 actually a T -1 scaling, but this is equivalent in the limit) and thus converge in the expected squared · 2 -norm, which implies convergence in probability. For the parameters A and C we cannot prove convergence in the expected squared · 2 -norm but we can still prove convergence in probability. First, note that we may replace 1 the M-statistics in the update formula with their T -scaled 1 counterparts since the scaling factor T will be canceled due to the matrix inversion. 
Second, note that convergence in expected squared · 2 -norm of the approximate Mstatistics to the true ones implies their convergence in probability. Finally, note that the exact value of the M-statistic which gets inverted is non-singular (it must be, since otherwise the update formula is undefined) and thus the formula is continuous at this point. Convergence in probability of the update formula then follows by the Continuous Mapping Theorem for random variables. 7. Experimental Setup Our experiments were designed to examine the trade-off between the solution quality and speed of ASOS-EM as a function of the meta-parameter klim , while using standard EM and SS-EM as baselines. All algorithms were implemented using carefully vectorized MATLAB code and run on an Intel 3.2GHz quad-core machine. Exact loglikelihoods were computed every 10 iterations as each algorithm ran. The runs were all initialized with the same random initial parameters. Our implementations of ASOSEM and SS-EM both used the "relaxed" steady-state approximation with i fixed to 25. We used 3 datasets in our experiments. The first was a 3dimensional time-series of length 6305 which consisted of sensor readings from an industrial milk evaporator. This is a standard dataset used in system identification and is available on-line from the Database for the Identification of Systems (DaISy). The second dataset consisted of the first 10 dimensions of a 49-dimensional time-series of length 15300 consisting of transformed sensor readings from a motion capture experiment. This dataset is available online from the Carnegie Mellon University Motion Capture Database and was preprocessed as in Taylor et al. (2007). The third dataset was from the Signal Processing Information Base (SPIB) based at Rice University and consisted of the first 750,000 time-steps (38 seconds) of an audio recording taken in the noisy `Operations Room" of a de- 5. 
5. Computational complexity

The per-iteration computational complexity of EM, ASOS-EM (EM approximated via ASOS), and SS-EM (EM via direct evaluation of the steady-state approximated Kalman recursions) is given in Table 1. Note that we are assuming that T is the dominant term, followed by Nx, then klim, and finally i (i is defined as in section 3.6). The key difference between the per-iteration running time of ASOS-EM and that of EM or SS-EM is that there is no dependency on T. The only T-dependent computation required for ASOS-EM is the pre-computation of (y, y)_k for 0 ≤ k ≤ klim + 1, which needs to be performed only once, before the EM iterations begin. The y statistics can even be computed online, so that the complete time-series never needs to be stored in memory.

[Figure 1, top panel: negative log-likelihood versus iterations on the evaporator dataset, with legend runtimes ASOS-EM-5 (13.1667 s), ASOS-EM-10 (13.9436 s), ASOS-EM-20 (16.6538 s), ASOS-EM-35 (21.0141 s), ASOS-EM-75 (27.1616 s), SS-EM (56.804 s), EM (692.6135 s).]

Whether or not the reader accepts these run-times as reasonable indicators of relative performance, the fact remains that ASOS-EM is asymptotically faster than either SS-EM or EM per iteration, since its iteration cost is independent of T. Where ASOS-EM seems to diverge from standard EM (when it does at all) is in the later stages of convergence. This is likely explained by the fact that, up until the end of the optimization, the parameter estimates reflect a shorter-term temporal dependency (as indicated by the value of ), and thus the ASOS approximation is close to exact. It is also apparent, from the non-monotonic log-likelihood trend observed in the results for ASOS-EM-5 in the first graph, that ASOS-EM cannot guarantee, in general, an increase in the log-likelihood at each iteration. Overall these results are very encouraging and motivate further exploration of the ASOS method and its applications.
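The one-time, T-dependent accumulation of the statistics (y, y)_k described in Section 5 can be done in a single streaming pass. The following NumPy sketch assumes the definition (y, y)_k = Σ_t y_{t+k} y_t^T; the function and variable names are our own, not those of the ASOS implementation:

```python
import numpy as np

def streaming_yy_stats(y_stream, Ny, klim):
    """Accumulate (y, y)_k = sum_t y_{t+k} y_t^T for 0 <= k <= klim + 1
    in one pass, holding only the last klim + 2 observations in memory."""
    K = klim + 2                  # lags k = 0, ..., klim + 1
    stats = np.zeros((K, Ny, Ny))
    window = []                   # most recent observations, newest first
    for y in y_stream:
        window.insert(0, np.asarray(y))
        if len(window) > K:
            window.pop()
        for k, y_past in enumerate(window):
            stats[k] += np.outer(y, y_past)   # y is y_{t+k}, y_past is y_t
    return stats
```

Because only klim + 2 observations are retained at any moment, the complete time-series never needs to be stored, matching the online computation described above.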
It remains to be seen if this approach can be extended to a continuous-time version of the LDS, or to one that uses control signal inputs.

We will present our results as a series of graphs of log-likelihood versus iteration number, with the meta-parameters and other important details of each experiment given in the captions. 'ASOS-EM-n' denotes a run of ASOS-EM with klim = n.

[Figure 1, middle panel: negative log-likelihood versus iterations on the motion-capture dataset, with legend runtimes ASOS-EM-20 (52.0094 s), ASOS-EM-30 (55.623 s), ASOS-EM-50 (59.9489 s), SS-EM (215.6504 s), EM (6018.4049 s). Bottom panel: negative log-likelihood versus iterations on the destroyer audio dataset, with legend runtimes ASOS-EM-150 (49.6942 s), ASOS-EM-300 (84.4925 s), ASOS-EM-850 (214.039 s), ASOS-EM-1700 (409.0395 s), SS-EM (9179.0774 s).]

Figure 1. NOTE: Running times are included in the figure legends in brackets. Top: Results from the evaporator dataset with Nx = 15. Middle: Results from the mo-cap data with Nx = 40. Bottom: Results from the destroyer operations room audio dataset with Nx = 20. NOTE: Graphs are highly zoomed, so the differences between the algorithms may appear more significant than they actually are.

8. Discussion of Results

Our experiments demonstrate that ASOS-EM converges in roughly the same number of iterations as standard EM, with the solution quality approaching that achieved by standard EM as klim is raised, as predicted by the theory. Moreover, the computational performance advantages of ASOS-EM over EM and SS-EM are clearly evident in these experiments, even for values of klim at which the solution quality is virtually identical to that of standard EM. We included run-times with our results only to demonstrate that ASOS-EM can achieve real performance improvements in a reasonable implementation setting (carefully vectorized MATLAB code).

Acknowledgments

The author would like to thank Geoffrey Hinton and Richard Zemel for their helpful advice. This work was supported by NSERC and the University of Toronto.

References

Bartels, R. H. and Stewart, G. W. Solution of the matrix equation AX + XB = C. Commun. ACM, 15(9), 1972.
Ghahramani, Z. and Hinton, G. E. Parameter estimation for linear dynamical systems. Technical report, University of Toronto, 1996.
Goodwin, G. C. and Sin, K. S. Adaptive Filtering Prediction and Control. Prentice-Hall, 1984.
Ljung, L. Prediction error estimation methods. Circuits, Systems, and Signal Processing, 2002.
Overschee, P. Van and Moor, B. De. Subspace algorithms for the stochastic identification problem. In 30th IEEE Conference on Decision and Control, 1991.
Shumway, R. H. and Stoffer, D. S. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 1982.
Smith, G., Freitas, J. F. De, Niranjan, M., and Robinson, T. Speech modelling using subspace and EM techniques. In NIPS, 1999.
Smith, G. A. and Robinson, A. J. A comparison between the EM and subspace identification algorithms for time-invariant linear dynamical systems. Technical report, Cambridge University, 2000.
Taylor, G. W., Hinton, G. E., and Roweis, S. Modeling human motion using binary latent variables. In NIPS, 2007.