Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

Emily B. Fox
Electrical Engineering & Computer Science, Massachusetts Institute of Technology
ebfox@mit.edu

Erik B. Sudderth, Michael I. Jordan
Electrical Engineering & Computer Science, University of California, Berkeley
{sudderth, jordan}@eecs.berkeley.edu

Alan S. Willsky
Electrical Engineering & Computer Science, Massachusetts Institute of Technology
willsky@mit.edu

Abstract

Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. In this paper, we present a nonparametric approach to the learning of an unknown number of persistent, smooth dynamical modes by utilizing a hierarchical Dirichlet process prior. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.

1 Introduction

Linear dynamical systems (LDSs) are useful in describing dynamical phenomena as diverse as human motion [9], financial time series [4], maneuvering targets [6, 10], and the dance of honey bees [8]. However, such phenomena often exhibit structural changes over time, and the LDS models which describe them must also change. For example, a coasting ballistic missile makes an evasive maneuver; a country experiences a recession, a central bank intervention, or some national or global event; a honey bee changes from a waggle to a turn-right dance. Some of these changes appear frequently, while others are only rarely observed. In addition, there is always the possibility of a new, previously unseen dynamical behavior. These considerations motivate us to develop a nonparametric Bayesian approach for learning switching LDS (SLDS) models. We also consider a special case of the SLDS, the switching vector autoregressive (VAR) model, in which direct observations of the underlying dynamical process are assumed available. Although a special case of the general linear systems framework, autoregressive models have simplifying properties that often make them a practical choice in applications.

One can view switching dynamical processes as an extension of hidden Markov models (HMMs) in which each HMM state, or mode, is associated with a dynamical process. Existing methods for learning SLDSs and switching VAR processes rely on either fixing the number of HMM modes, as in [8], or considering a change-point detection formulation in which each inferred change is to a new, previously unseen dynamical mode, as in [14]. In this paper we show how one can remain agnostic about the number of dynamical modes while still allowing for returns to previously exhibited dynamical behaviors.

Hierarchical Dirichlet processes (HDPs) can be used as a prior on the parameters of HMMs with unknown mode space cardinality [2, 12]. In this paper we make use of a variant of the HDP-HMM, the sticky HDP-HMM of [5], which provides improved control over the number of modes inferred by the HDP-HMM; such control is crucial for the problems we examine.
Although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption: observations are conditionally independent given the mode sequence. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. Our nonparametric Bayesian approach for learning switching dynamical processes extends the sticky HDP-HMM formulation to learn an unknown number of persistent, smooth dynamical modes and thereby capture a wider range of temporal dependencies.

2 Background: Switching Linear Dynamical Systems

A state space (SS) model provides a general framework for analyzing many dynamical phenomena. The model consists of an underlying state, $x_t \in \mathbb{R}^n$, with linear dynamics observed via $y_t \in \mathbb{R}^d$. A linear time-invariant SS model, in which the dynamics do not depend on time, is given by
$$x_t = A x_{t-1} + e_t \qquad y_t = C x_t + w_t, \tag{1}$$
where $e_t$ and $w_t$ are independent Gaussian noise processes with covariances $\Sigma$ and $R$, respectively.

An order-$r$ VAR process, denoted VAR($r$), with observations $y_t \in \mathbb{R}^d$, can be defined as
$$y_t = \sum_{i=1}^{r} A_i y_{t-i} + e_t \qquad e_t \sim \mathcal{N}(0, \Sigma). \tag{2}$$
Here, the observations depend linearly on the previous $r$ observation vectors. Every VAR($r$) process can be described in SS form by, for example, the following transformation:
$$x_t = \begin{bmatrix} A_1 & A_2 & \cdots & A_r \\ I & 0 & \cdots & 0 \\ & \ddots & & \vdots \\ 0 & & I & 0 \end{bmatrix} x_{t-1} + \begin{bmatrix} I \\ 0 \\ \vdots \\ 0 \end{bmatrix} e_t \qquad y_t = \begin{bmatrix} I & 0 & \cdots & 0 \end{bmatrix} x_t. \tag{3}$$
Note that there are many such equivalent minimal SS representations that result in the same input-output relationship, where minimality implies that there does not exist a realization with lower state dimension. On the other hand, not every SS model may be expressed as a VAR($r$) process for finite $r$ [1]. We can thus conclude that the class of SS models with state dimension $r \cdot d$ and arbitrary dynamic matrix $A$ subsumes the class of VAR($r$) processes.

The dynamical phenomena we examine in this paper exhibit behaviors better modeled as switches between a set of linear dynamical models. Due to uncertainty in the mode of the process, the overall model is nonlinear. We define a switching linear dynamical system (SLDS) by
$$x_t = A^{(z_t)} x_{t-1} + e_t(z_t) \qquad y_t = C x_t + w_t. \tag{4}$$
Here, $z_t$ indexes the mode-specific LDS at time $t$ and is taken to have Markov structure. We similarly define a switching VAR($r$) process by
$$y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t) \qquad e_t(z_t) \sim \mathcal{N}(0, \Sigma^{(z_t)}). \tag{5}$$
Note that the underlying state dynamics of the SLDS are equivalent to a switching VAR(1) process.

3 Background: Dirichlet Processes and the Sticky HDP-HMM

A Dirichlet process (DP), denoted DP($\gamma$, $H$), is a distribution over countably infinite random measures
$$G_0(\theta) = \sum_{k=1}^{\infty} \beta_k\, \delta(\theta - \theta_k) \qquad \theta_k \sim H \tag{6}$$
on a parameter space $\Theta$. The weights are sampled via a stick-breaking construction [11]:
$$\beta_k = \nu_k \prod_{\ell=1}^{k-1} (1 - \nu_\ell) \qquad \nu_k \sim \text{Beta}(1, \gamma). \tag{7}$$
We denote this distribution by $\beta \sim \text{GEM}(\gamma)$. The DP is commonly used as a prior on the parameters of a mixture model of unknown complexity, resulting in a DP mixture model (see Fig. 1(a)). To generate observations, we choose $\bar{\theta}_i \sim G_0$ and $y_i \sim F(\bar{\theta}_i)$. This sampling process is often described via a discrete variable $z_i \sim \beta$ indicating which component generates $y_i \sim F(\theta_{z_i})$.

Figure 1: For all graphs, $\beta \sim \text{GEM}(\gamma)$ and $\theta_k \sim H(\lambda)$. (a) DP mixture model in which $z_i \sim \beta$ and $y_i \sim f(y \mid \theta_{z_i})$. (b) HDP mixture model with $\pi_j \sim \text{DP}(\alpha, \beta)$, $z_{ji} \sim \pi_j$, and $y_{ji} \sim f(y \mid \theta_{z_{ji}})$. (c)-(d) Sticky HDP-HMM prior on switching VAR(2) and SLDS processes, with the mode evolving as $z_{t+1} \sim \pi_{z_t}$ for $\pi_k \sim \text{DP}(\alpha + \kappa, (\alpha\beta + \kappa\delta_k)/(\alpha + \kappa))$. The dynamical processes are as in Eq. (13).
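The stick-breaking weights of Eq. (7) are straightforward to draw at a finite truncation level. The following is an illustrative Python/NumPy sketch (not code from the paper); the truncation level `L` and the concentration `gamma` are arbitrary choices.

```python
import numpy as np

def stick_breaking_weights(gamma, L, rng=np.random.default_rng(0)):
    """Draw L truncated stick-breaking weights beta ~ GEM(gamma), as in Eq. (7)."""
    nu = rng.beta(1.0, gamma, size=L)                         # nu_k ~ Beta(1, gamma)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    beta = nu * remaining                                      # beta_k = nu_k * prod_{l<k} (1 - nu_l)
    return beta / beta.sum()                                   # renormalize the truncated weights

beta = stick_breaking_weights(gamma=5.0, L=20)
print(beta[:5], beta.sum())
```

The renormalization at the end is one common way of handling the truncation; larger `gamma` spreads mass over more components.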
The hierarchical Dirichlet process (HDP) [12] extends the DP to cases in which groups of data are produced by related, yet distinct, generative processes. Taking a hierarchical Bayesian approach, the HDP places a global Dirichlet process prior DP($\gamma$, $H$) on $G_0$, and then draws group-specific distributions $G_j \sim \text{DP}(\alpha, G_0)$. Here, the base measure $G_0$ acts as an "average" distribution ($E[G_j] = G_0$) encoding the frequency of each shared, global parameter:
$$G_j(\theta) = \sum_{t=1}^{\infty} \tilde{\pi}_{jt}\, \delta(\theta - \tilde{\theta}_{jt}) \qquad \tilde{\pi}_j \sim \text{GEM}(\alpha), \quad \tilde{\theta}_{jt} \sim G_0 \tag{8}$$
$$\phantom{G_j(\theta)} = \sum_{k=1}^{\infty} \pi_{jk}\, \delta(\theta - \theta_k) \qquad \pi_j \sim \text{DP}(\alpha, \beta). \tag{9}$$
Because $G_0$ is discrete, multiple $\tilde{\theta}_{jt} \sim G_0$ may take identical values $\theta_k$. Eq. (9) aggregates these probabilities, allowing an observation $y_{ji}$ to be directly associated with the unique global parameters via an indicator random variable $z_{ji} \sim \pi_j$. See Fig. 1(b).

An alternative, non-constructive characterization of samples $G_0 \sim \text{DP}(\gamma, H)$ from a Dirichlet process states that for every finite partition $\{A_1, \ldots, A_K\}$ of $\Theta$,
$$(G_0(A_1), \ldots, G_0(A_K)) \sim \text{Dir}(\gamma H(A_1), \ldots, \gamma H(A_K)). \tag{10}$$
Using this expression, it can be shown that the following finite, hierarchical mixture model converges in distribution to the HDP as $L \to \infty$ [7, 12]:
$$\beta \sim \text{Dir}(\gamma/L, \ldots, \gamma/L) \qquad \pi_j \sim \text{Dir}(\alpha\beta_1, \ldots, \alpha\beta_L). \tag{11}$$
This weak limit approximation is used by the sampler of Sec. 4.2.

The HDP can be used to develop an HMM with an unknown, potentially infinite mode space [12]. For this HDP-HMM, each HDP group-specific distribution, $\pi_j$, is a mode-specific transition distribution and, due to the infinite mode space, there are infinitely many groups. Let $z_t$ denote the mode of the Markov chain at time $t$. For Markov chains, $z_t \sim \pi_{z_{t-1}}$, so that $z_{t-1}$ indexes the group to which $y_t$ is assigned. The current HMM mode $z_t$ then indexes the parameter $\theta_{z_t}$ used to generate observation $y_t$. See Fig. 1(c), ignoring the direct correlation in the observations.

By sampling $\pi_j \sim \text{DP}(\alpha, \beta)$, the HDP prior encourages modes to have similar transition distributions ($E[\pi_{jk}] = \beta_k$). However, it does not differentiate self-transitions from moves between modes. When modeling dynamical processes with mode persistence, the flexible nature of the HDP-HMM prior allows for mode sequences with unrealistically fast dynamics to have large posterior probability. Recently, it has been shown [5] that one may mitigate this problem by instead considering a sticky HDP-HMM, in which $\pi_j$ is distributed as follows:
$$\pi_j \sim \text{DP}\!\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right). \tag{12}$$
Here, $(\alpha\beta + \kappa\delta_j)$ indicates that an amount $\kappa > 0$ is added to the $j$th component of $\alpha\beta$. The measure of $\pi_j$ over a finite partition $(Z_1, \ldots, Z_K)$ of the positive integers $\mathbb{Z}_+$, as described by Eq. (10), adds an amount $\kappa$ only to the arbitrarily small partition containing $j$, corresponding to a self-transition. When $\kappa = 0$ the original HDP-HMM is recovered. We place a vague prior on $\kappa$ and learn the self-transition bias from the data.
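To make the weak limit approximation of Eq. (11) and the sticky prior of Eq. (12) concrete, the following is a minimal illustrative sketch (not code from the paper) of drawing the global weights and the mode-specific transition distributions with self-transition bias; the parameter values are arbitrary.

```python
import numpy as np

def sample_sticky_transitions(gamma, alpha, kappa, L, rng=np.random.default_rng(1)):
    """Weak-limit draw from the sticky HDP-HMM prior (Eqs. 11-12)."""
    beta = rng.dirichlet(np.full(L, gamma / L))               # beta ~ Dir(gamma/L, ..., gamma/L)
    pi = np.empty((L, L))
    for j in range(L):
        concentration = alpha * beta + kappa * np.eye(L)[j]   # add kappa to the j-th entry
        pi[j] = rng.dirichlet(concentration)
    return beta, pi

beta, pi = sample_sticky_transitions(gamma=1.0, alpha=1.0, kappa=10.0, L=10)
print(pi.diagonal().mean())   # self-transition mass is inflated by kappa
```

Setting `kappa=0` recovers the ordinary weak-limit HDP-HMM prior.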
4 The HDP-SLDS and HDP-AR-HMM Models

For greater modeling flexibility, we take a nonparametric approach in defining the mode space of our switching dynamical processes. Specifically, we develop extensions of the sticky HDP-HMM for both the SLDS and switching VAR models. For the SLDS, we consider conditionally dependent emissions of which only noisy observations are available (see Fig. 1(d)). For this model, which we refer to as the HDP-SLDS, we place a prior on the parameters of the SLDS and infer their posterior from the data. We do, however, fix the measurement matrix, $C$, for reasons of identifiability.

Let $\tilde{C} \in \mathbb{R}^{d \times n}$, $n \ge d$, be the measurement matrix associated with a dynamical system defined by $\tilde{A}$, and assume $\tilde{C}$ has full row rank. Then, without loss of generality, we may consider $C = [I\ \ 0]$, since there exists an invertible transformation $T$ such that the pair $C = \tilde{C}T = [I\ \ 0]$ and $A = T^{-1}\tilde{A}T$ defines an equivalent input-output system. The dimensionality of $I$ is determined by that of the data. Our choice of the number of columns of zeros is, in essence, a choice of model order.

The previous work of Fox et al. [6] considered a related, yet simpler, formulation for modeling a maneuvering target as a fixed LDS driven by a switching exogenous input. Since the number of maneuver modes was assumed unknown, the exogenous input was taken to be the emissions of an HDP-HMM. That work can in turn be viewed as an extension of the work by Caron et al. [3], in which the exogenous input was an independent noise process generated from a DP mixture model. The HDP-SLDS is a major departure from these works since the dynamic parameters themselves change with the mode and are learned from the data, providing a much more expressive model.

The switching VAR($r$) process can similarly be posed as an HDP-HMM in which the observations are modeled as conditionally VAR($r$). This model is referred to as the HDP-AR-HMM and is depicted in Fig. 1(c). The generative processes for the two models are summarized as follows:

Mode dynamics (both models): $z_t \sim \pi_{z_{t-1}}$.
Observation dynamics, HDP-AR-HMM: $y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t)$.
Observation dynamics, HDP-SLDS: $x_t = A^{(z_t)} x_{t-1} + e_t(z_t)$, $y_t = C x_t + w_t$.  (13)

Here, $\pi_j$ is as defined in Sec. 3 and the additive noise processes are as in Sec. 2.

4.1 Posterior Inference of Dynamic Parameters

In this section we focus on developing a prior to regularize the learning of the different dynamical modes conditioned on a fixed mode assignment $z_{1:T}$. For the SLDS, we analyze the posterior distribution of the dynamic parameters given a fixed, known state sequence $x_{1:T}$. Methods for learning the number of modes and resampling the sequences $x_{1:T}$ and $z_{1:T}$ are discussed in Sec. 4.2.

Conditioned on the mode sequence, one may partition the observations into $K$ different linear regression problems, where $K = |\{z_1, \ldots, z_T\}|$. That is, for each mode $k$, we may form a matrix $Y^{(k)}$ with $N_k$ columns consisting of the observations $y_t$ with $z_t = k$. Then,
$$Y^{(k)} = A^{(k)} \bar{Y}^{(k)} + E^{(k)}, \tag{14}$$
where $A^{(k)} = [A_1^{(k)} \cdots A_r^{(k)}]$, $\bar{Y}^{(k)}$ is a matrix of lagged observations, and $E^{(k)}$ the associated noise vectors. Let $D^{(k)} = \{Y^{(k)}, \bar{Y}^{(k)}\}$. The posterior distribution over the VAR($r$) parameters associated with the $k$th mode decomposes as follows:
$$p(A^{(k)}, \Sigma^{(k)} \mid D^{(k)}) = p(A^{(k)} \mid \Sigma^{(k)}, D^{(k)})\, p(\Sigma^{(k)} \mid D^{(k)}). \tag{15}$$
We place a conjugate matrix-normal inverse-Wishart prior on the parameters $\{A^{(k)}, \Sigma^{(k)}\}$ [13], providing a reasonable combination of flexibility and analytical convenience. A matrix $A \in \mathbb{R}^{d \times m}$ has a matrix-normal distribution $\mathcal{MN}(A; M, V, K)$ if
$$p(A) = \frac{|K|^{d/2}}{|2\pi V|^{m/2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left((A - M)^T V^{-1} (A - M) K\right)\right\}, \tag{16}$$
where $M$ is the mean matrix and $V$ and $K^{-1}$ are the covariances along the rows and columns, respectively. A vectorization of the matrix $A$ results in $p(\operatorname{vec}(A)) = \mathcal{N}(\operatorname{vec}(M),\ K^{-1} \otimes V)$, where $\otimes$ denotes the Kronecker product. The resulting posterior is derived as
$$p(A^{(k)} \mid \Sigma^{(k)}, D^{(k)}) = \mathcal{MN}\!\left(A^{(k)};\ S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)},\ \Sigma^{(k)},\ S_{\bar{y}\bar{y}}^{(k)}\right), \tag{17}$$
with $B^{-(k)}$ denoting $(B^{(k)})^{-1}$ for a given matrix $B$, and
$$S_{\bar{y}\bar{y}}^{(k)} = \bar{Y}^{(k)}\bar{Y}^{(k)T} + K \qquad S_{y\bar{y}}^{(k)} = Y^{(k)}\bar{Y}^{(k)T} + MK \qquad S_{yy}^{(k)} = Y^{(k)}Y^{(k)T} + MKM^T. \tag{18}$$
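The posterior parameters of Eqs. (17)-(18), together with the conditional scatter matrix that feeds the inverse-Wishart update given next in Eq. (19), reduce to a few matrix products. Below is a minimal illustrative sketch (not the authors' code); it assumes the per-mode matrices `Y` and `Ybar` have already been assembled as in Eq. (14).

```python
import numpy as np

def mniw_posterior_params(Y, Ybar, M, K, S0, n0):
    """Posterior parameters of the matrix-normal inverse-Wishart prior for one mode.

    Y:    d x N matrix of observations assigned to this mode
    Ybar: (r*d) x N matrix of the corresponding lagged observations
    M, K: matrix-normal prior mean and column precision; S0, n0: IW prior parameters."""
    S_bb = Ybar @ Ybar.T + K                        # S_{ybar,ybar}, Eq. (18)
    S_yb = Y @ Ybar.T + M @ K                       # S_{y,ybar}
    S_yy = Y @ Y.T + M @ K @ M.T                    # S_{y,y}
    S_bb_inv = np.linalg.inv(S_bb)
    A_mean = S_yb @ S_bb_inv                        # posterior mean of A, Eq. (17)
    S_cond = S_yy - S_yb @ S_bb_inv @ S_yb.T        # S_{y|ybar}, used in Eq. (19)
    iw_scale, iw_df = S_cond + S0, Y.shape[1] + n0  # inverse-Wishart posterior parameters
    return A_mean, S_bb, iw_scale, iw_df
```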
We place an inverse-Wishart prior $\text{IW}(S_0, n_0)$ on $\Sigma^{(k)}$. Then,
$$p(\Sigma^{(k)} \mid D^{(k)}) = \text{IW}\!\left(S_{y|\bar{y}}^{(k)} + S_0,\ N_k + n_0\right), \tag{19}$$
where $S_{y|\bar{y}}^{(k)} = S_{yy}^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)} S_{y\bar{y}}^{(k)T}$. When $A$ is simply a vector, the matrix-normal inverse-Wishart prior reduces to the normal inverse-Wishart prior with scale parameter $K$. For the HDP-SLDS, we additionally place an $\text{IW}(R_0, r_0)$ prior on the measurement noise covariance $R$, which is shared between modes. The posterior distribution is given by
$$p(R \mid y_{1:T}, x_{1:T}) = \text{IW}(S_R + R_0,\ T + r_0), \tag{20}$$
where $S_R = \sum_{t=1}^{T} (y_t - Cx_t)(y_t - Cx_t)^T$. Additional details are provided in Appendix I.

4.2 Gibbs Sampler

For the switching VAR($r$) process, our sampler iterates between sampling the mode sequence, $z_{1:T}$, and both the dynamic and sticky HDP-HMM parameters. The sampler for the SLDS is identical to that of a switching VAR(1) process, with the additional step of sampling the state sequence, $x_{1:T}$, and conditioning on the state sequence when resampling dynamic parameters. The resulting Gibbs sampler is described below and further elaborated upon in Appendix II.

Sampling Dynamic Parameters. Conditioned on a sample of the mode sequence, $z_{1:T}$, and the observations, $y_{1:T}$, or state sequence, $x_{1:T}$, we can sample the dynamic parameters $\boldsymbol{\theta} = \{A^{(k)}, \Sigma^{(k)}\}$ from the posterior density described in Sec. 4.1. For the HDP-SLDS, we additionally sample $R$.

Sampling $z_{1:T}$. As shown in [5], the mixing rate of the Gibbs sampler for the HDP-HMM can be dramatically improved by using a truncated approximation to the HDP, such as the weak limit approximation, and jointly sampling the mode sequence using a variant of the forward-backward algorithm. Specifically, we compute backward messages $m_{t+1,t}(z_t) \propto p(y_{t+1:T} \mid z_t, y_{t-r+1:t}, \boldsymbol{\pi}, \boldsymbol{\theta})$ and then recursively sample each $z_t$ conditioned on $z_{t-1}$ from
$$p(z_t \mid z_{t-1}, y_{1:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) \propto p(z_t \mid \pi_{z_{t-1}})\, p(y_t \mid y_{t-r:t-1}, A^{(z_t)}, \Sigma^{(z_t)})\, m_{t+1,t}(z_t), \tag{21}$$
where $p(y_t \mid y_{t-r:t-1}, A^{(z_t)}, \Sigma^{(z_t)}) = \mathcal{N}\!\left(\sum_{i=1}^{r} A_i^{(z_t)} y_{t-i},\ \Sigma^{(z_t)}\right)$. Joint sampling of the mode sequence is especially important when the observations are directly correlated via a dynamical process, since this correlation further slows the mixing rate of the direct assignment sampler of [12]. Note that the approximation of Eq. (11) retains the HDP's nonparametric nature by encouraging the use of fewer than $L$ components while allowing the generation of new components, upper bounded by $L$, as new data are observed.

Sampling $x_{1:T}$ (HDP-SLDS only). Conditioned on the mode sequence $z_{1:T}$ and the set of dynamic parameters $\boldsymbol{\theta}$, our dynamical process simplifies to a time-varying linear dynamical system. We can then block sample $x_{1:T}$ by first running a backward filter to compute $m_{t+1,t}(x_t) \propto p(y_{t+1:T} \mid x_t, z_{t+1:T}, \boldsymbol{\theta})$ and then recursively sampling each $x_t$ conditioned on $x_{t-1}$ from
$$p(x_t \mid x_{t-1}, y_{1:T}, z_{1:T}, \boldsymbol{\theta}) \propto p(x_t \mid x_{t-1}, A^{(z_t)}, \Sigma^{(z_t)})\, p(y_t \mid x_t, R)\, m_{t+1,t}(x_t). \tag{22}$$
The messages are given in information form by $m_{t,t-1}(x_{t-1}) \propto \mathcal{N}^{-1}(x_{t-1}; \vartheta_{t,t-1}, \Lambda_{t,t-1})$, where the information parameters are recursively defined as
$$\vartheta_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} \left(\Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t}\right)^{-1} \left(C^T R^{-1} y_t + \vartheta_{t+1,t}\right) \tag{23}$$
$$\Lambda_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} A^{(z_t)} - A^{(z_t)T} \Sigma^{-(z_t)} \left(\Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t}\right)^{-1} \Sigma^{-(z_t)} A^{(z_t)}.$$
See Appendix II for a more numerically stable version of this recursion.
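As a concrete illustration of the blocked mode-sequence update of Eq. (21), the following sketch implements the backward-message/forward-sampling recursion for a truncated switching VAR(1) model. It is a simplified illustration, not the authors' implementation; in particular, a uniform initial-mode distribution is assumed.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_modes_var1(y, pi, A, Sigma, rng=np.random.default_rng(2)):
    """Jointly sample z_{1:T} for a truncated switching VAR(1) model (cf. Eq. 21).

    y: T x d observations; pi: L x L transition matrix;
    A: list of L (d x d) dynamic matrices; Sigma: list of L (d x d) covariances."""
    T, L = y.shape[0], pi.shape[0]
    # likelihoods lik[t, k] = N(y_t; A_k y_{t-1}, Sigma_k), with y_0 taken as zero
    lik = np.ones((T, L))
    for t in range(T):
        y_prev = y[t - 1] if t > 0 else np.zeros_like(y[0])
        for k in range(L):
            lik[t, k] = multivariate_normal.pdf(y[t], mean=A[k] @ y_prev, cov=Sigma[k])
    # backward messages: msg[t, k] proportional to p(y_{t+1:T} | z_t = k, y_t)
    msg = np.ones((T, L))
    for t in range(T - 2, -1, -1):
        msg[t] = pi @ (lik[t + 1] * msg[t + 1])
        msg[t] /= msg[t].sum()                      # normalize for numerical stability
    # forward sampling of z_t given z_{t-1}
    z = np.zeros(T, dtype=int)
    prob = lik[0] * msg[0]                          # uniform prior over the initial mode
    z[0] = rng.choice(L, p=prob / prob.sum())
    for t in range(1, T):
        prob = pi[z[t - 1]] * lik[t] * msg[t]
        z[t] = rng.choice(L, p=prob / prob.sum())
    return z
```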
Figure 2: (a) Observation sequence (blue, green, red) and associated mode sequence (magenta) for a 5-mode switching VAR(1) process (top), 3-mode switching AR(2) process (middle), and 3-mode SLDS (bottom). The associated 10th, 50th, and 90th normalized Hamming distance quantiles over 100 trials are shown as a function of Gibbs iteration for the (b) HDP-VAR(1)-HMM, (c) HDP-VAR(2)-HMM, (d) HDP-SLDS with $C = I$ (top and bottom) and $C = [1\ 0]$ (middle), and (e) sticky HDP-HMM using first-difference observations.

5 Results

Synthetic Data. In Fig. 2, we compare the performance of the HDP-VAR(1)-HMM, HDP-VAR(2)-HMM, HDP-SLDS, and a baseline sticky HDP-HMM on three sets of test data (see Fig. 2(a)). The Hamming distance error is calculated by greedily mapping the indices of the estimated mode sequence to those maximizing overlap with the true sequence. For the first scenario, the data were generated from a 5-mode switching VAR(1) process. The three switching linear dynamical models provide comparable performance since both the HDP-VAR(2)-HMM and the HDP-SLDS with $C = I$ contain the class of HDP-VAR(1)-HMMs. Note that the HDP-SLDS sampler is slower to mix since the hidden, three-dimensional continuous state is also sampled. In the second scenario, the data were generated from a 3-mode switching AR(2) process. The HDP-AR(2)-HMM has significantly better performance than the HDP-AR(1)-HMM, while the performance of the HDP-SLDS with $C = [1\ 0]$ is comparable after burn-in. As shown in Sec. 2, this HDP-SLDS model encompasses the class of HDP-AR(2)-HMMs. The data in the third scenario were generated from a 3-mode SLDS model with $C = I$. Here, we clearly see that neither the HDP-VAR(1)-HMM nor the HDP-VAR(2)-HMM is equivalent to the HDP-SLDS. Together, these results demonstrate both the differences between our models and the models' ability to learn switching processes with varying numbers of modes.

Finally, note that all of the switching models yielded significant improvements relative to the baseline sticky HDP-HMM, even when the latter was given first differences of the observations. This input representation, which is equivalent to an HDP-VAR(1)-HMM with random-walk dynamics ($A^{(k)} = I$ for all $k$), is more effective than using raw observations for HDP-HMM learning, but still much less effective than richer models which switch among learned LDSs.
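For reference, one plausible implementation of the greedy label-mapping Hamming error described above is sketched below. The exact mapping and tie-breaking rules used for the reported results are not spelled out in the text, so this is only an illustration.

```python
import numpy as np

def greedy_hamming(z_est, z_true):
    """Greedily map estimated mode labels to true labels, then return the
    normalized Hamming distance between the relabeled estimate and the truth."""
    z_est, z_true = np.asarray(z_est), np.asarray(z_true)
    unused_true = set(np.unique(z_true))
    relabeled = -np.ones_like(z_est)
    # assign estimated labels in order of decreasing frequency
    for k in sorted(np.unique(z_est), key=lambda k: -(z_est == k).sum()):
        counts = {j: np.sum((z_est == k) & (z_true == j)) for j in unused_true}
        if counts:
            best = max(counts, key=counts.get)   # true label with maximal overlap
            relabeled[z_est == k] = best
            unused_true.discard(best)
    return np.mean(relabeled != z_true)
```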
IBOVESPA Stock Index. We test the HDP-SLDS model on the IBOVESPA stock index (Sao Paulo Stock Exchange) over the period 01/03/1997 to 01/16/2001. There are ten key world events shown in Fig. 3 and cited in [4] as affecting the emerging Brazilian market during this time period. In [4], a 2-mode Markov switching stochastic volatility (MSSV) model is used to identify periods of higher volatility in the daily returns. The MSSV assumes that the log-volatilities follow an AR(1) process with a Markov-switching mean. This underlying process is observed via conditionally independent and normally distributed daily returns. The HDP-SLDS is able to infer very similar change points to those presented in [4]. Interestingly, the HDP-SLDS consistently identifies three regimes of volatility, versus the assumed 2-mode model.

Figure 3: (a) IBOVESPA stock index daily returns from 01/03/1997 to 01/16/2001. (b) Plot of the estimated probability of a change point on each day using 3000 Gibbs samples for the HDP-SLDS. The 10 key events are indicated with red lines. (c) Similar plot for the non-sticky HDP-SLDS with no bias towards self-transitions. (d) ROC curves for the HDP-SLDS, non-sticky HDP-SLDS, HDP-AR(1)-HMM, and HDP-AR(2)-HMM.

In Fig. 3, the overall performance of the HDP-SLDS is compared to that of the HDP-AR(1)-HMM, HDP-AR(2)-HMM, and HDP-SLDS with no bias for self-transitions (i.e., $\kappa = 0$). The ROC curves shown in Fig. 3(d) are calculated by windowing the time axis and taking the maximum probability of a change point in each window. These probabilities are then used as the confidence of a change point in that window. We clearly see the advantage of using an SLDS model combined with the sticky HDP-HMM prior on the mode sequence. Without the sticky extension, the HDP-SLDS over-segments the data and rapidly switches between redundant states, which leads to a dramatically larger number of inferred change points.

Dancing Honey Bees. We test the HDP-VAR(1)-HMM on a set of six dancing honey bee sequences, aiming to segment the sequences into the three dances displayed in Fig. 4. (We did not see performance gains by considering the HDP-SLDS, so we omit results for that architecture.) The data consist of measurements $y_t = [\cos(\theta_t)\ \sin(\theta_t)\ x_t\ y_t]^T$, where $(x_t, y_t)$ denotes the 2D coordinates of the bee's body and $\theta_t$ its head angle. We compare our results to those of Xuan and Murphy [14], who used a change-point detection technique for inference on this dataset. As shown in Fig. 4(d)-(e), our model achieves a superior segmentation compared to the change-point formulation in almost all cases, while also identifying modes which reoccur over time.

Oh et al. [8] also presented an analysis of the honey bee data, using an SLDS with a fixed number of modes. Unfortunately, that analysis is not directly comparable to ours, because [8] used their SLDS in a supervised formulation in which the ground-truth labels for all but one of the sequences are employed in the inference of the labels for the remaining held-out sequence, and in which the kernels used in the MCMC procedure depend on the ground-truth labels.
(The authors also considered a "parameterized segmental SLDS (PS-SLDS)," which makes use of domain knowledge specific to honey bee dancing and requires additional supervision during the learning process.) Nonetheless, in Table 1 we report the performance of these methods as well as the median performance (over 100 trials) of the unsupervised HDP-VAR(1)-HMM to provide a sense of the level of performance achievable without detailed, manual supervision. As seen in Table 1, the HDP-VAR(1)-HMM yields very good performance on sequences 4 to 6 in terms of the learned segmentation and number of modes (see Fig. 4(a)-(c)); the performance approaches that of the supervised method. For sequences 1 to 3, which are much less regular than sequences 4 to 6, the performance of the unsupervised procedure is substantially worse. This motivated us to also consider a partially supervised variant of the HDP-VAR(1)-HMM in which we fix the ground-truth mode sequences for five out of the six sequences, and jointly infer both a combined set of dynamic parameters and the left-out mode sequence. As we see in Table 1, this considerably improved performance for these three sequences.

Not depicted in the plots in Fig. 4 is the extreme variation in head angle during the waggle dances of sequences 1 to 3. This dramatically affects our performance since we do not use domain-specific information. Indeed, our learned segmentations consistently identify turn-right and turn-left modes, but often create a new, sequence-specific waggle dance mode. Many of our errors can be attributed to creating multiple waggle dance modes within a sequence. Overall, however, we are able to achieve reasonably good segmentations without having to manually input domain-specific knowledge.

Figure 4: (top) Trajectories of the dancing honey bees for sequences 1 to 6, colored by waggle (red), turn right (blue), and turn left (green) dances. (a)-(c) Estimated mode sequences representing the median error for sequences 4, 5, and 6 at the 200th Gibbs iteration, with errors indicated in red. (d)-(e) ROC curves for the unsupervised HDP-VAR-HMM, partially supervised HDP-VAR-HMM, and change-point formulation of [14] using the Viterbi sequence for segmenting datasets 1-3 and 4-6, respectively.

Table 1: Median label accuracy of the HDP-VAR(1)-HMM using unsupervised and partially supervised Gibbs sampling, compared to the accuracy of the supervised PS-SLDS and SLDS procedures, where the latter algorithms were based on a supervised MCMC procedure (DD-MCMC) [8].

  Sequence                               1     2     3     4     5     6
  HDP-VAR(1)-HMM, unsupervised          46.5  44.1  45.6  83.2  93.2  88.7
  HDP-VAR(1)-HMM, partially supervised  65.9  88.5  79.2  86.9  92.3  89.1
  SLDS DD-MCMC                          74.0  86.1  81.3  93.4  90.2  90.4
  PS-SLDS DD-MCMC                       75.9  92.4  83.1  93.4  90.4  91.0

6 Discussion

In this paper, we have addressed the problem of learning switching linear dynamical models with an unknown number of modes for describing complex dynamical phenomena.
We presented a nonparametric Bayesian approach and demonstrated both the utility and versatility of the developed HDP-SLDS and HDP-AR-HMM on real applications. Using the same parameter settings, in one case we are able to learn changes in the volatility of the IBOVESPA stock exchange, while in another we learn segmentations of data into waggle, turn-right, and turn-left honey bee dances. An interesting direction for future research is learning models of varying order for each mode.

References
[1] M. Aoki and A. Havenner. State space modeling of multiple time series. Econ. Rev., 10(1):1-59, 1991.
[2] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.
[3] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe. Bayesian inference for dynamic models with Dirichlet process mixtures. In Int. Conf. Inf. Fusion, July 2006.
[4] C. Carvalho and H. Lopes. Simulation-based sequential analysis of Markov switching stochastic volatility models. Comp. Stat. & Data Anal., 2006.
[5] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state persistence. In ICML, 2008.
[6] E. B. Fox, E. B. Sudderth, and A. S. Willsky. Hierarchical Dirichlet processes for tracking maneuvering targets. In Int. Conf. Inf. Fusion, July 2007.
[7] H. Ishwaran and M. Zarepour. Exact and approximate sum representations for the Dirichlet process. Can. J. Stat., 30:269-283, 2002.
[8] S. Oh, J. Rehg, T. Balch, and F. Dellaert. Learning and inferring motion patterns using parametric segmental switching linear dynamic systems. IJCV, 77(1-3):103-124, 2008.
[9] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In NIPS, 2000.
[10] X. Rong Li and V. Jilkov. Survey of maneuvering target tracking. Part V: Multiple-model methods. IEEE Trans. Aerosp. Electron. Syst., 41(4):1255-1321, 2005.
[11] J. Sethuraman. A constructive definition of Dirichlet priors. Stat. Sinica, 4:639-650, 1994.
[12] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Stat. Assoc., 101(476):1566-1581, 2006.
[13] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1997.
[14] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In ICML, 2007.

APPENDIX I: DYNAMIC PARAMETER POSTERIOR

In this appendix, we derive the posterior distribution over the dynamic parameters of a switching VAR($r$) process defined as follows:
$$y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t) \qquad e_t(z_t) \sim \mathcal{N}(0, \Sigma^{(z_t)}), \tag{1}$$
where $z_t$ indexes the mode-specific VAR($r$) process at time $t$. Assume that the mode sequence $\{z_1, \ldots, z_T\}$ is known and that we wish to compute the posterior distribution over the $k$th mode's VAR($r$) parameters $A_i^{(k)}$, $i = 1, \ldots, r$, and $\Sigma^{(k)}$. Let $\{t_1, \ldots, t_{N_k}\} = \{t \mid z_t = k\}$. Then we may write
$$\begin{bmatrix} y_{t_1} & y_{t_2} & \cdots & y_{t_{N_k}} \end{bmatrix} = \begin{bmatrix} A_1^{(k)} & A_2^{(k)} & \cdots & A_r^{(k)} \end{bmatrix} \begin{bmatrix} y_{t_1-1} & y_{t_2-1} & \cdots & y_{t_{N_k}-1} \\ y_{t_1-2} & y_{t_2-2} & \cdots & y_{t_{N_k}-2} \\ \vdots & \vdots & & \vdots \\ y_{t_1-r} & y_{t_2-r} & \cdots & y_{t_{N_k}-r} \end{bmatrix} + \begin{bmatrix} e_{t_1} & e_{t_2} & \cdots & e_{t_{N_k}} \end{bmatrix}. \tag{2}$$
We define the following notation for Eq. (2):
$$Y^{(k)} = A^{(k)} \bar{Y}^{(k)} + E^{(k)}. \tag{3}$$
Let $D^{(k)} = \{Y^{(k)}, \bar{Y}^{(k)}\}$. We place a matrix-normal inverse-Wishart prior on the dynamic parameters $\{A^{(k)}, \Sigma^{(k)}\}$ and show that the posterior remains matrix-normal inverse-Wishart.
The matrix-normal inverse-Wishart prior is given by placing a matrix-normal prior $\mathcal{MN}(A^{(k)}; M, \Sigma^{(k)}, K)$ on $A^{(k)}$ given $\Sigma^{(k)}$:
$$p(A^{(k)} \mid \Sigma^{(k)}) = \frac{|K|^{d/2}}{|2\pi\Sigma^{(k)}|^{m/2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left((A^{(k)} - M)^T \Sigma^{-(k)} (A^{(k)} - M) K\right)\right\}, \tag{4}$$
with $m = r \cdot d$ for the VAR($r$) parameters, and an inverse-Wishart prior $\text{IW}(S_0, n)$ on $\Sigma^{(k)}$:
$$p(\Sigma^{(k)}) = \frac{|S_0|^{n/2}}{2^{nd/2}\,\Gamma_d(n/2)}\, |\Sigma^{(k)}|^{-(d+n+1)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)} S_0\right)\right\}, \tag{5}$$
where $\Gamma_d(n/2)$ is the multivariate gamma function and $B^{-(k)}$ denotes $(B^{(k)})^{-1}$ for a matrix $B$.

We first analyze the likelihood of the data, $D^{(k)}$, given the $k$th mode's dynamic parameters, $\{A^{(k)}, \Sigma^{(k)}\}$. Since each observation vector $y_t$ is conditionally Gaussian given the lagged observations $\bar{y}_t = [y_{t-1}^T \cdots y_{t-r}^T]^T$, we have
$$p(D^{(k)} \mid A^{(k)}, \Sigma^{(k)}) = \frac{1}{|2\pi\Sigma^{(k)}|^{N_k/2}} \exp\left\{-\tfrac{1}{2}\sum_{i=1}^{N_k} (y_{t_i} - A^{(k)}\bar{y}_{t_i})^T \Sigma^{-(k)} (y_{t_i} - A^{(k)}\bar{y}_{t_i})\right\}$$
$$= \frac{1}{|2\pi\Sigma^{(k)}|^{N_k/2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)}\,(Y^{(k)} - A^{(k)}\bar{Y}^{(k)})(Y^{(k)} - A^{(k)}\bar{Y}^{(k)})^T\right)\right\} \propto \mathcal{MN}\!\left(Y^{(k)};\ A^{(k)}\bar{Y}^{(k)},\ \Sigma^{(k)},\ I\right). \tag{6}$$
To derive the posterior on the dynamic parameters, it is useful to first compute
$$p(D^{(k)}, A^{(k)} \mid \Sigma^{(k)}) = p(D^{(k)} \mid A^{(k)}, \Sigma^{(k)})\, p(A^{(k)} \mid \Sigma^{(k)}). \tag{7}$$
Using the fact that both the likelihood term $p(D^{(k)} \mid A^{(k)}, \Sigma^{(k)})$ and the prior $p(A^{(k)} \mid \Sigma^{(k)})$ are matrix-normally distributed with a shared parameter $\Sigma^{(k)}$, we have
$$\log p(D^{(k)}, A^{(k)} \mid \Sigma^{(k)}) + C = -\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)}\left\{(Y^{(k)} - A^{(k)}\bar{Y}^{(k)})(Y^{(k)} - A^{(k)}\bar{Y}^{(k)})^T + (A^{(k)} - M)K(A^{(k)} - M)^T\right\}\right)$$
$$= -\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)}\left\{A^{(k)} S_{\bar{y}\bar{y}}^{(k)} A^{(k)T} - 2 S_{y\bar{y}}^{(k)} A^{(k)T} + S_{yy}^{(k)}\right\}\right)$$
$$= -\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)}\left\{\left(A^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)}\right) S_{\bar{y}\bar{y}}^{(k)} \left(A^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)}\right)^T + S_{y|\bar{y}}^{(k)}\right\}\right), \tag{8}$$
where $C = -\log\!\left(|K|^{d/2} \big/ \left(|2\pi\Sigma^{(k)}|^{N_k/2}\, |2\pi\Sigma^{(k)}|^{m/2}\right)\right)$ collects the normalization terms of the likelihood and prior, $S_{y|\bar{y}}^{(k)} = S_{yy}^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)} S_{y\bar{y}}^{(k)T}$, and
$$S_{\bar{y}\bar{y}}^{(k)} = \bar{Y}^{(k)}\bar{Y}^{(k)T} + K \qquad S_{y\bar{y}}^{(k)} = Y^{(k)}\bar{Y}^{(k)T} + MK \qquad S_{yy}^{(k)} = Y^{(k)}Y^{(k)T} + MKM^T.$$
Conditioning on the noise covariance $\Sigma^{(k)}$, we see that the dynamic matrix posterior is given by
$$p(A^{(k)} \mid D^{(k)}, \Sigma^{(k)}) \propto \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left(\left(A^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)}\right)^T \Sigma^{-(k)} \left(A^{(k)} - S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)}\right) S_{\bar{y}\bar{y}}^{(k)}\right)\right\} = \mathcal{MN}\!\left(A^{(k)};\ S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)},\ \Sigma^{(k)},\ S_{\bar{y}\bar{y}}^{(k)}\right). \tag{9}$$
Marginalizing Eq. (8) over the dynamic matrix $A^{(k)}$, and noting that the remaining integral over $A^{(k)}$ is that of a matrix-normal density with column precision $S_{\bar{y}\bar{y}}^{(k)}$, we derive
$$p(D^{(k)} \mid \Sigma^{(k)}) = \int p(D^{(k)}, A^{(k)} \mid \Sigma^{(k)})\, dA^{(k)} = \frac{|K|^{d/2}}{|2\pi\Sigma^{(k)}|^{N_k/2}\, |S_{\bar{y}\bar{y}}^{(k)}|^{d/2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)} S_{y|\bar{y}}^{(k)}\right)\right\}. \tag{10}$$
Using the above, the posterior of the covariance parameter is computed as
$$p(\Sigma^{(k)} \mid D^{(k)}) \propto p(D^{(k)} \mid \Sigma^{(k)})\, p(\Sigma^{(k)}) \propto |\Sigma^{(k)}|^{-(d+N_k+n+1)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-(k)}\left(S_{y|\bar{y}}^{(k)} + S_0\right)\right)\right\} \propto \text{IW}\!\left(S_{y|\bar{y}}^{(k)} + S_0,\ N_k + n\right). \tag{11}$$
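Given the posterior parameters above, drawing $\{\Sigma^{(k)}, A^{(k)}\}$ is mechanical. The following illustrative sketch (not the authors' code) uses SciPy's inverse-Wishart sampler and the Kronecker vectorization identity quoted in Sec. 4.1.

```python
import numpy as np
from scipy.stats import invwishart

def sample_mniw_posterior(A_mean, S_bb, iw_scale, iw_df, rng=np.random.default_rng(3)):
    """Draw (Sigma, A) from the matrix-normal inverse-Wishart posterior (Eqs. 9, 11)."""
    d, m = A_mean.shape
    Sigma = np.atleast_2d(invwishart.rvs(df=iw_df, scale=iw_scale, random_state=rng))
    # vec(A) ~ N(vec(A_mean), S_bb^{-1} (x) Sigma), column-major vectorization
    cov = np.kron(np.linalg.inv(S_bb), Sigma)
    vecA = rng.multivariate_normal(A_mean.flatten(order='F'), cov)
    A = vecA.reshape((d, m), order='F')
    return Sigma, A
```

The Kronecker-covariance draw is convenient for small dimensions; a more efficient alternative would sample the matrix-normal via Cholesky factors of the row and column covariances.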
APPENDIX II: MESSAGE PASSING

In this appendix, we describe the backward message passing and forward sampling scheme used for generating samples of the mode sequence $z_{1:T}$ and the state sequence $x_{1:T}$.

A. Mode Sequence Message Passing

Consider a switching VAR($r$) process. To derive the forward-backward procedure for jointly sampling the mode sequence $z_{1:T}$ given observations $y_{1:T}$, plus $r$ initial observations $y_{1-r:0}$, we first note that the chain rule and the Markov structure allow us to decompose the joint distribution as follows:
$$p(z_{1:T} \mid y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) = p(z_T \mid z_{T-1}, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta})\, p(z_{T-1} \mid z_{T-2}, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) \cdots p(z_2 \mid z_1, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta})\, p(z_1 \mid y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}).$$
Thus, we may first sample $z_1$ from $p(z_1 \mid y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta})$, then condition on this value to sample $z_2$ from $p(z_2 \mid z_1, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta})$, and so on. The conditional distribution of $z_1$ is derived as
$$p(z_1 \mid y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) \propto p(z_1)\, p(y_1 \mid z_1, y_{1-r:0}) \sum_{z_{2:T}} \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(y_t \mid z_t, y_{t-r:t-1})$$
$$= p(z_1)\, p(y_1 \mid z_1, y_{1-r:0}) \sum_{z_2} p(z_2 \mid z_1)\, p(y_2 \mid z_2, y_{2-r:1})\, m_{3,2}(z_2) = p(z_1)\, p(y_1 \mid z_1, y_{1-r:0})\, m_{2,1}(z_1), \tag{12}$$
where $m_{t,t-1}(z_{t-1})$ is the backward message passed from $z_t$ to $z_{t-1}$, recursively defined by
$$m_{t,t-1}(z_{t-1}) \propto \begin{cases} \sum_{z_t} p(z_t \mid z_{t-1})\, p(y_t \mid z_t, y_{t-r:t-1})\, m_{t+1,t}(z_t), & t \le T; \\ 1, & t = T + 1. \end{cases} \tag{13}$$
The general conditional distribution of $z_t$ is
$$p(z_t \mid z_{t-1}, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) \propto p(z_t \mid z_{t-1})\, p(y_t \mid z_t, y_{t-r:t-1})\, m_{t+1,t}(z_t). \tag{14}$$
For the HDP-AR-HMM, these distributions are given by
$$p(z_t = k \mid z_{t-1}, y_{1-r:T}, \boldsymbol{\pi}, \boldsymbol{\theta}) \propto \pi_{z_{t-1}}(k)\, \mathcal{N}\!\left(y_t;\ \sum_{i=1}^{r} A_i^{(k)} y_{t-i},\ \Sigma^{(k)}\right) m_{t+1,t}(k) \tag{15}$$
$$m_{t+1,t}(k) = \sum_{j=1}^{L} \pi_k(j)\, \mathcal{N}\!\left(y_{t+1};\ \sum_{i=1}^{r} A_i^{(j)} y_{t+1-i},\ \Sigma^{(j)}\right) m_{t+2,t+1}(j), \qquad k = 1, \ldots, L, \tag{16}$$
$$m_{T+1,T}(k) = 1. \tag{17}$$

B. State Sequence Message Passing

A similar sampling scheme is used for generating samples of the state sequence $x_{1:T}$. Although we now have a continuous state space, the computation of the backward messages $m_{t+1,t}(x_t)$ is still analytically feasible since we are working with Gaussian densities. Assume $m_{t+1,t}(x_t) \propto \mathcal{N}^{-1}(x_t; \vartheta_{t+1,t}, \Lambda_{t+1,t})$, where $\mathcal{N}^{-1}(x; \vartheta, \Lambda)$ denotes a Gaussian distribution on $x$ in information form with mean $\mu = \Lambda^{-1}\vartheta$ and covariance $\Sigma = \Lambda^{-1}$. The backward messages for the HDP-SLDS can be recursively defined by
$$m_{t,t-1}(x_{t-1}) \propto \int_{x_t} p(x_t \mid x_{t-1}, z_t)\, p(y_t \mid x_t)\, m_{t+1,t}(x_t)\, dx_t. \tag{18}$$
For this model, the densities of Eq. (18) can be expressed as
$$p(x_t \mid x_{t-1}, z_t) \propto \exp\left\{-\tfrac{1}{2}\left(x_t - A^{(z_t)}x_{t-1}\right)^T \Sigma^{-(z_t)} \left(x_t - A^{(z_t)}x_{t-1}\right)\right\} = \exp\left\{-\tfrac{1}{2}\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} A^{(z_t)T}\Sigma^{-(z_t)}A^{(z_t)} & -A^{(z_t)T}\Sigma^{-(z_t)} \\ -\Sigma^{-(z_t)}A^{(z_t)} & \Sigma^{-(z_t)} \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}\right\}$$
$$p(y_t \mid x_t) \propto \exp\left\{-\tfrac{1}{2}\left(y_t - Cx_t\right)^T R^{-1} \left(y_t - Cx_t\right)\right\} \propto \exp\left\{-\tfrac{1}{2}\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 & 0 \\ 0 & C^T R^{-1} C \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ C^T R^{-1} y_t \end{bmatrix}\right\}$$
$$m_{t+1,t}(x_t) \propto \exp\left\{-\tfrac{1}{2} x_t^T \Lambda_{t+1,t}\, x_t + x_t^T \vartheta_{t+1,t}\right\} = \exp\left\{-\tfrac{1}{2}\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 & 0 \\ 0 & \Lambda_{t+1,t} \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ \vartheta_{t+1,t} \end{bmatrix}\right\}.$$
The product of these quadratics is given by
$$p(x_t \mid x_{t-1}, z_t)\, p(y_t \mid x_t)\, m_{t+1,t}(x_t) \propto \exp\left\{-\tfrac{1}{2}\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} A^{(z_t)T}\Sigma^{-(z_t)}A^{(z_t)} & -A^{(z_t)T}\Sigma^{-(z_t)} \\ -\Sigma^{-(z_t)}A^{(z_t)} & \Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t} \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ C^T R^{-1} y_t + \vartheta_{t+1,t} \end{bmatrix}\right\}.$$
Using standard Gaussian marginalization identities, we integrate over $x_t$ to get $m_{t,t-1}(x_{t-1}) \propto \mathcal{N}^{-1}(x_{t-1}; \vartheta_{t,t-1}, \Lambda_{t,t-1})$, where
$$\vartheta_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} \left(\Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t}\right)^{-1} \left(C^T R^{-1} y_t + \vartheta_{t+1,t}\right)$$
$$\Lambda_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} A^{(z_t)} - A^{(z_t)T} \Sigma^{-(z_t)} \left(\Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t}\right)^{-1} \Sigma^{-(z_t)} A^{(z_t)}.$$
This backward message passing recursion is initialized at time $T$ with $m_{T+1,T}(x_T) \propto \mathcal{N}^{-1}(x_T; 0, 0)$.
Let
$$\Lambda^b_{t|t} = C^T R^{-1} C + \Lambda_{t+1,t} \qquad \vartheta^b_{t|t} = C^T R^{-1} y_t + \vartheta_{t+1,t}.$$
Then we can define the following recursion, which we note is equivalent to the backward-running Kalman filter in information form:
$$\Lambda^b_{t-1|t-1} = C^T R^{-1} C + A^{(z_t)T}\Sigma^{-(z_t)}A^{(z_t)} - A^{(z_t)T}\Sigma^{-(z_t)}\left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\Sigma^{-(z_t)}A^{(z_t)}$$
$$\vartheta^b_{t-1|t-1} = C^T R^{-1} y_{t-1} + A^{(z_t)T}\Sigma^{-(z_t)}\left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\vartheta^b_{t|t}.$$
We initialize at time $T$ with
$$\Lambda^b_{T|T} = C^T R^{-1} C \qquad \vartheta^b_{T|T} = C^T R^{-1} y_T.$$
An equivalent, but more numerically stable, recursion is summarized in Algorithm 1.

Algorithm 1: Numerically stable form of the backward Kalman information filter.
1) Initialize the filter with
$$\Lambda^b_{T|T} = C^T R^{-1} C \qquad \vartheta^b_{T|T} = C^T R^{-1} y_T.$$
2) Working backwards in time, for each $t \in \{T-1, \ldots, 1\}$:
   a) Compute
   $$\tilde{J}_{t+1} = \Lambda^b_{t+1|t+1}\left(\Lambda^b_{t+1|t+1} + \Sigma^{-(z_{t+1})}\right)^{-1} \qquad \tilde{L}_{t+1} = I - \tilde{J}_{t+1}.$$
   b) Predict
   $$\Lambda_{t+1,t} = A^{(z_{t+1})T}\left(\tilde{L}_{t+1}\Lambda^b_{t+1|t+1}\tilde{L}^T_{t+1} + \tilde{J}_{t+1}\Sigma^{-(z_{t+1})}\tilde{J}^T_{t+1}\right)A^{(z_{t+1})} \qquad \vartheta_{t+1,t} = A^{(z_{t+1})T}\tilde{L}_{t+1}\vartheta^b_{t+1|t+1}.$$
   c) Update
   $$\Lambda^b_{t|t} = \Lambda_{t+1,t} + C^T R^{-1} C \qquad \vartheta^b_{t|t} = \vartheta_{t+1,t} + C^T R^{-1} y_t.$$

After computing the messages $m_{t+1,t}(x_t)$ backwards in time, we sample the state sequence $x_{1:T}$ working forwards in time. As with the discrete mode sequence, one can decompose the posterior distribution of the state sequence as
$$p(x_{1:T} \mid y_{1:T}, z_{1:T}, \boldsymbol{\theta}) = p(x_T \mid x_{T-1}, y_{1:T}, z_{1:T}, \boldsymbol{\theta})\, p(x_{T-1} \mid x_{T-2}, y_{1:T}, z_{1:T}, \boldsymbol{\theta}) \cdots p(x_2 \mid x_1, y_{1:T}, z_{1:T}, \boldsymbol{\theta})\, p(x_1 \mid y_{1:T}, z_{1:T}, \boldsymbol{\theta}),$$
where
$$p(x_t \mid x_{t-1}, y_{1:T}, z_{1:T}, \boldsymbol{\theta}) \propto p(x_t \mid x_{t-1}, A^{(z_t)}, \Sigma^{(z_t)})\, p(y_t \mid x_t, R)\, m_{t+1,t}(x_t).$$
For the HDP-SLDS, the product of these distributions is equivalent to
$$p(x_t \mid x_{t-1}, y_{1:T}, z_{1:T}, \boldsymbol{\theta}) \propto \mathcal{N}\!\left(x_t;\ A^{(z_t)}x_{t-1},\ \Sigma^{(z_t)}\right)\, \mathcal{N}\!\left(y_t;\ Cx_t,\ R\right)\, m_{t+1,t}(x_t)$$
$$\propto \mathcal{N}\!\left(x_t;\ A^{(z_t)}x_{t-1},\ \Sigma^{(z_t)}\right)\, \mathcal{N}^{-1}\!\left(x_t;\ \vartheta^b_{t|t},\ \Lambda^b_{t|t}\right) \propto \mathcal{N}^{-1}\!\left(x_t;\ \Sigma^{-(z_t)}A^{(z_t)}x_{t-1} + \vartheta^b_{t|t},\ \Sigma^{-(z_t)} + \Lambda^b_{t|t}\right), \tag{19}$$
which is a simple Gaussian distribution, so the normalization constant is easily computed. Specifically, for each $t \in \{1, \ldots, T\}$ we sample $x_t$ from
$$x_t \sim \mathcal{N}\!\left(x_t;\ \left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\left(\Sigma^{-(z_t)}A^{(z_t)}x_{t-1} + \vartheta^b_{t|t}\right),\ \left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\right). \tag{20}$$
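Algorithm 1 and the forward draw of Eq. (20) translate nearly line for line into code. The sketch below is an illustration for a fixed mode sequence (not the authors' implementation); the initial state $x_0$ is set to zero for simplicity.

```python
import numpy as np

def backward_info_filter(y, z, A, Sigma_inv, C, R_inv):
    """Numerically stable backward Kalman information filter (Algorithm 1).

    Returns Lam_b[t] = Lambda^b_{t|t} and th_b[t] = theta^b_{t|t}."""
    T, n = len(y), A[0].shape[0]
    Lam_b = np.zeros((T, n, n)); th_b = np.zeros((T, n))
    Lam_b[T - 1] = C.T @ R_inv @ C
    th_b[T - 1] = C.T @ R_inv @ y[T - 1]
    for t in range(T - 2, -1, -1):
        k = z[t + 1]
        J = Lam_b[t + 1] @ np.linalg.inv(Lam_b[t + 1] + Sigma_inv[k])
        L = np.eye(n) - J
        Lam_pred = A[k].T @ (L @ Lam_b[t + 1] @ L.T + J @ Sigma_inv[k] @ J.T) @ A[k]
        th_pred = A[k].T @ L @ th_b[t + 1]
        Lam_b[t] = Lam_pred + C.T @ R_inv @ C          # update step
        th_b[t] = th_pred + C.T @ R_inv @ y[t]
    return Lam_b, th_b

def sample_states(y, z, A, Sigma_inv, C, R_inv, rng=np.random.default_rng(4)):
    """Forward draw of x_{1:T} given the backward messages (Eq. 20)."""
    Lam_b, th_b = backward_info_filter(y, z, A, Sigma_inv, C, R_inv)
    T, n = len(y), A[0].shape[0]
    x = np.zeros((T, n))
    x_prev = np.zeros(n)                               # x_0 = 0 for simplicity
    for t in range(T):
        k = z[t]
        Lam = Sigma_inv[k] + Lam_b[t]
        th = Sigma_inv[k] @ A[k] @ x_prev + th_b[t]
        cov = np.linalg.inv(Lam)
        x[t] = rng.multivariate_normal(cov @ th, cov)
        x_prev = x[t]
    return x
```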
Algorithm 2: HDP-SLDS and HDP-AR-HMM Gibbs sampler.

Given a previous set of mode-specific transition probabilities $\pi^{(n-1)}$, the global transition distribution $\beta^{(n-1)}$, the dynamic parameters $\theta^{(n-1)}$, and pseudo-observations $\tilde{y}_{1:T}^{(n-1)}$:

1) Set $\pi = \pi^{(n-1)}$, $\{A^{(k)}, \Sigma^{(k)}\} = \{A^{(k)}, \Sigma^{(k)}\}^{(n-1)}$, and $\tilde{y}_{1:T} = \tilde{y}_{1:T}^{(n-1)}$.

2) Calculate the messages $m_{t,t-1}(k)$ and sample the mode sequence $z_{1:T}$:
   a) For each $k \in \{1, \ldots, L\}$, initialize the messages to $m_{T+1,T}(k) = 1$.
   b) For each $t \in \{T, \ldots, 1\}$ and $k \in \{1, \ldots, L\}$, compute
   $$m_{t,t-1}(k) = \sum_{j=1}^{L} \pi_k(j)\, \mathcal{N}\!\left(\tilde{y}_t;\ \sum_{i=1}^{r} A_i^{(j)} \tilde{y}_{t-i},\ \Sigma^{(j)}\right) m_{t+1,t}(j).$$
   c) Working sequentially forward in time, starting with transition counts $n_{jk} = 0$ for each $(j, k)$:
      i) For each $k \in \{1, \ldots, L\}$, compute the probability
      $$f_k(\tilde{y}_t) = \pi_{z_{t-1}}(k)\, \mathcal{N}\!\left(\tilde{y}_t;\ \sum_{i=1}^{r} A_i^{(k)} \tilde{y}_{t-i},\ \Sigma^{(k)}\right) m_{t+1,t}(k).$$
      ii) Sample a mode assignment $z_t$ as follows and increment $n_{z_{t-1} z_t}$:
      $$z_t \sim \sum_{k=1}^{L} f_k(\tilde{y}_t)\, \delta(z_t, k).$$

3) If HDP-AR-HMM, set the pseudo-observations $\tilde{y}_{1:T} = y_{1:T}$.

4) If HDP-SLDS, calculate the messages $m_{t,t-1}(x_{t-1})$ and sample the state sequence $x_{1:T}$:
   a) Initialize the messages to $m_{T+1,T}(x_T) = \mathcal{N}^{-1}(x_T; 0, 0)$.
   b) For each $t \in \{T, \ldots, 1\}$, recursively compute $\{\vartheta^b_{t|t}, \Lambda^b_{t|t}\}$ as in Algorithm 1.
   c) Working sequentially forward in time, sample
   $$x_t \sim \mathcal{N}\!\left(x_t;\ \left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\left(\Sigma^{-(z_t)}A^{(z_t)}x_{t-1} + \vartheta^b_{t|t}\right),\ \left(\Sigma^{-(z_t)} + \Lambda^b_{t|t}\right)^{-1}\right).$$
   d) Set the pseudo-observations $\tilde{y}_{1:T} = x_{1:T}$.

5) For each $k \in \{1, \ldots, L\}$, compute the sufficient statistics using the pseudo-observations $\tilde{y}_{1:T}$:
   $$S_{\bar{y}\bar{y}}^{(k)} = \bar{Y}^{(k)}\bar{Y}^{(k)T} + K \qquad S_{y\bar{y}}^{(k)} = Y^{(k)}\bar{Y}^{(k)T} + MK \qquad S_{yy}^{(k)} = Y^{(k)}Y^{(k)T} + MKM^T.$$

6) Sample the auxiliary variables $m$, $w$, and $\bar{m}$ and then the hyperparameters $\gamma$, $\alpha$, and $\kappa$ as in [5], [12].

7) Update the global transition distribution by sampling
   $$\beta \sim \text{Dir}\!\left(\gamma/L + \bar{m}_{\cdot 1}, \ldots, \gamma/L + \bar{m}_{\cdot L}\right).$$

8) For each $k \in \{1, \ldots, L\}$, sample a new transition distribution and dynamic parameters based on the sampled mode assignments and the sufficient statistics of the pseudo-observations:
   $$\pi_k \sim \text{Dir}\!\left(\alpha\beta_1 + n_{k1}, \ldots, \alpha\beta_k + \kappa + n_{kk}, \ldots, \alpha\beta_L + n_{kL}\right)$$
   $$\Sigma^{(k)} \sim \text{IW}\!\left(S_{y|\bar{y}}^{(k)} + S_0,\ N_k + n_0\right) \qquad A^{(k)} \mid \Sigma^{(k)} \sim \mathcal{MN}\!\left(A^{(k)};\ S_{y\bar{y}}^{(k)} S_{\bar{y}\bar{y}}^{-(k)},\ \Sigma^{(k)},\ S_{\bar{y}\bar{y}}^{(k)}\right).$$
   If HDP-SLDS, also sample the measurement noise covariance
   $$R \sim \text{IW}\!\left(\sum_{t=1}^{T} (y_t - Cx_t)(y_t - Cx_t)^T + R_0,\ T + r_0\right).$$

9) Fix $\pi^{(n)} = \pi$, $\beta^{(n)} = \beta$, $\theta^{(n)} = \theta$, and $\tilde{y}_{1:T}^{(n)} = \tilde{y}_{1:T}$.
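To show how the pieces fit together, here is a hypothetical outline of one Gibbs sweep that reuses the illustrative helpers sketched earlier (`sample_modes_var1`, `mniw_posterior_params`, `sample_mniw_posterior`, which are our example functions, not the authors' code). It covers steps 2, 5, and the dynamic-parameter part of step 8 of Algorithm 2 for a switching AR(1) model; the transition-distribution and hyperparameter updates are only indicated in a comment.

```python
import numpy as np

def gibbs_iteration_hdp_ar1(y, state, priors, rng=np.random.default_rng(5)):
    """One simplified sweep of an HDP-AR(1)-HMM Gibbs sampler (cf. Algorithm 2)."""
    L = state['pi'].shape[0]
    # Step 2: jointly resample the mode sequence with backward messages.
    z = sample_modes_var1(y, state['pi'], state['A'], state['Sigma'], rng)
    # Steps 5 and 8 (dynamic parameters): per-mode sufficient statistics and MN-IW draws.
    for k in range(L):
        idx = np.where(z[1:] == k)[0] + 1              # times assigned to mode k (skip t = 0)
        Yk, Ybark = y[idx].T, y[idx - 1].T             # d x N_k regression matrices
        A_mean, S_bb, iw_scale, iw_df = mniw_posterior_params(
            Yk, Ybark, priors['M'], priors['K'], priors['S0'], priors['n0'])
        state['Sigma'][k], state['A'][k] = sample_mniw_posterior(
            A_mean, S_bb, iw_scale, iw_df, rng)
    # Remaining updates (global beta, hyperparameters, sticky transition rows pi_k;
    # steps 6-8 of Algorithm 2) would follow here and are omitted in this sketch.
    state['z'] = z
    return state
```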