Extracting State Transition Dynamics from Multiple Spike Trains with Correlated Poisson HMM

Kentaro Katahira(1,2), Jun Nishikawa(2), Kazuo Okanoya(2) and Masato Okada(1,2)
(1) Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
(2) RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan
katahira@mns.k.u-tokyo.ac.jp

Abstract

Neural activity is non-stationary and varies across time. Hidden Markov Models (HMMs) have been used to track transitions among quasi-stationary discrete neural states. Within this context, an independent Poisson model has been used for the output distribution of HMMs; hence, such models cannot track changes in correlation that occur without modulation of the firing rate. To achieve this, we apply a multivariate Poisson distribution with correlation terms as the output distribution of HMMs. We formulate Variational Bayes (VB) inference for the model. VB can automatically determine the appropriate number of hidden states and correlation types while avoiding the over-learning problem. We developed an efficient algorithm for computing posteriors using the recurrence relations of the multivariate Poisson distribution. We demonstrate the performance of our method on synthetic data and on real spike trains recorded from a songbird.

1 Introduction

Neural activities are highly non-stationary and vary from time to time according to stimuli and changes in internal state. Hidden Markov Models (HMMs) have been used for segmenting spike trains into quasi-stationary states, within which the spike train is regarded as stationary so that statistics (e.g., cross-correlation and inter-spike interval) can be calculated [1, 2, 3]. We can also calculate these statistics from time-binned count data (e.g., the Peri-Stimulus Time Histogram, PSTH). However, this requires a large number of trials to obtain good estimates for all bins, which can be problematic in neurophysiological experiments. HMMs enlarge the effective amount of data for estimating the statistics. Moreover, the PSTH approach cannot be applied when spike data cannot be aligned to stimuli or to the behavior of animals; HMMs are suitable for such situations.

Previous studies using HMMs have assumed that all neural activities are independent of one another given the hidden states; hence, those models cannot discriminate between states whose firing rates are almost the same but whose correlations among neurons differ. However, there have been reports showing that the correlation between neurons can change within a fraction of a second without modulation of the firing rate (e.g., [4]). We developed a method that enables us to segment spike trains based on differences in neuronal correlation as well as in firing rate.

Treating neuronal correlations (higher-order as well as pairwise) among multiple spike trains has been one of the central challenges in computational neuroscience. There have been approaches that calculate correlations by binarizing spike trains with small bin sizes [5, 6]. These approaches are limited to correlations at bin lengths short enough to include at most one spike. Here, we introduce a multivariate Poisson distribution with a higher-order correlation structure (abbreviated as a correlated Poisson distribution) as the output distribution of HMMs. The correlated Poisson distribution can incorporate correlation at arbitrary time intervals.
To construct an optimal model from limited neurophysiological data, it is crucial to select a model of appropriate complexity and to avoid over-fitting. In our model, model complexity corresponds to the number of hidden states and the types of correlations (we can choose whether to include pairwise correlations, third-order correlations, or higher-order correlations). The maximum likelihood approach adopted in previous studies [1, 7, 8] cannot be used for this purpose, since the likelihood criterion simply increases as the number of model parameters increases. Model-selection criteria used with the maximum likelihood approach, e.g., Akaike's information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC), are based on asymptotic assumptions that hold only when a large amount of data is available. Furthermore, the asymptotic normality assumed in these criteria does not hold in non-identifiable models, including HMMs [9].

In this study, we apply the variational Bayes (VB) method [10, 11] to HMMs whose output distribution is a correlated Poisson distribution. VB is an approximation to the Bayesian method and can avoid over-fitting even when the sample size is small. An optimal model structure can be determined based on the tractable variational free energy, which is an upper bound on the negative marginal log-likelihood. Since the variational free energy does not rely on asymptotic assumptions, VB works well in practice even when the sample size is small [12]. The computation of posteriors for a correlated Poisson distribution imposes a serious computational burden. We developed an efficient algorithm that calculates them by using the recurrence relations of the multivariate Poisson distribution [13]. To the best of our knowledge, this is the first report to introduce the VB method for a correlated Poisson distribution. Although Markov chain Monte Carlo (MCMC) methods have been applied to inferring posteriors for a correlated Poisson distribution [14], MCMC schemes are computationally demanding. We demonstrate the performance of the method on multiple spike trains, both on a synthesized spike train and on real spike data recorded from the forebrain vocal-control nucleus HVC of an anesthetized songbird.

2 Method

2.1 HMM with multivariate Poisson distribution

Suppose that we obtain spike trains of $C$ neurons by simultaneous recording. As preprocessing, we first discretize the spike trains with non-overlapping windows of length $\Delta$ to obtain spike-count data. The number of spikes of neuron $c$ in the $t$th window of the $n$th trial is denoted by $x_c^{n,t}$. The spike-count data are summarized as $X^{n,t} = \{x_c^{n,t}\}_{c=1}^{C}$ and $X = \{X^{n,t}\}_{n=1,t=1}^{N,T}$. We assume that the spike-count data set $X$ is produced by a $K$-valued discrete hidden state, $Y = \{y^{n,t}\}_{n=1,t=1}^{N,T}$, and that the sequences of hidden states are generated by a first-order Markov process with state transition matrix $a = \{a_{ij}\}_{i,j=1}^{K}$, $a_{ij} = p(y^{n,t} = j \mid y^{n,t-1} = i)\ \forall n,t$, and initial state probabilities $\pi = \{\pi_i\}$, $\pi_i = p(y^{n,1} = i)\ \forall n$, where $\sum_{i=1}^{K}\pi_i = 1$, $\sum_{j=1}^{K} a_{ij} = 1\ \forall i$, and $a_{ij} \geq 0\ \forall i,j$. Hidden states $y^{n,t}$ are represented by binary variables $y_k^{n,t}$ such that $y_k^{n,t} = 1$ if the hidden state at the $t$th window of the $n$th trial is $k$, and $y_k^{n,t} = 0$ otherwise. At state $k$, the spike count $X^{n,t}$ is assumed to be generated according to $p(X^{n,t} \mid \theta_k)$, whose specific form is given in the following. Next, we introduce the correlated Poisson distribution.
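The preprocessing step above amounts to histogramming spike times into windows of length $\Delta$. Below is a minimal Python sketch of this step, assuming spike times are given as arrays in seconds; the function name and data layout are our own illustration, not part of the original method.

```python
import numpy as np

def bin_spike_trains(spike_times, delta, t_end):
    """Discretize C spike trains into non-overlapping windows of length delta.

    spike_times: list of C arrays of spike times (in seconds) for one trial.
    delta:       window length in seconds (e.g., 0.1 for 100 ms).
    t_end:       trial duration in seconds.
    Returns an integer array of shape (T, C): counts[t, c] = x_c^t.
    """
    edges = np.arange(0.0, t_end + delta, delta)  # window boundaries
    return np.stack(
        [np.histogram(st, bins=edges)[0] for st in spike_times], axis=1
    )
```

For example, with $\Delta = 0.1$ (the 100 ms window used in Section 4) and a 10 s trial, the result is a 100 x C count matrix, one row per window.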
For brevity, we omit the superscripts $n, t$ for the moment. As an example, let us first consider the trivariate case ($C = 3$) with second- and third-order correlations. We introduce auxiliary hidden variables $s_l$, $l \in \{1, 2, 3, 12, 13, 23, 123\}$, which satisfy

$$x_1 = s_1 + s_{12} + s_{13} + s_{123}, \quad x_2 = s_2 + s_{12} + s_{23} + s_{123}, \quad x_3 = s_3 + s_{13} + s_{23} + s_{123}.$$

Each $s_l$ obeys $P(s_l \mid \lambda_l)$, where $P(x \mid \lambda)$ denotes a univariate Poisson distribution: $P(x \mid \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}$. Due to the reproducing property of the Poisson distribution, each $x_i$ also marginally follows a Poisson distribution with parameter $\lambda_i + \lambda_{ij} + \lambda_{ik} + \lambda_{ijk}$, $i, j, k \in \{1, 2, 3\}$, $i \neq j \neq k$. The mean vector of this distribution is

$$(\lambda_1 + \lambda_{12} + \lambda_{13} + \lambda_{123},\ \lambda_2 + \lambda_{12} + \lambda_{23} + \lambda_{123},\ \lambda_3 + \lambda_{13} + \lambda_{23} + \lambda_{123})^T$$

and its variance-covariance matrix is

$$\begin{pmatrix} \lambda_1 + \lambda_{12} + \lambda_{13} + \lambda_{123} & \lambda_{12} + \lambda_{123} & \lambda_{13} + \lambda_{123} \\ \lambda_{12} + \lambda_{123} & \lambda_2 + \lambda_{12} + \lambda_{23} + \lambda_{123} & \lambda_{23} + \lambda_{123} \\ \lambda_{13} + \lambda_{123} & \lambda_{23} + \lambda_{123} & \lambda_3 + \lambda_{13} + \lambda_{23} + \lambda_{123} \end{pmatrix}.$$

The general multivariate Poisson distribution is defined using the vector $S = (s_1, s_2, \ldots, s_L)^T$ and a $C \times L$ matrix $B = [B_1, B_2, \ldots, B_J]$ with 0/1 elements, where $B_j$, $j = 1, \ldots, J$, is a sub-matrix of dimensions $C \times {}_C C_j$, and ${}_C C_j$ is the number of combinations of choosing $j$ out of $C$ elements. The vector $x = (x_1, x_2, \ldots, x_C)^T$ defined as $x = BS$ follows a multivariate Poisson distribution. In the above trivariate example, $S = (s_1, s_2, s_3, s_{12}, s_{13}, s_{23}, s_{123})^T$ and $B = [B_1, B_2, B_3]$, where

$$B_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \quad B_3 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. \tag{1}$$

We can also consider a model with only second-order correlations by setting $B = [B_1, B_2]$ and $S = (s_1, s_2, s_3, s_{12}, s_{13}, s_{23})^T$, or a model with only a third-order correlation by setting $B = [B_1, B_3]$ and $S = (s_1, s_2, s_3, s_{123})^T$. The probability mass function of $x$ is given by

$$p(x \mid \theta_k) = \sum_{S \in G(x)} \prod_l P(s_l \mid \lambda_{k,l}), \tag{2}$$

where $G(x)$ denotes the set of $S$ such that $x = BS$. The calculation of this probability can be computationally expensive, since the summation over possible $S$ can be exhaustive, especially when there are many spikes per window. However, the computational burden can be alleviated by using the recurrence relations of the multivariate Poisson distribution [13]; for further details on the computation, see the Appendix. We call the HMM with this output distribution the correlated Poisson HMM (CP-HMM). When we assume that the spike counts of all neurons are independent (i.e., $B = B_1$, $S = (s_1, s_2, s_3)^T$), the output distribution reduces to

$$p(x \mid \theta_k) = \prod_{c=1}^{C} P(x_c \mid \lambda_{k,c}). \tag{3}$$

We call the HMM with this distribution the independent Poisson HMM (IP-HMM); it is a special case of the CP-HMM. The complete log-likelihood of the CP-HMM is

$$\log p(X, Y, S \mid \theta) = \sum_{n=1}^{N} \left[ \sum_{k=1}^{K} y_k^{n,1} \log \pi_k + \sum_{t=2}^{T} \sum_{k=1}^{K} \sum_{k'=1}^{K} y_k^{n,t-1} y_{k'}^{n,t} \log a_{kk'} + \sum_{t=1}^{T} \sum_{k=1}^{K} y_k^{n,t} \log \sum_{S \in G(X^{n,t})} \prod_l P(s_l \mid \lambda_{k,l}) \right], \tag{4}$$

where $\theta$ denotes the parameter set $\{\pi, a, \lambda\}$.

2.2 Variational Bayes

Here, we derive VB for CP-HMMs. We use conjugate prior distributions for all parameters of CP-HMMs, which enables the posterior distributions to take the same form as the priors. The prior distributions for the initial probabilities $\pi$ and the state transition matrix $a$ are Dirichlet:

$$\varphi(\pi) = D(\{\pi_k\}_{k=1}^{K} \mid \{u_k^{(\pi)}\}_{k=1}^{K}), \quad \varphi(a) = \prod_{i=1}^{K} D(\{a_{ik}\}_{k=1}^{K} \mid \{u_k^{(A)}\}_{k=1}^{K}), \tag{5}$$

where $D(\cdot)$ is defined as $D(\{a_k\}_{k=1}^{K} \mid \{u_k\}_{k=1}^{K}) = \frac{\Gamma(\sum_{k=1}^{K} u_k)}{\prod_{k=1}^{K} \Gamma(u_k)} \prod_{k=1}^{K} a_k^{u_k - 1}$.
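To make the construction concrete, the following sketch (our own illustration, assuming NumPy and SciPy) samples from the full trivariate model via $x = BS$ and evaluates the pmf of Eq. 2 by brute-force enumeration of $G(x)$. This enumeration is exactly the exhaustive summation the text warns about; the recurrence relations of the Appendix are the efficient alternative.

```python
import numpy as np
from itertools import product
from scipy.stats import poisson

# B for the full trivariate model (Eq. 1); columns: s1, s2, s3, s12, s13, s23, s123
B = np.array([[1, 0, 0, 1, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 1],
              [0, 0, 1, 0, 1, 1, 1]])

def corr_poisson_pmf(x, lam, B=B):
    """Eq. (2) by brute force: sum of prod_l P(s_l | lam_l) over G(x) = {S >= 0 : x = B S}."""
    x = np.asarray(x)
    # s_l cannot exceed the smallest count it contributes to
    ranges = [range(int(x[B[:, l] == 1].min()) + 1) for l in range(B.shape[1])]
    total = 0.0
    for S in product(*ranges):
        S = np.array(S)
        if np.array_equal(B @ S, x):
            total += poisson.pmf(S, lam).prod()
    return total

def sample_corr_poisson(lam, n, B=B, rng=np.random.default_rng(0)):
    """Draw n samples of x = B S, with independent s_l ~ Poisson(lam_l)."""
    S = rng.poisson(lam, size=(n, B.shape[1]))
    return S @ B.T
```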
The conjugate prior for the Poisson means $\lambda = \{\lambda_{k,l}\}_{k=1,\,l=1}^{K,\,L}$ of the auxiliary hidden variables $\{s_l\}_l$ is

$$\varphi(\lambda) = \prod_{k=1}^{K} \prod_l \mathcal{G}(\lambda_{k,l} \mid \alpha_0, \beta_0), \tag{6}$$

where $\mathcal{G}(\cdot)$ denotes the Gamma distribution, defined as $\mathcal{G}(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda}$. In the experiments discussed below, we set the hyperparameters to $u_j^{(\pi)} = u_j^{(A)} = 0.1\ \forall j$ and $\alpha_0 = 0.1$, $\beta_0 = 0.1$.

The Bayesian method calculates $p(\theta, Z \mid X, M)$, the posterior over the unknown parameters $\theta$ and the hidden variable set $Z = \{Y, S\}$ given the data and the model structure $M$ (in our case, $M$ specifies the number of hidden states and the correlation structure). However, calculating this posterior involves an intractable integral. The VB approach approximates the true posterior $p(\theta, Z \mid X, M)$ by a factored test distribution $r(\theta)Q(Z)$. To make the test distribution close to the true posterior, we minimize the Kullback-Leibler (KL) divergence from $r(\theta)Q(Z)$ to $p(\theta, Z \mid X, M)$:

$$\mathrm{KL}(r(\theta)Q(Z) \,\|\, p(\theta, Z \mid X, M)) = \left\langle \log \frac{r(\theta)Q(Z)}{p(Z, \theta \mid X, M)} \right\rangle_{r(\theta)Q(Z)} = \log p(X \mid M) - \langle \log p(X, Z, \theta \mid M) \rangle_{r(\theta)Q(Z)} - H_r(\theta) - H_Q(Z), \tag{7}$$

where $\langle \cdot \rangle_{p(x)}$ denotes the expectation over $p(x)$ and $H_p(x) = -\langle \log p(x) \rangle_{p(x)}$ is the entropy of the distribution $p(x)$. Since the log marginal likelihood $\log p(X \mid M)$ is independent of $r(\theta)$ and $Q(Z)$, minimizing the KL divergence is equivalent to minimizing the variational free energy

$$F \equiv -\langle \log p(X, Z, \theta \mid M) \rangle_{r(\theta)Q(Z)} - H_r(\theta) - H_Q(Z). \tag{8}$$

VB alternately minimizes $F$ with respect to $Q(Z)$ and $r(\theta)$; the minimization with respect to $Q(Z)$ is called the VB-E step, and that with respect to $r(\theta)$ the VB-M step.

VB-E step: Using the Lagrange multiplier method, the VB-E step is derived as

$$Q(Z) = \frac{1}{C_Q} \exp \langle \log p(X, Z \mid \theta) \rangle_{r(\theta)},$$

where $C_Q$ is a normalization constant. More specifically, the following quantities are calculated:

$$\langle y_k^{n,t} \rangle_{Q(Y)} = \frac{\tilde{p}(y_k^{n,t} = 1 \mid X^{n,1:t})\, \tilde{p}(X^{n,t+1:T} \mid y_k^{n,t} = 1)}{\sum_{i=1}^{K} \tilde{p}(y_i^{n,t} = 1 \mid X^{n,1:t})\, \tilde{p}(X^{n,t+1:T} \mid y_i^{n,t} = 1)},$$

$$\langle y_k^{n,t-1} y_{k'}^{n,t} \rangle_{Q(Y)} = \frac{\tilde{p}(y_k^{n,t-1} = 1 \mid X^{n,1:t-1})\, \tilde{a}_{kk'}\, \tilde{p}(X^{n,t} \mid \theta_{k'})\, \tilde{p}(X^{n,t+1:T} \mid y_{k'}^{n,t} = 1)}{\sum_{i=1}^{K} \sum_{j=1}^{K} \tilde{p}(y_i^{n,t-1} = 1 \mid X^{n,1:t-1})\, \tilde{a}_{ij}\, \tilde{p}(X^{n,t} \mid \theta_j)\, \tilde{p}(X^{n,t+1:T} \mid y_j^{n,t} = 1)}.$$

These quantities are obtained by the forward-backward algorithm [11]. The sub-normalized quantity $\tilde{a}_{ij}$ is defined as $\tilde{a}_{ij} = \exp(\langle \log a_{ij} \rangle_{r(a)})$, and $\tilde{p}(X^{n,t} \mid \theta_k)$ is

$$\tilde{p}(X^{n,t} \mid \theta_k) = \sum_{S_k^{n,t} \in G(X^{n,t})} \prod_l \tilde{P}(s_{k,l}^{n,t} \mid \lambda_{k,l}), \tag{9}$$

where $\tilde{P}(s_l \mid \lambda_{k,l})$ is a sub-normalized distribution:

$$\tilde{P}(s_l \mid \lambda_{k,l}) = \exp\left\{ s_l \log \tilde{\lambda}_{k,l} - \log(s_l!) - \bar{\lambda}_{k,l} \right\}, \quad \text{where} \quad \tilde{\lambda}_{k,l} = \exp\left\{ \langle \log \lambda_{k,l} \rangle_{r(\lambda_k)} \right\}, \quad \bar{\lambda}_{k,l} = \langle \lambda_{k,l} \rangle_{r(\lambda_k)}. \tag{10}$$

These quantities can be calculated by using the recurrence relations of the multivariate Poisson distribution (see the Appendix). The posterior for $S$ is given by

$$\langle s_{k,l}^{n,t} \rangle_{Q(S)} = \frac{\sum_{S_k^{n,t} \in G(X^{n,t})} s_{k,l}^{n,t} \prod_{l'} \tilde{P}(s_{k,l'}^{n,t} \mid \lambda_{k,l'})}{\sum_{S_k^{n,t} \in G(X^{n,t})} \prod_{l'} \tilde{P}(s_{k,l'}^{n,t} \mid \lambda_{k,l'})}. \tag{11}$$

This is also calculated by using the recurrence relations of the multivariate Poisson distribution.
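The sub-normalized quantities of Eq. 10 have closed forms under the Dirichlet and Gamma posteriors derived in the VB-M step that follows: $\langle \log a_{ij} \rangle$ under a Dirichlet is a difference of digamma functions, and under a Gamma$(\alpha, \beta)$ posterior $\langle \log \lambda \rangle = \psi(\alpha) - \log \beta$ and $\langle \lambda \rangle = \alpha / \beta$. A short sketch of these standard identities (our own code, with illustrative variable names):

```python
import numpy as np
from scipy.special import digamma

def subnormalized_quantities(w_a, w_alpha, w_beta):
    """Sub-normalized quantities used in the VB-E step (Eqs. 9-10).

    w_a:     (K, K) Dirichlet posterior parameters w^(a) for the rows of a.
    w_alpha: (K, L) Gamma shape parameters w^lambda_{k,l}.
    w_beta:  (K, 1) Gamma rate parameters w^lambda_k.
    """
    # a~_ij = exp(<log a_ij>), with <log a_ij> = psi(w_ij) - psi(sum_k w_ik)
    a_tilde = np.exp(digamma(w_a) - digamma(w_a.sum(axis=1, keepdims=True)))
    # lambda~ = exp(<log lambda>) = exp(psi(alpha) - log beta); lambda-bar = alpha / beta
    lam_tilde = np.exp(digamma(w_alpha) - np.log(w_beta))
    lam_bar = w_alpha / w_beta
    return a_tilde, lam_tilde, lam_bar
```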
VB-M step: By again using the Lagrange multiplier method, the VB-M step is derived as

$$r(\theta) = \frac{1}{C_r} \varphi(\theta) \exp \langle \log p(X, Z \mid \theta) \rangle_{Q(Z)},$$

where $C_r$ is a normalization constant. More specifically, $r(\theta) = r(\pi) r(a) r(\lambda)$, with

$$r(\pi) = D(\{\pi_k\}_{k=1}^{K} \mid \{w_k^{(\pi)}\}_{k=1}^{K}), \quad r(a) = \prod_{i=1}^{K} D(\{a_{ik}\}_{k=1}^{K} \mid \{w_{ik}^{(a)}\}_{k=1}^{K}), \quad r(\lambda) = \prod_{k=1}^{K} \prod_l \mathcal{G}(\lambda_{k,l} \mid w_{k,l}^{\lambda}, w_k^{\lambda}),$$

where

$$w_{k,l}^{\lambda} = \alpha_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} \langle s_{k,l}^{n,t} \rangle_{Q(S)} \langle y_k^{n,t} \rangle_{Q(Y)}, \quad w_k^{\lambda} = \beta_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} \langle y_k^{n,t} \rangle_{Q(Y)},$$

$$w_j^{(\pi)} = u_j^{(\pi)} + \sum_{n=1}^{N} \langle y_j^{n,1} \rangle_{Q(Y)}, \quad w_{ij}^{(a)} = u_j^{(A)} + \sum_{n=1}^{N} \sum_{t=2}^{T} \langle y_i^{n,t-1} y_j^{n,t} \rangle_{Q(Y)}.$$

VB alternates the VB-E and VB-M steps until the variational free energy converges to a local minimum. In the experiments discussed below, we started the algorithm from 10 different initializations to avoid poor local minima.

3 Demonstration on a synthetic spike train

Using a synthetic spike train of three neurons, let us first demonstrate how to apply our method. In the case of three neurons, there are four choices of correlation type: (1) no correlation term, (2) only a second-order correlation term, (3) only a third-order correlation term, and (4) both. Hereafter, we call these IP-HMM, 2CP-HMM, 3CP-HMM, and full-CP-HMM. We generated spike trains using a multivariate Poisson distribution with only a third-order correlation, whose Poisson means depend on the period as follows: (a) $\lambda_1 = \lambda_2 = \lambda_3 = 0.5$, $\lambda_{123} = 0.0$ for $t \in [1, 10]$; (b) $\lambda_1 = \lambda_2 = \lambda_3 = 1.5$, $\lambda_{123} = 0.0$ for $t \in [11, 50]$; (c) $\lambda_1 = \lambda_2 = \lambda_3 = 0.5$, $\lambda_{123} = 1.0$ for $t \in [51, 90]$; and (d) $\lambda_1 = \lambda_2 = \lambda_3 = 0.5$, $\lambda_{123} = 0.0$ for $t \in [91, 100]$. Periods (b) and (c) have the same mean firing rate (the mean spike count in one window is $\lambda_i + \lambda_{123} = 1.5$ for $i \in \{1, 2, 3\}$) and differ only in the third-order correlation. Therefore, classical Poisson HMMs that employ an independent Poisson assumption [1, 2, 7] cannot segment them into distinct states. Figure 1A shows that our method was able to do so.

Figure 1 (plots omitted): Typical examples of estimation results for the correlated Poisson HMM with third-order correlation applied to a simulated spike train. A: From top, 1) the spike trains of the three neurons, 2) the probability of being in state $k$ at window $t$, $\langle y_k^t \rangle_{Q(Y)}$, 3) the spike-count data $x^t$, and 4) the posterior means of the hidden variables $s_{k,l}^t$ (horizontal axis: window index $t$, 1-100). B: Posterior means of the Poisson means $\lambda_{k,l}$ for all states. C: Variational free energy of all models (Independent, 2nd-order, 3rd-order, Full-order) against the number of hidden states (1-5).
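The generation process for this synthetic data can be sketched as follows (our own illustration, assuming NumPy): one trial of counts is drawn via the auxiliary-variable construction $x = BS$ with $B = [B_1, B_3]$.

```python
import numpy as np

def generate_synthetic_counts(rng=np.random.default_rng(0)):
    """One trial of synthetic spike counts following the four periods of Sec. 3.

    Third-order-only model: x_c = s_c + s_123, with independent Poisson s.
    Returns an array of shape (100, 3).
    """
    # (lambda_i, lambda_123, number of windows) for periods (a)-(d)
    periods = [(0.5, 0.0, 10), (1.5, 0.0, 40), (0.5, 1.0, 40), (0.5, 0.0, 10)]
    counts = []
    for lam_i, lam_123, n_win in periods:
        s_ind = rng.poisson(lam_i, size=(n_win, 3))    # independent factors s_1..s_3
        s_123 = rng.poisson(lam_123, size=(n_win, 1))  # common third-order factor
        counts.append(s_ind + s_123)                   # x = B S with B = [B1, B3]
    return np.vstack(counts)
```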
We generated spike trains for 10 trials, but only one trial is shown. Periods (b) and (c) are segmented into states 1 and 2, whose Poisson means differ (Fig. 1B). The bottom four lines in Fig. 1A plot the posterior means of $s_{k,l}^t$, $l \in \{1, 2, 3, 123\}$ (here, we omit the trial index $n$). These plots separately visualize the contributions of the independent factors and the correlation factor to the spike counts $x_c^t$, $c \in \{1, 2, 3\}$. The spike counts in period (b) can be viewed as independent firing: even when spikes fall in the same window, this can be regarded as a coincidence predicted under the assumption of independent firing. In contrast, the spike counts in period (c) can be regarded as having contributions from the common factor $s_{2,123}^t$ as well as from the independent factors $s_{2,i}^t$, $i \in \{1, 2, 3\}$. Here, we used a 3CP-HMM with three hidden states. Because periods (a) and (d) have identical statistics, it is clear that a model with three states is sufficient for this spike train. Can we then select this model from the data? Figure 1C shows the variational free energy $F$. The 3CP-HMM with three hidden states yields the lowest $F$, implying that it is optimal. The 3CP-HMMs with fewer hidden states, the IP-HMMs, and the 2CP-HMMs cannot represent the statistical structure of the data and hence yield higher $F$. The 3CP-HMMs with more hidden states ($K > 3$) and the full-CP-HMMs ($K \geq 3$) can include the optimal model but, penalized by the Bayesian Occam's razor, also yield higher $F$. Thus, we can select the optimal model based on $F$, at least in this example.

4 Application to spike trains from HVC in a songbird

We applied our method to data collected from the nucleus HVC of a songbird. HVC is an important nucleus that integrates auditory information with the motor information of song sequences [15]. We obtained spike trains of three single units using a silicon probe in one anesthetized Bengalese finch. The bird's own song (BOS) and its reversed song (REV) were presented 50 times each during recording. Spontaneous activity (Silent) was recorded so that we could obtain the same amount of data as in the stimulus-presentation conditions. Further details of the recordings are described elsewhere [16].

We modeled the spike trains for each stimulus condition using IP-HMMs and CP-HMMs, varying the number of states $K$ and the correlation structure, and then selected the model that yielded the lowest free energy. We used a window length of $\Delta = 100$ ms. The selected models are summarized in Table 1.

Table 1: Results of model selection for spike trains from HVC.
  Stimulus   K   Correlation structure
  BOS        4   Independent
  REV        4   3rd-order
  Silent     3   Independent

Table 2: Results of model selection under the time-stationary assumption (K = 1).
  Stimulus   Correlation structure
  BOS        2nd-order
  REV        2nd-order
  Silent     Full-order

Figure 2 (plots omitted): Typical examples of VB estimates for spike trains from HVC with a) the bird's own song (BOS), b) its reversed song (REV), and c) no stimulus presented (Silent); horizontal axis: time, 0-5 sec. The model selected based on the variational free energy was used for each condition (see Table 1). Each row corresponds to a different trial. The background color indicates the most probable state at each time window. The right panels show the posterior means $\bar{\lambda}_{k,l}$ for all states.

Figure 2 shows typical examples of spike trains and the segmentation results for the selected models. The CP-HMMs were selected only for the spike trains recorded while REV was presented. If we assume that the spike statistics do not change over the trials (in our case, this corresponds to the model with only one hidden state, $K = 1$), CP-HMMs were selected under all experimental conditions (Table 2). These results reflect the fact that neurons in anesthetized animals transit simultaneously between high-firing and low-firing states [17], which can be captured by a Poisson distribution with correlation terms. Time-stationary assumptions have often been employed to obtain a sufficient sample size for estimating correlation (e.g., [6]). Our results suggest that such results should be interpreted with care: even when spike trains appear to be correlated, once state transitions are taken into account, they may be better captured by an independent Poisson model.
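The selection procedure used throughout this section is a plain search over the number of states and correlation structures, keeping the candidate with the lowest converged free energy. A hypothetical sketch follows; fit_cphmm is a placeholder for a VB training routine returning an object with a free_energy attribute, not the authors' code.

```python
import itertools

def select_model(X, max_K=5, corr_types=("independent", "2nd", "3rd", "full"),
                 n_restarts=10):
    """Pick the (K, correlation structure) pair with the lowest free energy F."""
    best = None
    for K, corr in itertools.product(range(1, max_K + 1), corr_types):
        # keep the best of several random initializations, as in the paper
        F = min(fit_cphmm(X, K, corr, seed=s).free_energy  # hypothetical routine
                for s in range(n_restarts))
        if best is None or F < best[0]:
            best = (F, K, corr)
    return best  # (free energy, number of hidden states, correlation type)
```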
Here, we used spike trains for REV where 3CP-HMM was selected. We first divided spike trains into 20 learning and 20 test trials. In the training phase, we constructed models using the model selection based on the variational free energy with four restrictions: (1) an independent & stationary assumption (K = 1), (2) a stationary assumption (K = 1, correlation type was selected), (3) IP-HMM (K was selected), (4) CP-HMM (no restrictions), and (5) the full-CP-HMM (K was selected). In the prediction phase, we calculated the log-likelihood on test data under the posterior mean r() of selected models. The results are summarized in Table 3. We took averaged over different choices of training set and prediction sets. We can see that the log-likelihood on the test data improved by taking both the state transition and correlation structure into consideration. These results suggest that CP-HMMs can characterize the spike train better than classical Poisson-HMMs. The full-CP-HMM include 2nd-order CP-HMM, but shows lower predictive performance than the model in which correlation type were selected. This is likely due to over-fitting to the traing data. The VB approach selected the model with appropriate complexity avoiding over-fitting problem. 5 Discussion We constructed HMMs whose output is a correlated multivariate Poisson distribution for extracting state-transition dynamics from multiple spike trains. We applied the VB method for inferring the posterior over the parameter and hidden variables of the models. We have seen that VB can be used to select an appropriate model (the number of hidden states and correlation structure), which gives a better prediction. Our method incorporated the correlated Poisson distribution for treating pairwise and higher-order correlations. There have been approaches that have calculated correlations by binarizing spike data with log-linear [5] or maximum-entropy models [6]. These approaches are limited to treating correlations in short bin lengths, which include at most one spike. In contrast, our approach can incorporate correlations in an arbitrary time window from exact synchronization to firing-rate correlations on a modest time scale. The major disadvantages of our model are that it is incapable of negative correlations. It can be incorporated by employing a mixture of multivariatePoisson distributions for the output distribution of HMMs. VB can easily be extended to such models. 7 Appendix: Calculation of correlated Poisson distribution in VB-E step The sub-normarized distribution (Eq.9) can be calculated by using the recurrence relation of multivariate Poisson distribution [13]. Let us consider the tri-variate (C = 3) with the second-order correlation case, where B = [B1 , B2 ]. Here, the recursive scheme for the calculating Eq.9 is: · If all elements of X = (X1 , X2 , X3 ) are non-zero, then ~~ ~ x1 P (X1 = x1 , X2 = x2 , X3 = x3 |) =1 P (X1 = x1 - 1, X2 = x2 , X3 = x3 |) ~~ + 12 P (X1 = x1 - 1, X2 = x2 - 1, X3 = x3 |) ~~ + 13 P (X1 = x1 - 1, X2 = x2 , X3 = x3 - 1|). · If at most one element of X is non-zero, then 3 ~ ij ~ (X1 = x1 , X2 = x2 , X3 = x3 |) = exp - ~ P P (Xi = xi |i ), i, j 1, 2, 3. i