The Dynamic Hierarchical Dirichlet Process

Lu Ren (lr@ee.duke.edu), Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
David B. Dunson (dunson@stat.duke.edu), Department of Statistical Science, Duke University, Durham, NC 27708, USA
Lawrence Carin (lcarin@ee.duke.edu), Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Abstract

The dynamic hierarchical Dirichlet process (dHDP) is developed to model the time-evolving statistical properties of sequential data sets. The data collected at any time point are represented via a mixture associated with an appropriate underlying model, in the framework of the HDP. The statistical properties of data collected at consecutive time points are linked via a random parameter that controls their probabilistic similarity. The sharing mechanisms of the time-evolving data are derived, and a relatively simple Markov chain Monte Carlo sampler is developed. Experimental results are presented to demonstrate the model.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

The Dirichlet process (DP) mixture model (Escobar & West, 1995) has been widely used to perform density estimation and clustering, by generalizing finite mixture models to (in principle) infinite mixtures. In order to "share statistical strength" across different groups of data, the hierarchical Dirichlet process (HDP) (Teh et al., 2005) has been proposed to model the dependence among groups through sharing the same set of discrete parameters ("atoms"), with the mixture weights associated with the different atoms varying as a function of the data group. In the HDP, it is assumed that the data groups are exchangeable. However, in many real applications, such as seasonal market analysis and gene investigation for disease, data are measured in a sequential manner, and there is information in this temporal character that should ideally be exploited; this violates the aforementioned assumption of exchangeability.

Developing models for time-evolving data has recently been the focus of significant interest, and researchers have proposed various solutions directed toward specific applications. An early example is the order-based dependent DP (Griffin & Steel, 2006), in which the model is time-reversible but not Markovian, and it requires one to specify how the mixture weights change through time. Another related work is the time-varying Dirichlet process mixture model (Caron et al., 2007), based on a modified Polya urn scheme (Blackwell & MacQueen, 1973) and implemented by changing the number and locations of clusters over time. This method is easy to understand intuitively, but it poses computational challenges for large data sets.

To examine the temporal dynamics of scientific topics, latent Dirichlet allocation (Blei et al., 2003; Griffiths & Steyvers, 2004) has been used as a generative model for the analysis of documents. In order to explicitly model the dynamics of the underlying topics, Blei and Lafferty (Blei & Lafferty, 2006) proposed a dynamic topic model, in which the parameter at the previous time t-1 is the expectation for the distribution of the parameter at the next time t, and the correlation of samples at adjacent times is controlled by adjusting the variance of the conditional distribution. Unfortunately, the nonconjugate form of the conditional distribution requires approximations in the model inference. Recently, Dunson (Dunson, 2006) proposed a Bayesian dynamic model to learn the latent trait distribution through a mixture of DPs, in which the latent variable density changes dynamically in location and shape across levels of predictors. This dynamic structure is considered in this paper to extend the HDP to incorporate time dependence, with the following features: (i) two data samples drawn at proximate times have a higher probability of sharing the same underlying model parameters (atoms) than samples drawn at disparate times; and (ii) there is a possibility that temporally distant data samples may also share model parameters, thereby accounting for possible distant repetition in the data.

2. Dynamic HDP

2.1. Background

A Dirichlet process is a measure on measures and is parameterized as $G \sim DP(\alpha_0, G_0)$, in which $G_0$ is a base measure and $\alpha_0$ is a positive "precision" parameter. To provide an explicit form for a $G$ drawn from $DP(\alpha_0, G_0)$, Sethuraman (Sethuraman, 1994) developed a stick-breaking construction:

    $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \qquad \pi_k = \tilde{\pi}_k \prod_{i=1}^{k-1} (1 - \tilde{\pi}_i)$    (1)

where $\{\theta_k\}_{k=1}^{\infty}$ represent a set of atoms drawn i.i.d. from $G_0$ and $\{\pi_k\}_{k=1}^{\infty}$ represent a set of weights, with the constraint $\sum_{k=1}^{\infty} \pi_k = 1$; each $\tilde{\pi}_k$ is drawn i.i.d. from $Be(1, \alpha_0)$. According to the construction in (1), a draw $G$ from $DP(\alpha_0, G_0)$ is discrete with probability one. Based on this important property, Teh (Teh et al., 2005) proposed the hierarchical Dirichlet process (HDP) to link group-specific Dirichlet processes, learning the models jointly across multiple data sets.

Assume we have $J$ groups of data, with the $j$-th data set (group) denoted $\{x_{j,i}\}_{i=1,\ldots,N_j}$. For each of these data sets, $x_{j,i}$ is drawn from the model $F(\theta_{j,i})$ with parameters $\theta_{j,i} \sim G_j$, and the parameters $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$ are likely to assume the atoms $\theta_k$ for which the associated sticks $\pi_{j,k}$ are large, as a consequence of the form of $G_j$ given by (1); for the $J$ data sets, different group-specific $G_j$ are drawn from $DP(\alpha_{0j}, G_0)$, in which $G_0$ is itself drawn from another DP. The generative model for the HDP is represented as:

    $x_{j,i} \overset{ind}{\sim} F(\theta_{j,i}), \quad \theta_{j,i} \overset{iid}{\sim} G_j, \quad G_j \overset{ind}{\sim} DP(\alpha_{0j}, G_0), \quad G_0 \sim DP(\gamma, H)$    (2)

where $j = 1, \ldots, J$ and $i = 1, \ldots, N_j$. Under this hierarchical structure, not only can different observations $x_{j,i}$ and $x_{j,i'}$ in the same group share the same parameters, based on the stick weights represented by $G_j$, but the observations across different groups might also share parameters, as a consequence of the discrete form of $G_0$ (all $G_j$ are composed of the same set of atoms $\{\theta_k\}_{k=1}^{\infty}$). The clusters in each group $j$, defined by the set $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$, are inferred via the posterior density function on the parameters, with the likelihood function selecting the set of discrete parameters $\{\theta_k\}_{k=1}^{\infty}$ most consistent with the data $\{x_{j,i}\}_{i=1,\ldots,N_j}$. Meanwhile, clusters (and, hence, associated cluster parameters $\{\theta_k\}_{k=1}^{\infty}$) are shared across multiple data sets, as appropriate.

Although the HDP introduces a dependency between the $J$ groups, the data sets are assumed exchangeable. However, in many applications the data may be collected sequentially, and one may have a prior belief that sharing of data is more probable when the data sets are collected at similar points in time. The purpose of this paper is to extend the HDP to account for such temporal information.

Before proceeding, it will prove useful to consider an alternative form of the HDP model, as derived in (Teh et al., 2005). Specifically, each draw $G_j$ may be expressed as:

    $G_j = \sum_{k=1}^{\infty} \pi_{j,k} \delta_{\theta_k}, \quad \pi_j \overset{ind}{\sim} DP(\alpha_{0j}, \beta), \quad \beta \sim Stick(\gamma), \quad \theta_k \overset{iid}{\sim} H$    (3)

where $Stick(\gamma)$ stochastically generates an infinite set of sticks $\{\beta_1, \beta_2, \ldots\}$, based on a stick-breaking process of the form in (1), here with parameter $\gamma$, satisfying the constraint $\sum_{i=1}^{\infty} \beta_i = 1$.

2.2. Bayesian Dynamic Structure

Similar to the HDP, we again consider $J$ data sets, but now with an explicit assumption that the data sets are collected sequentially, with $\{x_{1,i}\}_{i=1,\ldots,N_1}$ collected first, $\{x_{2,i}\}_{i=1,\ldots,N_2}$ collected second, and $\{x_{J,i}\}_{i=1,\ldots,N_J}$ collected last. Since our assumption is that a time evolution exists between adjacent data groups, the distribution $G_{j-1}$, from which $\{\theta_{j-1,i}\}_{i=1,\ldots,N_{j-1}}$ are drawn, is likely related to $G_j$, from which $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$ are drawn. To specify explicitly the dependence between $G_{j-1}$ and $G_j$, Dunson (Dunson, 2006) proposed a Bayesian dynamic mixture of DPs (DMDP), in which $G_j$ shares features with $G_{j-1}$ but some innovation may also occur. The DMDP has the drawback that mixture components can only be added over time, so that one ends up with more components at later times as an artifact of the model.

In the dHDP, we have

    $G_j = (1 - \tilde{w}_{j-1}) G_{j-1} + \tilde{w}_{j-1} H_{j-1}$    (4)

where $G_1 \sim DP(\alpha_{01}, G_0)$, $H_{j-1}$ is called an innovation distribution drawn from $DP(\alpha_{0j}, G_0)$, and $\tilde{w}_{j-1} \sim Be(a_{w(j-1)}, b_{w(j-1)})$. In this way, $G_j$ is modified from $G_{j-1}$ by introducing a new innovation distribution $H_{j-1}$, and the random variable $\tilde{w}_{j-1}$ controls the probability of innovation (i.e., it defines the mixture weights). As a result, the relevant atoms adjust with time, and it is probable that proximate data will share the same atoms, but with the potential for transient innovation. Additionally, we assume that $G_0 \sim DP(\gamma, H)$, as in the HDP, to enforce that $G_0$ is discrete, which manifests another important aspect of the dynamic HDP: the same atoms are used for all $G_j$, but with different, time-evolving weights. Consequently, the model encourages sharing between temporally proximate data, but it is also possible to share between data sets widely separated in time.

Providing now more model details, the discrete base distribution drawn from $DP(\gamma, H)$ may be expressed as:

    $G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}$    (5)

where $\{\theta_k\}$ are the global parameter components (atoms), drawn independently from the base distribution $H$, and $\{\beta_k\}_{k=1,2,\ldots}$ are drawn from a stick-breaking process $Stick(\gamma)$, defined as:

    $\beta_k = \tilde{\beta}_k \prod_{l=1}^{k-1} (1 - \tilde{\beta}_l), \qquad \tilde{\beta}_k \overset{iid}{\sim} Be(1, \gamma)$    (6)

To further develop the dynamic relationship from $G_1$ to $G_J$, we extend the mixture structure in (4) from group to group:

    $G_j = (1 - \tilde{w}_{j-1}) G_{j-1} + \tilde{w}_{j-1} H_{j-1} = \prod_{l=1}^{j-1} (1 - \tilde{w}_l)\, G_1 + \sum_{l=1}^{j-1} \Big\{ \prod_{m=l+1}^{j-1} (1 - \tilde{w}_m) \Big\} \tilde{w}_l H_l = w_{j1} G_1 + w_{j2} H_1 + \cdots + w_{jj} H_{j-1}$    (9)

where $w_{jl} = \tilde{w}_{l-1} \prod_{m=l}^{j-1} (1 - \tilde{w}_m)$, for $l = 1, 2, \ldots, j$, with $\tilde{w}_0 = 1$. It can be easily verified that $\sum_{l=1}^{j} w_{jl} = 1$ for each $j$; $w_{jl}$ is the prior probability that the data in group $j$ will be drawn from the mixture distributions $G_1, H_1, \ldots, H_{j-1}$, respectively. If all $\tilde{w}_j = 0$, all of the groups share the same mixture distribution $G_1$ and the model reduces to a Dirichlet process mixture model, and if all $\tilde{w}_j = 1$ the model reduces to the HDP. Therefore, the dynamic HDP is more general than both the DP and the HDP, with each a special case. A visual representation of the model is depicted in Figure 1.

[Figure 1. Graphical representation of the dHDP: the base distribution $H$ and $G_0$, the innovation distributions $H_1, \ldots, H_{J-1}$, the mixing weights $\tilde{w}_1, \ldots, \tilde{w}_{J-1}$ with hyperparameters $(a_w, b_w)$, and the group distributions $G_1, G_2, \ldots, G_J$.]
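As an illustrative aside (not part of the original paper), the stick-breaking construction in (1) and (6) is easy to simulate with a finite truncation. The sketch below assumes NumPy; the function name and truncation level are hypothetical choices for illustration. It draws the stick proportions from $Be(1, \alpha_0)$ and forms the weights $\pi_k = \tilde{\pi}_k \prod_{i<k}(1 - \tilde{\pi}_i)$; for a long truncation the weights sum to nearly one, consistent with the constraint stated after (1).

```python
import numpy as np

def draw_dp_stick_breaking(alpha0, num_atoms, rng):
    """Truncated stick-breaking draw of the DP weights in (1):
    pi_k = v_k * prod_{i<k} (1 - v_i), with v_k ~ Beta(1, alpha0)."""
    v = rng.beta(1.0, alpha0, size=num_atoms)                 # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # mass left before break k
    pi = v * remaining                                        # mixture weights pi_k
    return pi

rng = np.random.default_rng(0)
pi = draw_dp_stick_breaking(alpha0=2.0, num_atoms=2000, rng=rng)
print(pi.sum())  # close to 1 for a long truncation
```

Smaller $\alpha_0$ concentrates the mass on fewer sticks, which is why $\alpha_0$ is described as a "precision" parameter: it governs how quickly the remaining stick length decays.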
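Likewise, the prior mixture weights $w_{jl}$ defined after (9) can be checked numerically. The following sketch (again assuming NumPy; the function name is hypothetical) builds the lower-triangular matrix of weights $w_{jl} = \tilde{w}_{l-1} \prod_{m=l}^{j-1}(1 - \tilde{w}_m)$ with $\tilde{w}_0 = 1$, so that row $j$ gives the prior probabilities that group $j$ draws from $G_1, H_1, \ldots, H_{j-1}$.

```python
import numpy as np

def dhdp_prior_weights(w_tilde):
    """Given innovation probabilities w~_1, ..., w~_{J-1} from (4), return the
    J x J lower-triangular matrix W with W[j-1, l-1] = w_{jl} as in (9):
    w_{jl} = w~_{l-1} * prod_{m=l}^{j-1} (1 - w~_m), with w~_0 = 1."""
    J = len(w_tilde) + 1
    wt = np.concatenate(([1.0], np.asarray(w_tilde, dtype=float)))  # wt[0] = w~_0 = 1
    W = np.zeros((J, J))
    for j in range(1, J + 1):
        for l in range(1, j + 1):
            W[j - 1, l - 1] = wt[l - 1] * np.prod(1.0 - wt[l:j])
    return W

w_tilde = np.array([0.3, 0.6, 0.2])   # hypothetical draws of w~_j ~ Be(a, b)
W = dhdp_prior_weights(w_tilde)
print(W.sum(axis=1))  # each row sums to 1, as claimed after (9)
```

The two limiting cases in the text also fall out directly: with all $\tilde{w}_j = 0$ the first column is all ones (every group draws from $G_1$, the DP mixture case), and with all $\tilde{w}_j = 1$ the matrix is the identity (each group uses only its newest component, the HDP case).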