The Dynamic Hierarchical Dirichlet Process

Lu Ren (lr@ee.duke.edu), Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
David B. Dunson (dunson@stat.duke.edu), Department of Statistical Science, Duke University, Durham, NC 27708, USA
Lawrence Carin (lcarin@ee.duke.edu), Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Abstract

The dynamic hierarchical Dirichlet process (dHDP) is developed to model the time-evolving statistical properties of sequential data sets. The data collected at any time point are represented via a mixture associated with an appropriate underlying model, in the framework of the HDP. The statistical properties of data collected at consecutive time points are linked via a random parameter that controls their probabilistic similarity. The sharing mechanisms of the time-evolving data are derived, and a relatively simple Markov chain Monte Carlo sampler is developed. Experimental results are presented to demonstrate the model.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

The Dirichlet process (DP) mixture model (Escobar & West, 1995) has been widely used to perform density estimation and clustering, by generalizing finite mixture models to (in principle) infinite mixtures. In order to "share statistical strength" across different groups of data, the hierarchical Dirichlet process (HDP) (Teh et al., 2005) has been proposed to model the dependence among groups through sharing the same set of discrete parameters ("atoms"), with the mixture weights associated with the different atoms varying as a function of the data group. In the HDP, it is assumed that the data groups are exchangeable. However, in many real applications, such as seasonal market analysis and gene investigation for disease, data are measured in a sequential manner, and there is information in this temporal character that should ideally be exploited; this violates the aforementioned assumption of exchangeability.

Developing models for time-evolving data has recently been the focus of significant interest, and researchers have proposed various solutions directed toward specific applications. An early example is the order-based dependent DP (Griffin & Steel, 2006), in which the model is time-reversible but not Markovian, and it requires one to specify how the mixture weights change through time. Another related work is the time-varying Dirichlet process mixture model (Caron et al., 2007), based on a modified Polya urn scheme (Blackwell & MacQueen, 1973) and implemented by changing the number and locations of clusters over time. This method is easy to understand intuitively, but it poses computational challenges for large data sets.

To examine the temporal dynamics of scientific topics, latent Dirichlet allocation (Blei et al., 2003; Griffiths & Steyvers, 2004) has been used as a generative model for the analysis of documents. In order to explicitly model the dynamics of the underlying topics, Blei and Lafferty (Blei & Lafferty, 2006) proposed a dynamic topic model, in which the parameter at the previous time t-1 is the expectation for the distribution of the parameter at the next time t, and the correlation of samples at adjacent times is controlled by adjusting the variance of the conditional distribution. Unfortunately, the nonconjugate form of the conditional distribution requires approximations in the model inference. Recently, Dunson (Dunson, 2006) proposed a Bayesian dynamic model to learn the latent trait distribution through a mixture of DPs, in which the latent variable density changes dynamically in location and shape across levels of predictors. This dynamic structure is considered in this paper to extend the HDP to incorporate time dependence, with the following features: (i) two data samples drawn at proximate times have a higher probability of sharing the same underlying model parameters (atoms) than samples drawn at disparate times; and (ii) there is a possibility that temporally distant data samples may also share model parameters, thereby accounting for possible distant repetition in the data.

2. Dynamic HDP

2.1. Background

A Dirichlet process is a measure on measures and is parameterized as $G \sim DP(\alpha_0, G_0)$, in which $G_0$ is a base measure and $\alpha_0$ is a positive "precision" parameter. To provide an explicit form for a $G$ drawn from $DP(\alpha_0, G_0)$, Sethuraman (Sethuraman, 1994) developed a stick-breaking construction:

    $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \qquad \pi_k = \tilde{\pi}_k \prod_{i=1}^{k-1} (1 - \tilde{\pi}_i)$    (1)

where $\{\theta_k\}_{k=1}^{\infty}$ represent a set of atoms drawn i.i.d. from $G_0$ and $\{\pi_k\}_{k=1}^{\infty}$ represent a set of weights, with the constraint $\sum_{k=1}^{\infty} \pi_k = 1$; each $\tilde{\pi}_k$ is drawn i.i.d. from $Be(1, \alpha_0)$. According to the construction in (1), a draw $G$ from $DP(\alpha_0, G_0)$ is discrete with probability one. Based on this important property, Teh (Teh et al., 2005) proposed the hierarchical Dirichlet process (HDP) to link group-specific Dirichlet processes, learning the models jointly across multiple data sets.

Assume we have $J$ groups of data, with the $j$-th data set (group) denoted $\{x_{j,i}\}_{i=1,\ldots,N_j}$. For each of these data sets, $x_{j,i}$ is drawn from the model $F(\theta_{j,i})$ with parameters $\theta_{j,i} \sim G_j$, and the parameters $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$ are likely to assume the atoms $\theta_k$ for which the associated sticks $\pi_{j,k}$ are large, as a consequence of the form of $G_j$ given by (1); for the $J$ data sets, different group-specific $G_j$ are drawn from $DP(\alpha_{0j}, G_0)$, in which $G_0$ is itself drawn from another DP. The generative model for the HDP is represented as:

    $x_{j,i} \overset{ind}{\sim} F(\theta_{j,i}), \quad \theta_{j,i} \overset{iid}{\sim} G_j, \quad G_j \overset{ind}{\sim} DP(\alpha_{0j}, G_0), \quad G_0 \sim DP(\gamma, H)$    (2)

where $j = 1, \ldots, J$ and $i = 1, \ldots, N_j$. Under this hierarchical structure, not only can different observations $x_{j,i}$ and $x_{j,i'}$ in the same group share the same parameters, based on the stick weights represented by $G_j$, but the observations across different groups might also share parameters, as a consequence of the discrete form of $G_0$ (all $G_j$ are composed of the same set of atoms $\{\theta_k\}_{k=1}^{\infty}$). The clusters in each group $j$, defined by the set $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$, are inferred via the posterior density function on the parameters, with the likelihood function selecting the set of discrete parameters $\{\theta_k\}_{k=1}^{\infty}$ most consistent with the data $\{x_{j,i}\}_{i=1,\ldots,N_j}$. Meanwhile, clusters (and, hence, associated cluster parameters $\{\theta_k\}_{k=1}^{\infty}$) are shared across multiple data sets, as appropriate.

Although the HDP introduces a dependency between the $J$ groups, the data sets are assumed exchangeable. However, in many applications the data may be collected sequentially, and one may have a prior belief that sharing of data is more probable when the data sets are collected at similar points in time. The purpose of this paper is to extend the HDP to account for such temporal information.

Before proceeding, it will prove useful to consider an alternative form of the HDP model, as derived in (Teh et al., 2005). Specifically, each draw $G_j$ may be expressed as:

    $G_j = \sum_{k=1}^{\infty} \pi_{j,k} \delta_{\theta_k}, \quad \pi_j \overset{ind}{\sim} DP(\alpha_{0j}, \beta), \quad \beta \sim Stick(\gamma), \quad \theta_k \overset{iid}{\sim} H$    (3)

where $Stick(\gamma)$ stochastically generates an infinite set of sticks $\{\beta_1, \beta_2, \ldots\}$, based on a stick-breaking process of the form in (1), here with parameter $\gamma$, satisfying the constraint $\sum_{i=1}^{\infty} \beta_i = 1$.

2.2. Bayesian Dynamic Structure

Similar to the HDP, we again consider $J$ data sets, but now with an explicit assumption that the data sets are collected sequentially, with $\{x_{1,i}\}_{i=1,\ldots,N_1}$ collected first, $\{x_{2,i}\}_{i=1,\ldots,N_2}$ collected second, and $\{x_{J,i}\}_{i=1,\ldots,N_J}$ collected last. Since our assumption is that a time evolution exists between adjacent data groups, the distribution $G_{j-1}$, from which $\{\theta_{j-1,i}\}_{i=1,\ldots,N_{j-1}}$ are drawn, is likely related to $G_j$, from which $\{\theta_{j,i}\}_{i=1,\ldots,N_j}$ are drawn. To specify explicitly the dependence between $G_{j-1}$ and $G_j$, Dunson (Dunson, 2006) proposed a Bayesian dynamic mixture of DPs (DMDP), in which $G_j$ shares features with $G_{j-1}$ but some innovation may also occur. The DMDP has the drawback that mixture components can only be added over time, so that one ends up with more components at later times as an artifact of the model.

In the dHDP, we have

    $G_j = (1 - \tilde{w}_{j-1}) G_{j-1} + \tilde{w}_{j-1} H_{j-1}$    (4)

where $G_1 \sim DP(\alpha_{01}, G_0)$, $H_{j-1}$ is called an innovation distribution drawn from $DP(\alpha_{0j}, G_0)$, and $\tilde{w}_{j-1} \sim Be(a_{w(j-1)}, b_{w(j-1)})$. In this way, $G_j$ is modified from $G_{j-1}$ by introducing a new innovation distribution $H_{j-1}$, and the random variable $\tilde{w}_{j-1}$ controls the probability of innovation (i.e., it defines the mixture weights). As a result, the relevant atoms adjust with time, and it is probable that proximate data will share the same atoms, but with the potential for transient innovation. Additionally, we assume that $G_0 \sim DP(\gamma, H)$, as in the HDP, to enforce that $G_0$ is discrete, which manifests another important aspect of the dynamic HDP: the same atoms are used for all $G_j$, but with different, time-evolving weights. Consequently, the model encourages sharing between temporally proximate data, but it is also possible to share between data sets widely separated in time.

Providing now more model details, the discrete base distribution drawn from $DP(\gamma, H)$ may be expressed as:

    $G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}$    (5)

where $\{\theta_k\}$ are the global parameter components (atoms), drawn independently from the base distribution $H$, and $\{\beta_k\}_{k=1,2,\ldots}$ are drawn from a stick-breaking process $Stick(\gamma)$, defined as:

    $\beta_k = \tilde{\beta}_k \prod_{l=1}^{k-1} (1 - \tilde{\beta}_l), \qquad \tilde{\beta}_k \overset{iid}{\sim} Be(1, \gamma)$    (6)

To further develop the dynamic relationship from $G_1$ to $G_J$, we extend the mixture structure in (4) from group to group:

    $G_j = (1 - \tilde{w}_{j-1}) G_{j-1} + \tilde{w}_{j-1} H_{j-1} = \prod_{l=1}^{j-1} (1 - \tilde{w}_l)\, G_1 + \sum_{l=1}^{j-1} \Big\{ \prod_{m=l+1}^{j-1} (1 - \tilde{w}_m) \Big\} \tilde{w}_l H_l = w_{j1} G_1 + w_{j2} H_1 + \cdots + w_{jj} H_{j-1}$    (9)

where $w_{jl} = \tilde{w}_{l-1} \prod_{m=l}^{j-1} (1 - \tilde{w}_m)$, for $l = 1, 2, \ldots, j$, with $\tilde{w}_0 = 1$. It can be easily verified that $\sum_{l=1}^{j} w_{jl} = 1$ for each $j$; $w_{jl}$ is the prior probability that the data in group $j$ will be drawn from the mixture distributions $G_1, H_1, \ldots, H_{j-1}$, respectively. If all $\tilde{w}_j = 0$, all of the groups share the same mixture distribution $G_1$ and the model reduces to a Dirichlet process mixture model, and if all $\tilde{w}_j = 1$ the model reduces to the HDP. Therefore, the dynamic HDP is more general than both the DP and the HDP, with each a special case. A visual representation of the model is depicted in Figure 1.

[Figure 1. Graphical representation of the dHDP: the base distribution $H$ and $G_0$, the innovation distributions $H_1, \ldots, H_{J-1}$, the mixing weights $\tilde{w}_1, \ldots, \tilde{w}_{J-1}$ with hyperparameters $(a_w, b_w)$, and the group distributions $G_1, G_2, \ldots, G_J$.]
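As an illustrative aside (not part of the original paper), the stick-breaking construction in (1) and (6) is easy to simulate with a finite truncation. The sketch below assumes NumPy; the function name and truncation level are hypothetical choices for illustration. It draws the stick proportions from $Be(1, \alpha_0)$ and forms the weights $\pi_k = \tilde{\pi}_k \prod_{i<k}(1 - \tilde{\pi}_i)$; for a long truncation the weights sum to nearly one, consistent with the constraint stated after (1).

```python
import numpy as np

def draw_dp_stick_breaking(alpha0, num_atoms, rng):
    """Truncated stick-breaking draw of the DP weights in (1):
    pi_k = v_k * prod_{i<k} (1 - v_i), with v_k ~ Beta(1, alpha0)."""
    v = rng.beta(1.0, alpha0, size=num_atoms)                 # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # mass left before break k
    pi = v * remaining                                        # mixture weights pi_k
    return pi

rng = np.random.default_rng(0)
pi = draw_dp_stick_breaking(alpha0=2.0, num_atoms=2000, rng=rng)
print(pi.sum())  # close to 1 for a long truncation
```

Smaller $\alpha_0$ concentrates the mass on fewer sticks, which is why $\alpha_0$ is described as a "precision" parameter: it governs how quickly the remaining stick length decays.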
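Likewise, the prior mixture weights $w_{jl}$ defined after (9) can be checked numerically. The following sketch (again assuming NumPy; the function name is hypothetical) builds the lower-triangular matrix of weights $w_{jl} = \tilde{w}_{l-1} \prod_{m=l}^{j-1}(1 - \tilde{w}_m)$ with $\tilde{w}_0 = 1$, so that row $j$ gives the prior probabilities that group $j$ draws from $G_1, H_1, \ldots, H_{j-1}$.

```python
import numpy as np

def dhdp_prior_weights(w_tilde):
    """Given innovation probabilities w~_1, ..., w~_{J-1} from (4), return the
    J x J lower-triangular matrix W with W[j-1, l-1] = w_{jl} as in (9):
    w_{jl} = w~_{l-1} * prod_{m=l}^{j-1} (1 - w~_m), with w~_0 = 1."""
    J = len(w_tilde) + 1
    wt = np.concatenate(([1.0], np.asarray(w_tilde, dtype=float)))  # wt[0] = w~_0 = 1
    W = np.zeros((J, J))
    for j in range(1, J + 1):
        for l in range(1, j + 1):
            W[j - 1, l - 1] = wt[l - 1] * np.prod(1.0 - wt[l:j])
    return W

w_tilde = np.array([0.3, 0.6, 0.2])   # hypothetical draws of w~_j ~ Be(a, b)
W = dhdp_prior_weights(w_tilde)
print(W.sum(axis=1))  # each row sums to 1, as claimed after (9)
```

The two limiting cases in the text also fall out directly: with all $\tilde{w}_j = 0$ the first column is all ones (every group draws from $G_1$, the DP mixture case), and with all $\tilde{w}_j = 1$ the matrix is the identity (each group uses only its newest component, the HDP case).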