An HDP-HMM for Systems with State Persistence

Emily B. Fox (ebfox@mit.edu), Department of EECS, Massachusetts Institute of Technology, Cambridge, MA 02139
Erik B. Sudderth (sudderth@eecs.berkeley.edu), Department of EECS, University of California, Berkeley, CA 94720
Michael I. Jordan (jordan@eecs.berkeley.edu), Department of EECS and Department of Statistics, University of California, Berkeley, CA 94720
Alan S. Willsky (willsky@mit.edu), Department of EECS, Massachusetts Institute of Technology, Cambridge, MA 02139

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) is a flexible, nonparametric model which allows state spaces of unknown size to be learned from data. We demonstrate some limitations of the original HDP-HMM formulation (Teh et al., 2006), and propose a sticky extension which allows more robust learning of smoothly varying dynamics. Using DP mixtures, this formulation also allows learning of more complex, multimodal emission distributions. We further develop a sampling algorithm that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates. Via extensive experiments with synthetic data and the NIST speaker diarization database, we demonstrate the advantages of our sticky extension, and the utility of the HDP-HMM in real-world applications.

1. Introduction

Hidden Markov models (HMMs) have been a major success story in many applied fields; they provide core statistical inference procedures in areas as diverse as speech recognition, genomics, structural biology, machine translation, cryptanalysis, and finance. Even after four decades of work on HMMs, however, significant problems remain. One lingering issue is the choice of the hidden state space's cardinality. While standard parametric model selection methods can be adapted to the HMM, there is little understanding of the strengths and weaknesses of such methods in this setting.

Recently, Teh et al. (2006) presented a nonparametric Bayesian approach to HMMs in which a stochastic process, the hierarchical Dirichlet process (HDP), defines a prior distribution on transition matrices over countably infinite state spaces. The resulting HDP-HMM leads to data-driven learning algorithms which infer posterior distributions over the number of states. This posterior uncertainty can be integrated out when making predictions, effectively averaging over models of varying complexity. The HDP-HMM has shown promise in a variety of applications, including visual scene recognition (Kivinen et al., 2007) and the modeling of genetic recombination (Xing & Sohn, 2007).

One serious limitation of the standard HDP-HMM is that it inadequately models the temporal persistence of states. This problem arises in classical finite HMMs as well, where semi-Markovian models are often proposed as solutions. However, the problem is exacerbated in the nonparametric setting, where the Bayesian bias towards simpler models is insufficient to prevent the HDP-HMM from learning models with unrealistically rapid dynamics, as demonstrated in Fig. 1.

Figure 1. Sensitivity of the HDP-HMM to within-state variations in the observations. (a) Observation sequence; (b) true state sequence; estimated state sequence after 100 Gibbs iterations for the (c) original and (d) sticky HDP-HMM, with errors indicated in red. Without an extra self-transition bias, the HDP-HMM rapidly transitions among redundant states.

To illustrate the seriousness of this issue, let us consider a challenging application that we revisit in Sec. 5. The problem of speaker diarization involves segmenting an audio recording into time intervals associated with individual speakers. This application seems like a natural fit for the HDP-HMM, as the number of true speakers is typically unknown, and may grow as more data is observed. However, this is not a setting in which model averaging is the goal; rather, it is critical to infer the number of speakers as well as the transitions among speakers. As we show in Sec. 5, the HDP-HMM's tendency to rapidly switch among redundant states leads to poor speaker diarization performance. In contrast, the methods that we develop in this paper yield a state-of-the-art speaker diarization method, as well as a general solution to the problem of state persistence in HDP-HMMs. The approach is easily stated: we simply augment the HDP-HMM to include a parameter for self-transition bias, and place a separate prior on this parameter. The challenge is to consistently execute this idea in a nonparametric Bayesian framework. Earlier papers have also proposed self-transition parameters for HMMs with infinite state spaces (Beal et al., 2002; Xing & Sohn, 2007), but did not formulate general solutions that integrate fully with nonparametric Bayesian inference.

While the HDP-HMM treats the state transition distribution nonparametrically, it is also desirable to allow more flexible, nonparametric emission distributions. In classical applications of HMMs, finite Gaussian mixtures are often used to model multimodal observations. Dirichlet process (DP) mixtures provide an appealing alternative which avoids fixing the number of observation modes. Such emission distributions are not identifiable for the standard HDP-HMM, due to the tendency to rapidly switch between redundant states. With an additional self-transition bias, however, we show that a fully nonparametric HMM leads to effective learning algorithms. In particular, we develop a blocked Gibbs sampler which leverages forward-backward recursions to jointly resample the state and emission assignments for all observations.

In Sec. 2, we begin by presenting background material on the HDP. Sec. 3 then links these nonparametric methods with HMMs, and extends them to account for state persistence. We further augment the model with multimodal emission distributions in Sec. 4, and present results using synthetic data and the NIST speaker diarization database in Sec. 5.

2. Background: Dirichlet Processes

A Dirichlet process (DP), denoted by $\mathrm{DP}(\gamma, H)$, is a distribution over countably infinite random measures

$$G_0(\theta) = \sum_{k=1}^{\infty} \beta_k \delta(\theta - \theta_k), \qquad \theta_k \sim H, \quad (1)$$

on a parameter space $\Theta$. The weights are sampled via a stick-breaking construction (Sethuraman, 1994):

$$\beta_k = \nu_k \prod_{\ell=1}^{k-1} (1 - \nu_\ell), \qquad \nu_k \sim \mathrm{Beta}(1, \gamma). \quad (2)$$

We denote this distribution by $\beta \sim \mathrm{GEM}(\gamma)$. The DP is commonly used as a prior on the parameters of a mixture model of unknown complexity, resulting in a DP mixture model (DPMM); see Fig. 2(a).
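To make the stick-breaking construction of Eq. (2) concrete, here is a minimal sketch (assuming NumPy; the truncation level and function name are ours, not the paper's) that draws an approximately GEM-distributed weight vector by truncating the infinite construction:

```python
import numpy as np

def sample_gem(gamma, K, rng):
    """Truncated stick-breaking draw of beta ~ GEM(gamma), cf. Eq. (2)."""
    nu = rng.beta(1.0, gamma, size=K)                   # nu_k ~ Beta(1, gamma)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    beta = nu * stick_left                              # beta_k = nu_k * prod_{l<k} (1 - nu_l)
    beta[-1] += 1.0 - beta.sum()                        # fold the leftover stick into the last atom
    return beta

rng = np.random.default_rng(0)
beta = sample_gem(gamma=2.0, K=25, rng=rng)
print(beta.sum())  # 1.0 up to floating point error
```

Smaller values of $\gamma$ concentrate the mass on a few components, while larger values spread it over many; the truncation simply absorbs the remaining stick into the final atom.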
To generate observations, we choose $\bar\theta_i \sim G_0$ and $y_i \sim F(\bar\theta_i)$. This sampling process is often described via a discrete variable $z_i \sim \beta$ indicating which component generates $y_i \sim F(\theta_{z_i})$.

The hierarchical Dirichlet process (HDP) (Teh et al., 2006) extends the DP to cases in which groups of data are produced by related, yet unique, generative processes. Taking a hierarchical Bayesian approach, the HDP places a global Dirichlet process prior $\mathrm{DP}(\gamma, H)$ on $\Theta$, and then draws group-specific distributions $G_j \sim \mathrm{DP}(\alpha, G_0)$. Here, the base measure $G_0$ acts as an "average" distribution ($E[G_j] = G_0$) encoding the frequency of each shared, global parameter:

$$G_j(\theta) = \sum_{t=1}^{\infty} \tilde\pi_{jt}\, \delta(\theta - \tilde\theta_{jt}), \qquad \tilde\pi_j \sim \mathrm{GEM}(\alpha), \quad \tilde\theta_{jt} \sim G_0. \quad (3)$$

Because $G_0$ is discrete, multiple $\tilde\theta_{jt} \sim G_0$ may take identical values $\theta_k$. Eq. (4) aggregates these probabilities, allowing an observation $y_{ji}$ to be directly associated with the unique global parameters via an indicator random variable $z_{ji} \sim \pi_j$:

$$G_j(\theta) = \sum_{k=1}^{\infty} \pi_{jk}\, \delta(\theta - \theta_k), \qquad \pi_j \sim \mathrm{DP}(\alpha, \beta). \quad (4)$$

See Fig. 2(b). We can alternatively represent this generative process via indicator variables $t_{ji} \sim \tilde\pi_j$ and $k_{jt} \sim \beta$, as in Fig. 2(c). The stick-breaking priors on these mixture weights can be analytically marginalized, yielding simple forms for the predictive distributions of assignments. The resulting distribution on partitions is sometimes described using the metaphor of a Chinese restaurant franchise (CRF). There are $J$ restaurants (groups), each with infinitely many tables (clusters) at which customers (observations) sit. Upon entering the $j$th restaurant, customer $y_{ji}$ sits at a currently occupied table $t_{ji}$ with probability proportional to the number of currently seated customers, or starts a new table $\tilde t$ with probability proportional to $\alpha$. Each table chooses a dish (parameter) $\tilde\theta_{jt} = \theta_{k_{jt}}$ with probability proportional to the number of other tables in the franchise that ordered that dish, or orders a new dish $\theta_{\tilde k}$ with probability proportional to $\gamma$. Observation $y_{ji}$ is then generated by global parameter $\theta_{z_{ji}} = \tilde\theta_{j t_{ji}} = \theta_{k_{j t_{ji}}}$.

Figure 2. (a) DPMM in which $\beta \sim \mathrm{GEM}(\gamma)$, $\theta_k \sim H(\lambda)$, $z_i \sim \beta$, and $y_i \sim f(y \mid \theta_{z_i})$. (b) HDP mixture model with $\beta \sim \mathrm{GEM}(\gamma)$, $\pi_j \sim \mathrm{DP}(\alpha, \beta)$, $\theta_k \sim H(\lambda)$, $z_{ji} \sim \pi_j$, and $y_{ji} \sim f(y \mid \theta_{z_{ji}})$. (c) CRF with loyal customers. Customers $y_{ji}$ sit at table $t_{ji} \sim \tilde\pi_j$, which considers dish $\bar k_{jt} \sim \beta$, but override variables $w_{jt} \sim \mathrm{Ber}(\kappa/(\alpha+\kappa))$ can force the served dish $k_{jt}$ to be $j$. The original CRF, as described in Sec. 2, has $\kappa = 0$ so that $k_{jt} = \bar k_{jt}$.

An alternative, non-constructive characterization of samples $G_0 \sim \mathrm{DP}(\gamma, H)$ from a Dirichlet process states that for every finite partition $\{A_1, \ldots, A_K\}$ of $\Theta$,

$$(G_0(A_1), \ldots, G_0(A_K)) \sim \mathrm{Dir}(\gamma H(A_1), \ldots, \gamma H(A_K)). \quad (5)$$

Using this expression, it can be shown that the following finite, hierarchical mixture model converges in distribution to the HDP as $L \to \infty$ (Ishwaran & Zarepour, 2002; Teh et al., 2006):

$$\beta \sim \mathrm{Dir}(\gamma/L, \ldots, \gamma/L), \qquad \pi_j \sim \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_L). \quad (6)$$

Later sections use this weak limit approximation to develop efficient, blocked sampling algorithms.

3. The Sticky HDP-HMM

The HDP can be used to develop an HMM with an unknown, potentially infinite state space (Teh et al., 2006). For this HDP-HMM, each HDP group-specific distribution, $\pi_j$, is a state-specific transition distribution and, due to the infinite state space, there are infinitely many groups. Let $z_t$ denote the state of the Markov chain at time $t$. For Markov chains $z_t \sim \pi_{z_{t-1}}$, so that $z_{t-1}$ indexes the group to which $y_t$ is assigned. The current HMM state $z_t$ then indexes the parameter $\theta_{z_t}$ used to generate observation $y_t$ (see Fig. 3).

Figure 3. Graph of the sticky HDP-HMM. The state evolves as $z_{t+1} \sim \pi_{z_t}$, where $\pi_k \sim \mathrm{DP}(\alpha + \kappa, (\alpha\beta + \kappa\delta_k)/(\alpha + \kappa))$ and $\beta \sim \mathrm{GEM}(\gamma)$, and observations are generated as $y_t \sim F(\theta_{z_t})$. The original HDP-HMM has $\kappa = 0$.

By sampling $\pi_j \sim \mathrm{DP}(\alpha, \beta)$, the HDP prior encourages states to have similar transition distributions ($E[\pi_{jk}] = \beta_k$). However, it does not differentiate self-transitions from moves between states.
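As a concrete, non-sticky illustration of the HDP-HMM and the weak limit approximation of Eq. (6), the sketch below (NumPy assumed; the truncation level, concentration values, Gaussian emissions, and initial-state choice are illustrative assumptions, not specifications from the paper) samples a finite transition matrix and simulates a state and observation sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
L, T = 10, 500                     # truncation level and sequence length
gamma, alpha = 3.0, 5.0            # concentration parameters

beta = rng.dirichlet(np.full(L, gamma / L))                      # Eq. (6): beta ~ Dir(gamma/L, ...)
pi = np.stack([rng.dirichlet(alpha * beta) for _ in range(L)])   # pi_j ~ Dir(alpha * beta)
mu = rng.normal(0.0, 10.0, size=L)                               # illustrative Gaussian means theta_k

z = np.empty(T, dtype=int)
z[0] = rng.choice(L, p=beta)                                     # illustrative initial-state draw
for t in range(1, T):
    z[t] = rng.choice(L, p=pi[z[t - 1]])                         # z_t ~ pi_{z_{t-1}}
y = rng.normal(mu[z], 1.0)                                       # y_t ~ F(theta_{z_t})
```

Because every row is drawn around the same mean weights $\beta$, the rows are similar across states, but nothing in this prior favors large diagonal entries; that is the gap the sticky extension below addresses.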
When modeling systems with state persistence, the flexible nature of the HDP-HMM prior allows state sequences with unrealistically fast dynamics to have large posterior probability. For example, with Gaussian emissions, as in Fig. 1, a good explanation of the data is to divide an observation block into two small-variance states with slightly different means, and then rapidly switch between them. In such cases, many models with redundant states may have large posterior probability, thus impeding our ability to identify a single dynamical model which best explains the observations. The problem is compounded by the fact that once this alternating pattern has been instantiated by the sampler, its persistence is then reinforced by the properties of the Chinese restaurant franchise, thus slowing mixing rates. Furthermore, when observations are high-dimensional, this fragmentation of data into redundant states may reduce predictive performance. In many applications, one would thus like to be able to incorporate prior knowledge that slow, smoothly varying dynamics are more likely.

To address these issues, we propose to instead sample transition distributions $\pi_j$ as follows:

$$\pi_j \sim \mathrm{DP}\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right). \quad (7)$$

Here, $(\alpha\beta + \kappa\delta_j)$ indicates that an amount $\kappa > 0$ is added to the $j$th component of $\alpha\beta$. The measure of $\pi_j$ over a finite partition $(Z_1, \ldots, Z_K)$ of the positive integers $\mathbb{Z}_+$, as described by Eq. (5), adds an amount $\kappa$ only to the arbitrarily small partition containing $j$, corresponding to a self-transition. When $\kappa = 0$ the original HDP-HMM is recovered. Because positive $\kappa$ values increase the prior probability $E[\pi_{jj}]$ of self-transitions, we refer to this extension as the sticky HDP-HMM. In some ways, this parameter is reminiscent of the infinite HMM's self-transition bias (Beal et al., 2002). However, that paper relied on a heuristic, approximate Gibbs sampler. The full connection between the infinite HMM and an underlying nonparametric Bayesian prior, as well as the development of a globally consistent inference algorithm, was made in Teh et al. (2006), but without a treatment of a self-transition parameter.
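Under the weak limit approximation, Eq. (7) amounts to adding $\kappa$ to the $j$th entry of the Dirichlet parameter for the $j$th transition row. The following sketch (NumPy assumed; the specific values of $\alpha$, $\kappa$, $\gamma$, and $L$ are illustrative) contrasts the average prior self-transition mass with and without the sticky bias:

```python
import numpy as np

rng = np.random.default_rng(2)
L, gamma, alpha, kappa = 10, 3.0, 5.0, 20.0
beta = rng.dirichlet(np.full(L, gamma / L))

def sample_row(j, kappa):
    """pi_j ~ Dir(alpha*beta_1, ..., alpha*beta_j + kappa, ..., alpha*beta_L), cf. Eq. (7)."""
    return rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])

diag_plain = np.mean([sample_row(j, 0.0)[j] for j in range(L)])
diag_sticky = np.mean([sample_row(j, kappa)[j] for j in range(L)])
print(diag_plain, diag_sticky)
# E[pi_jj] = (alpha*beta_j + kappa) / (alpha + kappa), so sticky rows place far more
# prior mass on self-transitions while the relative off-diagonal probabilities are
# still governed by beta.
```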
3.1. A CRF with Loyal Customers

We further abuse the Chinese restaurant metaphor by extending it to the sticky HDP-HMM, where our franchise now has restaurants with loyal customers. Each restaurant has a specialty dish with the same index as that of the restaurant. Although this dish is served elsewhere, it is more popular in the dish's namesake restaurant. We see this increased popularity from the fact that a table's dish is now drawn as

$$k_{jt} \sim \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}. \quad (8)$$

We will refer to $z_t$ as the parent and $z_{t+1}$ as the child. The parent enters a restaurant $j$ determined by its parent (the grandparent), $z_{t-1} = j$. We assume there is a bijective mapping of indices $f : t \to ji$. The parent then chooses a table $t_{ji} \sim \tilde\pi_j$ and that table is served a dish indexed by $k_{j t_{ji}}$. Noting that $z_t = z_{ji} = k_{j t_{ji}}$, the increased popularity of the house specialty dish implies that children are more likely to eat in the same restaurant as their parent and, in turn, more likely to eat the restaurant's specialty dish. This develops family loyalty to a given restaurant in the franchise. However, if the parent chooses a dish other than the house specialty, the child will then go to the restaurant where this dish is the specialty and will in turn be more likely to eat this dish, too. One might say that for the sticky HDP-HMM, children have similar taste buds to their parents and will always go to the restaurant that prepares their parent's dish best. Often, this keeps many generations eating in the same restaurant.

The inference algorithm is simplified if we introduce a set of auxiliary random variables $\bar k_{jt}$ and $w_{jt}$ as follows:

$$\bar k_{jt} \sim \beta, \qquad w_{jt} \sim \mathrm{Ber}\left(\frac{\kappa}{\alpha + \kappa}\right), \qquad k_{jt} = \begin{cases} \bar k_{jt}, & w_{jt} = 0; \\ j, & w_{jt} = 1, \end{cases} \quad (9)$$

where $\mathrm{Ber}(p)$ represents the Bernoulli distribution. The table first chooses a dish $\bar k_{jt}$ without taking the restaurant's specialty into consideration (i.e., the original CRF). With some probability, this considered dish is overridden (perhaps by a waiter's suggestion) and the table is served the specialty dish $j$. Thus, $k_{jt}$ represents the served dish. We refer to $w_{jt}$ as the override variable. For the original HDP-HMM, when $\kappa = 0$, the considered dish is always the served dish since $w_{jt} = 0$ for all tables. See Fig. 2(c).

3.2. Sampling via Direct Assignments

In this section we describe a modified version of the direct assignment Rao-Blackwellized Gibbs sampler of Teh et al. (2006) which circumvents the complicated bookkeeping of the CRF by sampling indicator random variables directly. Throughout this section, we refer to the variables in the graph of Fig. 3. For this sampler, a set of auxiliary variables $m_{jk}$, $\bar m_{jk}$, and $w_{jt}$ must be added (as illustrated in Fig. 2(c)).

Sampling $z_t$. The posterior distribution factors as:

$$p(z_t = k \mid z_{\setminus t}, y_{1:T}, \beta, \alpha, \kappa, \lambda) \propto p(z_t = k \mid z_{\setminus t}, \beta, \alpha, \kappa)\, p(y_t \mid y_{\setminus t}, z_t = k, z_{\setminus t}, \lambda). \quad (10)$$

The properties of the Dirichlet process dictate that on the finite partition $\{1, \ldots, K, \tilde k\}$ we have the following form for the group-specific transition distributions:

$$\pi_j \sim \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_j + \kappa, \ldots, \alpha\beta_K, \alpha\beta_{\tilde k}). \quad (11)$$

We use the above definition of $\pi_j$ and the Dirichlet distribution's conjugacy to the multinomial observations $z_t$ to marginalize $\pi_j$ and derive the following conditional distribution over the state assignments:

$$p(z_t = k \mid z_{\setminus t}, \beta, \alpha, \kappa) \propto \left(\alpha\beta_k + n^{-t}_{z_{t-1} k} + \kappa\,\delta(z_{t-1}, k)\right) \frac{\alpha\beta_{z_{t+1}} + n^{-t}_{k z_{t+1}} + \kappa\,\delta(k, z_{t+1}) + \delta(z_{t-1}, k)\,\delta(k, z_{t+1})}{\alpha + n^{-t}_{k\cdot} + \kappa + \delta(z_{t-1}, k)}. \quad (12)$$

This formula is more complex than that of the standard HDP sampler due to potential dependencies in the marginalization of $\pi_{z_{t-1}}$ and $\pi_{z_t}$. For a detailed derivation, see Fox et al. (2007). The notation $n_{jk}$ represents the number of Markov chain transitions from state $j$ to $k$, $n_{j\cdot} = \sum_k n_{jk}$, and $n^{-t}_{jk}$ the number of transitions from state $j$ to $k$ not counting the transition $z_{t-1}$ to $z_t$ or $z_t$ to $z_{t+1}$. Intuitively, this expression chooses a state $k$ with probability depending on how many times we have seen other $z_{t-1}$ to $k$ and $k$ to $z_{t+1}$ transitions. Note that there is a dependency on whether either or both of these transitions correspond to a self-transition, which is strongest when $\kappa > 0$. As in Teh et al. (2006), by placing a conjugate prior on the parameter space, there is a closed analytic form for the likelihood component $p(y_t \mid y_{\setminus t}, z_t = k, z_{\setminus t}, \lambda)$.
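The collapsed update of Eq. (12) is easy to evaluate once the transition counts excluding time $t$ are in hand. The sketch below (NumPy assumed; names are ours, and for brevity it scores only the $K$ currently instantiated states, omitting the additional term for a new state) illustrates its structure:

```python
import numpy as np

def z_conditional(n, beta, alpha, kappa, z_prev, z_next):
    """Normalized p(z_t = k | ...) over the K instantiated states, following Eq. (12).
    `n` holds the counts n^{-t}_{jk}, with the transitions around time t removed."""
    K = len(beta)
    k = np.arange(K)
    into = alpha * beta + n[z_prev] + kappa * (k == z_prev)
    out = (alpha * beta[z_next] + n[:, z_next] + kappa * (k == z_next)
           + (k == z_prev) * (k == z_next))
    denom = alpha + n.sum(axis=1) + kappa + (k == z_prev)
    prob = into * out / denom
    return prob / prob.sum()

rng = np.random.default_rng(3)
n = rng.integers(0, 20, size=(4, 4)).astype(float)   # toy counts for K = 4 states
beta = np.full(4, 0.25)
print(z_conditional(n, beta, alpha=5.0, kappa=20.0, z_prev=2, z_next=2))
```

With $\kappa > 0$, the candidate $k = z_{t-1} = z_{t+1}$ receives extra mass from both the incoming and outgoing factors, which is exactly the self-transition dependency discussed above.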
Sampling $\beta$. Assume there are currently $\bar K$ unique dishes being considered and take a finite partition $\{\theta_1, \theta_2, \ldots, \theta_{\bar K}, \theta_{\tilde k}\}$ of $\Theta$, where $\theta_{\tilde k} = \Theta \setminus \bigcup_{k=1}^{\bar K} \{\theta_k\}$. Since $\tilde\theta_{jt} \sim G_0$ and $\bar m_{\cdot k}$ tables are considering dish $k$, the properties of the Dirichlet distribution dictate:

$$p((\beta_1, \ldots, \beta_{\bar K}, \beta_{\tilde k}) \mid \bar k, \gamma) \propto \mathrm{Dir}(\bar m_{\cdot 1}, \ldots, \bar m_{\cdot \bar K}, \gamma). \quad (13)$$

From the above, we see that $\{\bar m_{\cdot k}\}_{k=1}^{\bar K}$ is a set of sufficient statistics for resampling $\beta$ on this partition. However, this requires sampling two additional variables, $m_{jk}$ and $w_{jt}$, corresponding to the number of tables in restaurant $j$ served dish $k$ and the corresponding override variables. We jointly sample from

$$p(m, w, \bar m \mid z_{1:T}, \beta, \alpha, \kappa) = p(\bar m \mid m, w, z_{1:T}, \beta, \alpha, \kappa)\, p(w \mid m, z_{1:T}, \beta, \alpha, \kappa)\, p(m \mid z_{1:T}, \beta, \alpha, \kappa). \quad (14)$$

We start by examining $p(m \mid z_{1:T}, \beta, \alpha, \kappa)$. Having the state index assignments $z_{1:T}$ effectively partitions the data (customers) into both restaurants and dishes, though the table assignments are unknown since multiple tables can be served the same dish. Thus, sampling $m_{jk}$ is in effect equivalent to sampling table assignments for each customer after knowing the dish assignment. This conditional distribution is given by:

$$p(t_{ji} = t \mid k_{jt} = k, t^{-ji}, k^{-jt}, y_{1:T}, \beta, \alpha, \kappa) \propto \begin{cases} \tilde n^{-ji}_{jt}, & t \in \{1, \ldots, T_j\}; \\ \alpha\beta_k + \kappa\,\delta(k, j), & t = \tilde t_j, \end{cases} \quad (15)$$

where $\tilde n^{-ji}_{jt}$ is the number of customers at table $t$ in restaurant $j$, not counting $y_{ji}$. The form of Eq. (15) implies that a customer's table assignment conditioned on a dish assignment $k$ follows a DP with concentration parameter $\alpha\beta_k + \kappa\,\delta(k, j)$ and may be sampled by simulating the associated Chinese restaurant process.

We now derive the conditional distribution for the override variables $w_{jt}$. The table counts provide that $m_{jk}$ tables are serving dish $k$ in restaurant $j$. If $k \neq j$, we automatically have $m_{jk}$ tables with $w_{jt} = 0$ since the served dish is not the house specialty. Otherwise,

$$p(w_{jt} \mid k_{jt} = j, \beta, \rho) \propto \begin{cases} \beta_j (1 - \rho), & w_{jt} = 0; \\ \rho, & w_{jt} = 1, \end{cases} \quad (16)$$

where $\rho = \kappa/(\alpha + \kappa)$ is the prior probability that $w_{jt} = 1$. Observing served dish $k_{jt} = j$ makes it more likely that the considered dish $\bar k_{jt}$ was overridden than the prior suggests. We draw $m_{jj}$ samples of $w_{jt}$ from Eq. (16). Given $m_{jk}$ for all $j$ and $k$, and $w_{jt}$ for each of these instantiated tables, we can now deterministically compute $\bar m_{jk}$. Any table that was overridden is an uninformative observation for the posterior of $\bar m_{jk}$, so that

$$\bar m_{jk} = \begin{cases} m_{jk}, & j \neq k; \\ m_{jj} - w_{j\cdot}, & j = k. \end{cases} \quad (17)$$

Sampling Hyperparameters. Rather than fixing the sticky HDP-HMM's hyperparameters, we place vague gamma priors on $\gamma$ and $(\alpha + \kappa)$, and a beta prior on $\kappa/(\alpha + \kappa)$. As detailed in Fox et al. (2007), the auxiliary variables introduced in the preceding section then allow tractable resampling of these hyperparameters. This allows the number of occupied states, and the degree of self-transition bias, to be strongly influenced by the statistics of observed data, as desired.
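A sketch of the auxiliary variable step of Eqs. (15)-(17) (NumPy assumed; function names and the toy numbers are ours): table counts $m_{jk}$ are simulated from the Chinese restaurant process implied by Eq. (15), override variables are drawn from Eq. (16), and the informative counts of Eq. (17) follow deterministically.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_m_jk(n_jk, beta_k, is_specialty, alpha, kappa):
    """Number of tables in restaurant j serving dish k, via the CRP of Eq. (15)."""
    conc = alpha * beta_k + kappa * is_specialty
    tables = 0
    for i in range(int(n_jk)):                  # i customers with this dish already seated
        if rng.random() < conc / (i + conc):    # open a new table w.p. conc / (i + conc)
            tables += 1
    return tables

def sample_mbar_jj(m_jj, beta_j, rho):
    """Overrides for the specialty dish (Eq. 16) and the informative count (Eq. 17)."""
    p_override = rho / (rho + beta_j * (1.0 - rho))
    w = rng.random(m_jj) < p_override           # w_jt = 1 with the normalized probability of Eq. (16)
    return m_jj - int(w.sum())                  # mbar_jj = m_jj - w_j. ; for k != j, mbar_jk = m_jk

alpha, kappa, beta_j = 5.0, 20.0, 0.1
rho = kappa / (alpha + kappa)
m_jj = sample_m_jk(n_jk=40, beta_k=beta_j, is_specialty=1.0, alpha=alpha, kappa=kappa)
print(m_jj, sample_mbar_jj(m_jj, beta_j, rho))
```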
3.3. Blocked Sampling of State Sequences

The HDP-HMM direct assignment sampler can exhibit slow mixing rates since global state sequence changes are forced to occur coordinate by coordinate. This is explored in Scott (2002) for the finite HMM. Although the sticky HDP-HMM reduces the posterior uncertainty caused by fast state-switching explanations of the data, the self-transition bias can cause two continuous and temporally separated sets of observations of a given state to be grouped into two states. If this occurs, the high probability of self-transition makes it challenging for the sequential sampler to group those two examples into a single state.

A variant of the HMM forward-backward procedure (Rabiner, 1989) allows us to harness the Markov structure and jointly sample the state sequence $z_{1:T}$ given the observations $y_{1:T}$, transition probabilities $\pi_j$, and model parameters $\theta_k$. To take advantage of this procedure, we now must sample the previously marginalized transition distributions and model parameters. In practice, this requires approximating the theoretically countably infinite transition distributions. One approach is the degree $L$ weak limit approximation to the DP (Ishwaran & Zarepour, 2002),

$$\mathrm{GEM}_L(\gamma) \triangleq \mathrm{Dir}(\gamma/L, \ldots, \gamma/L), \quad (18)$$

where $L$ is a number that exceeds the total number of expected HMM states. This approximation encourages the learning of models with fewer than $L$ components while allowing the generation of new components, upper bounded by $L$, as new data are observed. The posterior distributions of $\beta$ and $\pi_j$ are given by:

$$\beta \sim \mathrm{Dir}(\gamma/L + \bar m_{\cdot 1}, \ldots, \gamma/L + \bar m_{\cdot L}), \qquad \pi_j \sim \mathrm{Dir}(\alpha\beta_1 + n_{j1}, \ldots, \alpha\beta_j + \kappa + n_{jj}, \ldots, \alpha\beta_L + n_{jL}). \quad (19)$$

Depending on the form of the emission distribution and base measure on the parameter space $\Theta$, we sample parameters for each of the currently instantiated states from the updated posterior distribution:

$$\theta_j \sim p(\theta \mid \{y_t \mid z_t = j\}, \lambda). \quad (20)$$

Now that we are sampling $\theta_j$ directly, we can use a non-conjugate base measure. We block sample $z_{1:T}$ by first computing backward messages $m_{t,t-1}(z_{t-1}) \propto p(y_{t:T} \mid z_{t-1}, \pi, \theta)$ and then recursively sampling each $z_t$ conditioned on $z_{t-1}$ from

$$p(z_t \mid z_{t-1}, y_{1:T}, \pi, \theta) \propto p(z_t \mid \pi_{z_{t-1}})\, p(y_t \mid \theta_{z_t})\, m_{t+1,t}(z_t). \quad (21)$$

A similar sampler has been used for learning HDP hidden Markov trees (Kivinen et al., 2007). However, this work did not consider the complications introduced by multimodal emissions, as we explore next.
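The following sketch of the blocked update (NumPy assumed; scalar Gaussian emissions, the uniform initial-state distribution, and all names are illustrative assumptions) computes backward messages and then samples $z_{1:T}$ forward in time, as in Eq. (21). A practical implementation would work in the log domain or renormalize messages, as done here, to avoid underflow:

```python
import numpy as np

def block_sample_z(y, pi, mu, sigma, rng):
    """Jointly resample z_{1:T} given transition rows pi and Gaussian emission
    parameters (mu, sigma), following Eq. (21)."""
    T, L = len(y), pi.shape[0]
    lik = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    msg = np.ones((T, L))                   # msg[t, k] plays the role of m_{t+1,t}(z_t = k)
    for t in range(T - 2, -1, -1):
        msg[t] = pi @ (lik[t + 1] * msg[t + 1])
        msg[t] /= msg[t].sum()              # renormalize for numerical stability
    z = np.empty(T, dtype=int)
    prob = np.full(L, 1.0 / L) * lik[0] * msg[0]     # illustrative uniform initial distribution
    z[0] = rng.choice(L, p=prob / prob.sum())
    for t in range(1, T):
        prob = pi[z[t - 1]] * lik[t] * msg[t]        # p(z_t | pi_{z_{t-1}}) p(y_t | theta_{z_t}) m_{t+1,t}(z_t)
        z[t] = rng.choice(L, p=prob / prob.sum())
    return z
```

In the full sampler, this step alternates with the conjugate draws of $\beta$, $\pi_j$, and $\theta_j$ in Eqs. (19) and (20).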
4. Multimodal Emission Distributions

For many application domains, the data associated with each hidden state may have a complex, multimodal distribution. We propose to approximate such emission distributions nonparametrically, using an infinite DP mixture of Gaussians. This formulation is related to the nested DP (Rodriguez et al., 2006). The bias towards self-transitions allows us to distinguish between the underlying HDP-HMM states. If the model were free to both rapidly switch between HDP-HMM states and associate multiple Gaussians per state, there would be considerable posterior uncertainty. Thus, it is only with the sticky HDP-HMM that we can effectively learn such models.

We augment the HDP-HMM state $z_t$ with a term $s_t$ indexing the mixture component of the $z_t$th emission density. For each HDP-HMM state, there is a unique stick-breaking distribution $\psi_k \sim \mathrm{GEM}(\sigma)$ defining the mixture weights of the $k$th emission density, so that $s_t \sim \psi_{z_t}$. The observation $y_t$ is then generated by the Gaussian component with parameter $\theta_{z_t, s_t}$. See Fig. 4.

Figure 4. Sticky HDP-HMM with DP emissions, where $s_t$ indexes the state-specific mixture component generating observation $y_t$. The DP prior dictates that $s_t \sim \psi_{z_t}$ for $\psi_k \sim \mathrm{GEM}(\sigma)$. The $j$th Gaussian component of the $k$th mixture density is parameterized by $\theta_{k,j}$, so $y_t \sim F(\theta_{z_t, s_t})$.

To implement blocked resampling of $(z_{1:T}, s_{1:T})$, we use weak limit approximations to both the HDP-HMM and the Dirichlet process emissions, approximated to levels $L$ and $L'$, respectively. The posterior distributions of $\beta$ and $\pi_j$ remain unchanged; that of $\psi_k$ is given by:

$$\psi_k \sim \mathrm{Dir}(\sigma/L' + n'_{k1}, \ldots, \sigma/L' + n'_{kL'}), \quad (22)$$

where $n'_{kl}$ is the number of observations assigned to the $l$th mixture component of the $k$th HMM state. The posterior distribution for each Gaussian's mean and covariance, $\theta_{k,j}$, is determined by the observations assigned to this component, namely,

$$\theta_{k,j} \sim p(\theta \mid \{y_t \mid (z_t = k, s_t = j)\}, \lambda). \quad (23)$$

The augmented state $(z_t, s_t)$ is sampled from

$$p(z_t, s_t \mid z_{t-1}, y_{1:T}, \pi, \psi, \theta) \propto p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, p(y_t \mid \theta_{z_t, s_t})\, m_{t+1,t}(z_t). \quad (24)$$

Since the Markov structure is only on the $z_t$ component of the augmented state, the backward message $m_{t,t-1}(z_{t-1})$ from $(z_t, s_t)$ to $(z_{t-1}, s_{t-1})$ is solely a function of $z_{t-1}$. These messages are given by:

$$m_{t,t-1}(z_{t-1}) \propto \sum_{z_t} \sum_{s_t} p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, p(y_t \mid \theta_{z_t, s_t})\, m_{t+1,t}(z_t). \quad (25)$$
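To illustrate how the mixture indicator enters this update (NumPy assumed; names and the scalar Gaussian parameterization are ours, a sketch rather than the paper's implementation), the code below computes the per-state emission term that appears inside the backward recursion of Eq. (25) by summing over $s_t$, and then draws $(z_t, s_t)$ as in Eq. (24): first $z_t$ using the marginalized term, then $s_t$ given $z_t$.

```python
import numpy as np

def emission_terms(y_t, psi, mu, sigma):
    """comp[k, j] = p(s_t = j | z_t = k) p(y_t | theta_{k,j}); summing over j gives
    the per-state factor appearing in Eq. (25). psi, mu, sigma are L x L' arrays."""
    dens = np.exp(-0.5 * ((y_t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    comp = psi * dens
    return comp, comp.sum(axis=1)

def sample_zs(y_t, z_prev, pi, psi, mu, sigma, msg_t, rng):
    """Draw (z_t, s_t) as in Eq. (24), with msg_t the backward message m_{t+1,t}."""
    comp, per_state = emission_terms(y_t, psi, mu, sigma)
    pz = pi[z_prev] * per_state * msg_t
    z_t = rng.choice(len(pz), p=pz / pz.sum())     # z_t, with s_t summed out
    ps = comp[z_t] / comp[z_t].sum()
    s_t = rng.choice(len(ps), p=ps)                # s_t | z_t
    return z_t, s_t
```

Drawing $z_t$ from the marginal and then $s_t$ from its conditional is an exact draw from the joint in Eq. (24), since the joint factors as $p(z_t \mid \cdot)\,p(s_t \mid z_t, \cdot)$.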
5. Results

Synthetic Data. We generated test data from a three-state Gaussian emission HMM with: 0.97 probability of self-transition; means 50, 0, and -50; and variances 50, 10, and 50 (see Fig. 1(a)). For the blocked sampler, we used a truncation level of L = 15. Fig. 5 shows the clear advantage of considering a sticky HDP-HMM with blocked sampling. The Hamming distance error is calculated by greedily mapping the indices of the estimated state sequence to those maximizing overlap with the true sequence. The apparent slow convergence of the sticky HDP-HMM direct assignment sampler (Fig. 5(b)) can be attributed to the sampler splitting temporally separated segments of a true state into multiple, redundant states. Although not depicted due to space constraints, both sticky HDP-HMM samplers result in estimated models with significantly larger likelihoods of the true state sequence than those of the original HDP-HMM.

Figure 5. Hamming distance between true and estimated state sequences over 100 iterations for the sticky HDP-HMM (a) blocked and (b) direct assignment samplers, and the original HDP-HMM (c) blocked and (d) direct assignment samplers. These plots show the median (solid blue) and 10th and 90th quantiles (dashed red) from 200 initializations.

To test the model of Sec. 4, we generated data from a two-state HMM, where each state had a two-Gaussian mixture emission distribution with equally weighted components defined by means (0, 10) and (-7, 7), and variances of 10. The probability of self-transition was set to 0.98. The resulting observation and true state sequences are shown in Fig. 6(a) and (b). Fig. 6(e)-(h) compares the performance of the sticky and original HDP-HMM with single and infinite Gaussian mixture emissions. All results are for the blocked sampler with truncation levels L = L' = 15. Intuitively, when constrained to single Gaussian emissions, the best explanation of the data is to associate each true mixture component with a separate state and then quickly switch between these states, resulting in the large Hamming distances of Fig. 6(g)-(h). Although not the desired effect in this scenario, this behavior, as depicted in Fig. 6(c), demonstrates the flexibility of the sticky HDP-HMM: if the best explanation of the data according to the model is fast state-switching, the sticky HDP-HMM still allows for this by learning a small bias towards self-transitions. The sticky HDP-HMM occasionally has more accurate state sequence estimates by grouping a true state's Gaussian mixture components into a single Gaussian with large variance. By far the best performance is achieved by the sticky HDP-HMM with infinite Gaussian mixture emissions (see Fig. 6(e) and (d)); comparing to Fig. 6(f), we see that the gain can be attributed to modeling rather than just improved mixing rates.

Figure 6. Performance of inference on data generated by an HMM with Gaussian mixture emissions. (a) Observation sequence; (b) true HMM state sequence; estimated HMM state sequence using the sticky HDP-HMM model with (c) single and (d) infinite Gaussian mixture emissions. Errors are indicated by red markers. The bottom row contains Hamming distance plots, as in Fig. 5, for infinite Gaussian mixture emissions and the (e) sticky HDP-HMM and (f) original HDP-HMM, and single Gaussian emissions for the (g) sticky HDP-HMM and (h) original HDP-HMM.
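The normalized Hamming distances reported above require mapping the arbitrary labels of the estimated sequence onto the true labels. One reasonable reading of the greedy mapping described above is sketched below (NumPy assumed; this is our illustration, not the authors' evaluation code):

```python
import numpy as np

def greedy_hamming(true_z, est_z):
    """Greedily map estimated labels to true labels to maximize overlap, then return
    the normalized Hamming distance of the relabeled sequence."""
    true_z, est_z = np.asarray(true_z), np.asarray(est_z)
    labels, counts = np.unique(est_z, return_counts=True)
    mapping, used = {}, set()
    for lab in labels[np.argsort(-counts)]:             # most frequent estimated labels first
        overlaps = [(np.sum((est_z == lab) & (true_z == t)), t)
                    for t in np.unique(true_z) if t not in used]
        mapping[lab] = max(overlaps)[1] if overlaps else -1   # -1 marks a redundant state
        if overlaps:
            used.add(mapping[lab])
    relabeled = np.array([mapping[l] for l in est_z])
    return float(np.mean(relabeled != true_z))

print(greedy_hamming([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 3, 7]))   # 0.1667: one mismatched time step
```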
Speaker Diarization Data. The speaker diarization task involves segmenting an audio recording into speaker-homogeneous regions, while simultaneously identifying the number of speakers. We tested the utility of the sticky HDP-HMM for this task on the data distributed by NIST as part of the Rich Transcription 2004-2007 meeting recognition evaluations (NIST, 2007). We use the first 19 Mel Frequency Cepstral Coefficients (MFCCs), computed over a 30ms window every 10ms, as our feature vector. When working with this dataset, we discovered that: (1) the high frequency content of these features contained little discriminative information, and (2) without a minimum speaker duration, the sticky HDP-HMM learned within-speaker dynamics in addition to global speaker changes. To jointly address these issues, we instead model feature averages computed over 250ms, non-overlapping blocks. A minimum speaker duration of 500ms is set by associating two average features with each hidden state. We also tie the covariances of within-state mixture components. We found single-Gaussian emission distributions to be less effective.

For each of 21 meetings, we compare 10 initializations of the original and sticky HDP-HMM blocked samplers. In Fig. 8(a), we report the official NIST diarization error rate (DER) of the run with the largest observation sequence likelihood, given parameters estimated at the 1000th Gibbs iteration. The sticky HDP-HMM's temporal smoothing provides substantial performance gains. Fig. 8(b) plots the estimated versus true number of speakers who talk for more than 10% of the meeting time, and shows our model's ability to adapt to a varying number of speakers. As a further comparison, the ICSI team's algorithm (Wooters & Huijbregts, 2007), by far the best performer at the 2007 competition, has an overall DER of 18.37%, similar to our 19.04%. Our best and worst DER are 1.26% and 31.42%, respectively, compared to their 4.39% and 32.23%. We use the same non-speech pre-processing, so that the differences are due to changes in the identified speakers. As depicted in Fig. 7, a significant proportion of our errors can be attributed to splitting or merging speakers. The ICSI team's algorithm uses agglomerative clustering, and requires significant tuning of parameters on representative training data. In contrast, our hyperparameters are automatically set meeting-by-meeting, so that each component's expected mean and covariance are that of the entire feature sequence. Note that the selected runs plotted in Fig. 8 are not necessarily those with the smallest DER. For example, the run depicted in Fig. 7(d) had 24.06% DER, while another run on the same meeting had 4.37% (versus ICSI's 22.00%). There is inherent posterior uncertainty in this task, and our sampler has the advantage of giving several interpretations.

Figure 7. True state sequences for meetings (a) AMI 20041210-1052 and (c) VT 20050304-1300, with the corresponding most likely state estimates shown in (b) and (d), respectively, with incorrect labels shown in red.

Figure 8. For the 21 meeting database: (a) plot of sticky vs. original HDP-HMM most likely sequence DER; and (b) plot of true vs. estimated number of speakers for samples drawn from 10 random initializations of each meeting (larger circles have higher likelihood).
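The feature preprocessing described above reduces to simple block averaging of the frame-level MFCCs; a minimal sketch (NumPy assumed; 25 frames per block follows from 10ms frames and 250ms blocks, and the array shapes are illustrative):

```python
import numpy as np

def block_average(mfcc, frames_per_block=25):
    """Average 10ms MFCC frames over non-overlapping 250ms blocks."""
    T = (len(mfcc) // frames_per_block) * frames_per_block   # drop the ragged tail
    return mfcc[:T].reshape(-1, frames_per_block, mfcc.shape[1]).mean(axis=1)

features = block_average(np.random.randn(60000, 19))   # roughly ten minutes of 19-dim MFCCs
print(features.shape)                                   # (2400, 19)
```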
When considering the best per-meeting DER for the five most likely samples, our overall DER drops to 15.14%; we hope to explore automated ways of combining multiple samples in future work. Regardless, our results demonstrate that the sticky HDP-HMM provides an elegant and empirically effective speaker diarization method.

6. Discussion

We have demonstrated the considerable benefits of an extended HDP-HMM in which a separate parameter captures state persistence. We have also shown that this sticky HDP-HMM allows a fully nonparametric treatment of multimodal emissions, disambiguated by its bias towards self-transitions, and presented efficient sampling techniques with mixing rates that improve on the state of the art. Results on synthetic data, and a challenging speaker diarization task, clearly demonstrate the practical importance of our extensions.

Acknowledgments

We thank O. Vinyals, G. Friedland, and N. Morgan for helpful discussions about the NIST dataset. This research was supported in part by DARPA contract NBCHD030010, and MURIs funded through ARO Grant W911NF-06-10076 and AFOSR Grant FA9550-06-1-0324. E.B.F. was partially funded by an NDSEG fellowship.

References

Beal, M. J., Ghahramani, Z., & Rasmussen, C. E. (2002). The infinite hidden Markov model. NIPS (pp. 577-584).

Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2007). A tempered HDP-HMM for systems with state persistence. MIT LIDS, TR #2777.

Ishwaran, H., & Zarepour, M. (2002). Exact and approximate sum-representations for the Dirichlet process. Can. J. Stat., 30, 269-283.

Kivinen, J. J., Sudderth, E. B., & Jordan, M. I. (2007). Learning multiscale representations of natural scenes using Dirichlet processes. ICCV (pp. 1-8).

NIST (2007). Rich transcriptions database. http://www.nist.gov/speech/tests/rt/.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.

Rodriguez, A., Dunson, D., & Gelfand, A. (2006). The nested Dirichlet process. Duke ISDS, TR #06-19.

Scott, S. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Amer. Stat. Assoc., 97, 337-351.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Stat. Sinica, 4, 639-650.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Stat. Assoc., 101, 1566-1581.

Wooters, C., & Huijbregts, M. (2007). The ICSI RT07s speaker diarization system. To appear in LNCS.

Xing, E., & Sohn, K.-A. (2007). Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayes. Analysis, 2, 501-528.