Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion

Jacob Eisenstein
Beckman Institute for Advanced Science and Technology
University of Illinois
Urbana, IL 61801
jacobe@illinois.edu

Abstract

This paper presents a novel unsupervised method for hierarchical topic segmentation. Lexical cohesion, the workhorse of unsupervised linear segmentation, is treated as a multi-scale phenomenon and formalized in a Bayesian setting. Each word token is modeled as a draw from a pyramid of latent topic models, where the structure of the pyramid is constrained to induce a hierarchical segmentation. Inference takes the form of a coordinate-ascent algorithm, iterating between two steps: a novel dynamic program for obtaining the globally-optimal hierarchical segmentation, and collapsed variational Bayesian inference over the hidden variables. The resulting system is fast and accurate, and compares well against heuristic alternatives.

1 Introduction

Recovering structural organization from unformatted texts or transcripts is a fundamental problem in natural language processing, with applications to classroom lectures, meeting transcripts, and chatroom logs. In the unsupervised setting, a variety of successful systems have leveraged lexical cohesion (Halliday and Hasan, 1976), the idea that topically-coherent segments display consistent lexical distributions (Hearst, 1994; Utiyama and Isahara, 2001; Eisenstein and Barzilay, 2008). However, such systems almost invariably focus on linear segmentation, while it is widely believed that discourse displays a hierarchical structure (Grosz and Sidner, 1986). This paper introduces the concept of multi-scale lexical cohesion, and leverages this idea in a Bayesian generative model for hierarchical topic segmentation.

The idea of multi-scale cohesion is illustrated by the following two examples, drawn from the Wikipedia entry for the city of Buenos Aires.

    There are over 150 city bus lines called Colectivos ... Colectivos in Buenos Aires do not have a fixed timetable, but run from 4 to several per hour, depending on the bus line and time of the day.

    The Buenos Aires metro has six lines, 74 stations, and 52.3 km of track. An expansion program is underway to extend existing lines into the outer neighborhoods. Track length is expected to reach 89 km ...

The two sections are both part of a high-level segment on transportation. Words in bold are characteristic of the subsections (buses and trains, respectively), and do not occur elsewhere in the transportation section; words in italics occur throughout the high-level section, but not elsewhere in the article. This paper shows how multi-scale cohesion can be captured in a Bayesian generative model and exploited for unsupervised hierarchical topic segmentation.

Latent topic models (Blei et al., 2003) provide a powerful statistical apparatus with which to study discourse structure. A consistent theme is the treatment of individual words as draws from multinomial language models indexed by a hidden "topic" associated with the word. In latent Dirichlet allocation (LDA) and related models, the hidden topic for each word is unconstrained and unrelated to the hidden topic of neighboring words (given the parameters). In this paper, the latent topics are constrained to produce a hierarchical segmentation structure, as shown in Figure 1.
Figure 1: Each word $w_t$ is drawn from a mixture of the language models located above $t$ in the pyramid.

These structural requirements simplify inference, allowing the language models to be analytically marginalized. The remaining hidden variables are the scale-level assignments for each word token. Given marginal distributions over these variables, it is possible to search the entire space of hierarchical segmentations in polynomial time, using a novel dynamic program. Collapsed variational Bayesian inference is then used to update the marginals. This approach achieves high quality segmentation on multiple levels of the topic hierarchy. Source code is available at http://people.csail.mit.edu/jacobe/naacl09.html.

2 Related Work

The use of lexical cohesion (Halliday and Hasan, 1976) in unsupervised topic segmentation dates back to Hearst's seminal TextTiling system (1994). Lexical cohesion was placed in a probabilistic (though not Bayesian) framework by Utiyama and Isahara (2001). The application of Bayesian topic models to text segmentation was investigated first by Blei and Moreno (2001) and later by Purver et al. (2006), using HMM-like graphical models for linear segmentation. Eisenstein and Barzilay (2008) extend this work by marginalizing the language models using the Dirichlet compound multinomial distribution; this permits efficient inference to be performed directly in the space of segmentations. All of these papers consider only linear topic segmentation; we introduce multi-scale lexical cohesion, which posits that the distribution of some words changes slowly with high-level topics, while others change rapidly with lower-level subtopics. This gives a principled mechanism to model hierarchical topic segmentation.

The literature on hierarchical topic segmentation is relatively sparse. Hsueh et al. (2006) describe a supervised approach that trains separate classifiers for topic and sub-topic segmentation; more relevant for the current work is the unsupervised method of Yaari (1997). As in TextTiling, cohesion is measured using cosine similarity, and agglomerative clustering is used to induce a dendrogram over paragraphs; the dendrogram is transformed into a hierarchical segmentation using a heuristic algorithm. Such heuristic approaches are typically brittle, as they include a number of parameters that must be hand-tuned. These problems can be avoided by working in a Bayesian probabilistic framework.

We note two orthogonal but related approaches to extracting nonlinear discourse structures from text. Rhetorical structure theory posits a hierarchical structure of discourse relations between spans of text (Mann and Thompson, 1988). This structure is richer than hierarchical topic segmentation, and the base level of analysis is typically more fine-grained, at the level of individual clauses. Unsupervised approaches based purely on cohesion are unlikely to succeed at this level of granularity. Elsner and Charniak (2008) propose the task of conversation disentanglement from internet chatroom logs. Unlike hierarchical topic segmentation, conversational threads may be disjoint, with unrelated threads interposed between two utterances from the same thread. Elsner and Charniak present a supervised approach to this problem, but the development of cohesion-based unsupervised methods is an interesting possibility for future work.
3 Model

Topic modeling is premised on a generative framework in which each word $w_t$ is drawn from a multinomial $\theta_{y_t}$, where $y_t$ is a hidden topic indexing the language model that generates $w_t$. From a modeling standpoint, linear topic segmentation merely adds the constraint that $y_t \in \{y_{t-1}, y_{t-1}+1\}$. Segmentations that draw boundaries so as to induce compact, low-entropy language models will achieve a high likelihood. Thus topic models situate lexical cohesion in a probabilistic setting.

For hierarchical segmentation, we take the hypothesis that lexical cohesion is a multi-scale phenomenon. This is represented with a pyramid of language models, shown in Figure 1. Each word may be drawn from any language model above it in the pyramid. Thus, the high-level language models will be required to explain words throughout large parts of the document, while the low-level language models will be required to explain only a local set of words. A hidden variable $z_t$ indicates which level of the pyramid is responsible for generating the word $w_t$.

Ideally we would like to choose the segmentation $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} p(\mathbf{w}|\mathbf{y})p(\mathbf{y})$. However, we must deal with the hidden language models $\theta$ and scale-level assignments $\mathbf{z}$. The language models can be integrated out analytically (Section 3.1). Given marginal likelihoods for the hidden variables $\mathbf{z}$, the globally-optimal segmentation $\hat{\mathbf{y}}$ can be found using a dynamic program (Section 4.1). Given a segmentation, we can estimate marginals for the hidden variables, using collapsed variational inference (Section 4.2). We iterate between these procedures in an EM-like coordinate-ascent algorithm (Section 4.4) until convergence.

3.1 Language models

We begin the formal presentation of the model with some notation. Each word $w_t$ is modeled as a single draw from a multinomial language model $\theta_j$. The language models in turn are drawn from symmetric Dirichlet distributions with parameter $\alpha$. The number of language models is written $K$; the number of words (the vocabulary size) is $W$; the length of the document is $T$; and the depth of the hierarchy is $L$. For hierarchical segmentation, the vector $\mathbf{y}_t$ indicates the segment index of $t$ at each level of the topic hierarchy; the specific level of the hierarchy responsible for $w_t$ is given by the hidden variable $z_t$. Thus, $y_t^{(z_t)}$ is the index of the language model that generates $w_t$.

With these pieces in place, we can write the observation likelihood,

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \theta) = \prod_t^T p(w_t \mid \theta_{y_t^{(z_t)}}) = \prod_j^K \prod_{\{t : y_t^{(z_t)} = j\}} p(w_t \mid \theta_j),$$

where we have merely rearranged the product to group terms that are drawn from the same language model. As the goal is to obtain the hierarchical segmentation and not the language models, the search space can be reduced by marginalizing $\theta$. The derivation is facilitated by a notational convenience: $\mathbf{x}_j$ represents the lexical counts induced by the set of words $\{w_t : y_t^{(z_t)} = j\}$.

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \alpha) = \prod_j^K \int d\theta_j \, p(\theta_j \mid \alpha) \, p(\mathbf{x}_j \mid \theta_j) = \prod_j^K p_{dcm}(\mathbf{x}_j; \alpha) = \prod_j^K \frac{\Gamma(W\alpha)}{\Gamma\!\left(\sum_i^W x_{ji} + W\alpha\right)} \prod_i^W \frac{\Gamma(x_{ji} + \alpha)}{\Gamma(\alpha)}. \quad (1)$$

Here, $p_{dcm}$ indicates the Dirichlet compound multinomial distribution (Madsen et al., 2005), which is the closed-form solution to the integral over language models. Also known as the multivariate Polya distribution, its probability density function can be computed exactly as a ratio of gamma functions. Here we use a symmetric Dirichlet prior $\alpha$, though asymmetric priors can easily be applied.

Thus far we have treated the hidden variables $\mathbf{z}$ as observed. In fact we will compute approximate marginal probabilities $Q_{z_t}(z_t)$, written $\gamma_{t\ell} \equiv Q_{z_t}(z_t = \ell)$.
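To make Equation 1 concrete, the following sketch (mine, not from the paper or its released code) evaluates the DCM log-likelihood of one segment's lexical count vector under a symmetric Dirichlet prior, working in log space with gamma functions for numerical stability; the function name and the toy counts are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts, alpha):
    """Log of the Dirichlet compound multinomial density (Eq. 1) for one
    segment: counts[i] holds x_{ji}, the (possibly fractional) count of
    word type i in segment j; alpha is the symmetric Dirichlet prior."""
    counts = np.asarray(counts, dtype=float)
    W = counts.shape[0]  # vocabulary size
    return (gammaln(W * alpha)
            - gammaln(counts.sum() + W * alpha)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

# Toy example: one segment over a five-word vocabulary.
x = np.array([3.0, 0.0, 1.0, 0.0, 2.0])
print(dcm_log_likelihood(x, alpha=0.1))
```

Because the density is a ratio of gamma functions, it is defined for fractional counts as well, which matters below when the counts are replaced by their expectations under $Q_z$.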
Writing $\langle x \rangle_{Q_z}$ for the expectation of $x$ under the distribution $Q_z$, we approximate,

$$p_{dcm}(\mathbf{x}_j; \alpha) \approx p_{dcm}\big(\langle \mathbf{x}_j \rangle_{Q_z}; \alpha\big)$$
$$\langle x_j(i) \rangle_{Q_z} = \sum_{\{t : j \in \mathbf{y}_t\}} \sum_{\ell}^{L} \delta(w_t = i)\, \delta(y_t^{(\ell)} = j)\, \gamma_{t\ell},$$

where $x_j(i)$ indicates the count for word type $i$ generated from segment $j$. In the outer sum, we consider all $t$ for which $w_t$ could possibly have been drawn from segment $j$. The inner sum goes over all levels of the pyramid. The delta functions take the value one if the enclosed Boolean expression is true and zero otherwise, so we are adding the fractional counts $\gamma_{t\ell}$ only when $w_t = i$ and $y_t^{(\ell)} = j$.

3.2 Prior on segmentations

Maximizing the joint probability $p(\mathbf{w}, \mathbf{y}) = p(\mathbf{w}|\mathbf{y})p(\mathbf{y})$ leaves the term $p(\mathbf{y})$ as a prior on segmentations. This prior can be used to favor segmentations with the desired granularity. Consider a prior of the form $p(\mathbf{y}) = \prod_{\ell=1}^{L} p(\mathbf{y}^{(\ell)} \mid \mathbf{y}^{(\ell-1)})$; for notational convenience, we introduce a base level such that $y_t^{(0)} = t$, where every word is a segmentation point. At every level $\ell > 0$, the prior is a Markov process, $p(\mathbf{y}^{(\ell)} \mid \mathbf{y}^{(\ell-1)}) = \prod_t^T p(y_t^{(\ell)} \mid y_{t-1}^{(\ell)}, \mathbf{y}^{(\ell-1)})$. The constraint $y_t^{(\ell)} \in \{y_{t-1}^{(\ell)}, y_{t-1}^{(\ell)} + 1\}$ ensures a linear segmentation at each level. To enforce hierarchical consistency, each $y_t^{(\ell)}$ can be a segmentation point only if $t$ is also a segmentation point at the lower level $\ell-1$. Zero probability is assigned to segmentations that violate these constraints.

To quantify the prior probability of legal segmentations, assume a set of parameters $d_\ell$, indicating the expected segment duration at each level. If $t$ is a valid potential segmentation point at level $\ell$ (i.e., $y_t^{(\ell-1)} = 1 + y_{t-1}^{(\ell-1)}$), then the prior probability of a segment transition is $r_\ell = d_{\ell-1}/d_\ell$, with $d_0 = 1$. If there are $N_\ell$ segments in level $\ell$ and $M_\ell \geq N_\ell$ segments in level $\ell-1$, then the prior $p(\mathbf{y}^{(\ell)} \mid \mathbf{y}^{(\ell-1)}) = r_\ell^{N_\ell} (1 - r_\ell)^{M_\ell - N_\ell}$, as long as the hierarchical segmentation constraint is obeyed.

For the purposes of inference it will be preferable to have a prior that decomposes over levels and segments. In particular, we do not want to have to commit to a particular segmentation at level $\ell$ before segmenting level $\ell+1$. The above prior can be approximated by replacing $M_\ell$ with its expectation $\langle M_\ell \rangle \approx T/d_{\ell-1}$. Then a single segment ranging from $w_u$ to $w_v$ (inclusive) will contribute $\log r_\ell + \frac{v-u}{d_{\ell-1}} \log(1 - r_\ell)$ to the log of the prior.

4 Inference

This section describes the inference for the segmentation $\mathbf{y}$, the approximate marginals $Q_Z$, and the hyperparameter $\alpha$.

4.1 Dynamic programming for hierarchical segmentation

While the model structure is reminiscent of a factorial hidden Markov model (HMM), there are important differences that prevent the direct application of HMM inference. Hidden Markov models assume that the parameters of the observation likelihood distributions are available directly, while we marginalize them out. This has the effect of introducing dependencies throughout the state space: the segment assignment for each $y_t$ contributes to lexical counts which in turn affect the observation likelihoods for many other $t'$. However, due to the left-to-right nature of segmentation, efficient inference of the optimal hierarchical segmentation (given the marginals $Q_Z$) is still possible.

Let $B^{(\ell)}[u, v]$ represent the log-likelihood of grouping together all contiguous words $w_u \ldots w_{v-1}$ at level $\ell$ of the segmentation hierarchy. Using $\bar{x}_t$ to indicate a vector of zeros with a one at the position of $w_t$, we can express $B$ more formally:

$$B^{(\ell)}[u, v] = \log p_{dcm}\!\Big(\sum_{t=u}^{v} \bar{x}_t \gamma_{t\ell};\ \alpha\Big) + \log r_\ell + \frac{v-u-1}{d_{\ell-1}} \log(1 - r_\ell).$$

The last two terms are from the prior $p(\mathbf{y})$, as explained in Section 3.2.
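The expected-count approximation and the definition of $B^{(\ell)}[u, v]$ can be combined in a short routine. The sketch below reuses dcm_log_likelihood from the previous block and is only an illustration under my own conventions: levels are numbered $1 \dots L$, gamma[t, l] stores $\gamma_{t\ell}$, and r and d hold the per-level transition probabilities and expected durations of Section 3.2 (with d[0] = 1); none of these names come from the paper's implementation.

```python
import numpy as np

def segment_score(words, gamma, level, u, v, W, alpha, r, d):
    """B^(level)[u, v]: score of grouping words w_u .. w_{v-1} into a
    single segment at `level`, including the prior terms.

    words : length-T array of integer word ids in [0, W)
    gamma : (T x (L+1)) array, gamma[t, l] = Q_{z_t}(z_t = l), levels 1..L
    r, d  : length-(L+1) arrays of segment-transition probabilities and
            expected durations, with d[0] = 1 as in Section 3.2
    """
    # Expected fractional counts this span contributes at this level.
    counts = np.zeros(W)
    for t in range(u, v):
        counts[words[t]] += gamma[t, level]
    # dcm_log_likelihood is defined in the sketch above (Eq. 1).
    log_like = dcm_log_likelihood(counts, alpha)
    log_prior = (np.log(r[level])
                 + (v - u - 1) / d[level - 1] * np.log(1.0 - r[level]))
    return log_like + log_prior
```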
The value of $B^{(\ell)}[u, v]$ is computed for all $u$, all $v > u$, and all $\ell$. Next, we compute the log-likelihood of the optimal segmentation, which we write as $A^{(L)}[0, T]$. This matrix can be filled in recursively:

$$A^{(\ell)}[u, v] = \max_{u \le t < v} \; B^{(\ell)}[t, v] + A^{(\ell-1)}[t, v] + A^{(\ell)}[u, t].$$
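Given tables of $B^{(\ell)}[u, v]$, the recurrence can be filled bottom-up over span lengths. The sketch below is my reading of the dynamic program, assuming $A^{(0)}[u, v] = 0$ (nothing remains to be explained below level 1) and $A^{(\ell)}[u, u] = 0$ for empty spans; it returns only the optimal score, and a real implementation would also store back-pointers to recover the segmentation $\hat{\mathbf{y}}$.

```python
import numpy as np

def fill_A(B, T, L):
    """Fill the dynamic-programming table A for hierarchical segmentation.

    B : array of shape (L+1, T+1, T+1); B[l, u, v] is the score of one
        level-l segment spanning words u .. v-1 (levels 1..L), e.g. as
        computed by segment_score above.
    Returns A with A[l, u, v] = best score for segmenting the span [u, v)
    with levels 1..l; the global optimum is A[L, 0, T].
    """
    A = np.zeros((L + 1, T + 1, T + 1))  # level 0 and empty spans stay 0
    for l in range(1, L + 1):
        for span in range(1, T + 1):          # increasing span length v - u
            for u in range(0, T + 1 - span):
                v = u + span
                best = -np.inf
                # t is the start of the final level-l segment within [u, v)
                for t in range(u, v):
                    cand = A[l, u, t] + B[l, t, v] + A[l - 1, t, v]
                    if cand > best:
                        best = cand
                A[l, u, v] = best
    return A
```

With the B tables precomputed, this fill is polynomial in the document length (roughly cubic in T per level), consistent with the claim in Section 1 that the entire space of hierarchical segmentations can be searched in polynomial time.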