--PREPRINT--

Bayesian Agglomerative Clustering with Coalescents

Yee Whye Teh, Gatsby Unit, University College London, ywteh@gatsby.ucl.ac.uk
Hal Daumé III, School of Computing, University of Utah, me@hal3.name
Daniel Roy, CSAIL, MIT, droy@mit.edu

Abstract

We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.

1 Introduction

Hierarchically structured data abound across a wide variety of domains. It is thus not surprising that hierarchical clustering is a traditional mainstay of machine learning [1]. The dominant approach to hierarchical clustering is agglomerative: start with one cluster per datum, and greedily merge pairs until a single cluster remains. Such algorithms are efficient and easy to implement. Their primary limitations--a lack of predictive semantics and a coherent mechanism to deal with missing data--can be addressed by probabilistic models that handle partially observed data, quantify goodness-of-fit, predict on new data, and integrate within more complex models, all in a principled fashion.

Currently there are two main approaches to probabilistic models for hierarchical clustering. The first takes a direct Bayesian approach by defining a prior over trees followed by a distribution over data points conditioned on a tree [2, 3, 4, 5]. MCMC sampling is then used to obtain trees from their posterior distribution given observations. This approach has the advantages and disadvantages of most Bayesian models: averaging over sampled trees can improve predictive capabilities, give confidence estimates for conclusions drawn from the hierarchy, and share statistical strength across the model; but it is also computationally demanding and complex to implement. As a result such models have not found widespread use. [2] has the additional advantage that the distribution induced on the data points is exchangeable, so the model can be coherently extended to new data. The second approach uses a flat mixture model as the underlying probabilistic model and structures the posterior hierarchically [6, 7]. This approach uses an agglomerative procedure to find the tree giving the best posterior approximation, mirroring traditional agglomerative clustering techniques closely and giving efficient and easy-to-implement algorithms. However, because the underlying model has no hierarchical structure, there is no sharing of information across the tree.

We propose a novel class of Bayesian hierarchical clustering models and associated inference algorithms combining the advantages of both probabilistic approaches above. 1) We define a prior and compute the posterior over trees, thus reaping the benefits of a fully Bayesian approach; 2) the distribution over data is hierarchically structured, allowing for sharing of statistical strength; 3) we have efficient and easy-to-implement inference algorithms that construct trees agglomeratively; and 4) the induced distribution over data points is exchangeable. Our model is based on an exchangeable distribution over trees called Kingman's coalescent [8, 9]. Kingman's coalescent is a standard model from population genetics for the genealogy of a set of individuals.
It is obtained by tracing the genealogy backwards in time, noting when lineages coalesce together. We review Kingman's coalescent in Section 2. Our own contribution is in using it as a prior over trees in a hierarchical clustering model (Section 3) and in developing novel inference procedures for this model (Section 4).

Figure 1: (a) Variables describing the n-coalescent. (b) Sample path from a Brownian diffusion coalescent process in 1D; circles are coalescent points. (c) Sample of observed points from the same process in 2D; notice the hierarchically clustered nature of the points.

2 Kingman's coalescent

Kingman's coalescent is a standard model in population genetics describing the common genealogy (ancestral tree) of a set of individuals [8, 9]. In its full form it is a distribution over the genealogy of a countably infinite set of individuals. Like other nonparametric models (e.g. Gaussian and Dirichlet processes), Kingman's coalescent is most easily described and understood in terms of its finite dimensional marginal distributions over the genealogies of n individuals, called n-coalescents. We obtain Kingman's coalescent in the limit as $n \to \infty$.

Consider the genealogy of n individuals alive at the present time $t = 0$. We can trace their ancestry backwards in time to the distant past $t = -\infty$. Assume each individual has one parent (in genetics, haploid organisms), and therefore genealogies of $[n] = \{1, \ldots, n\}$ form a directed forest. In general, at time $t \le 0$, there are $m$ ($1 \le m \le n$) ancestors alive. Identify these ancestors with their corresponding sets $\rho_1, \ldots, \rho_m$ of descendants (we will make this identification throughout the paper). Note that $\pi(t) = \{\rho_1, \ldots, \rho_m\}$ forms a partition of $[n]$, and interpret $t \mapsto \pi(t)$ as a function from $(-\infty, 0]$ to the set of partitions of $[n]$. This function is piecewise constant, left-continuous, monotonic ($s \le t$ implies that $\pi(t)$ is a refinement of $\pi(s)$), and $\pi(0) = \{\{1\}, \ldots, \{n\}\}$ (see Figure 1a). Further, $\pi$ completely and succinctly characterizes the genealogy; we shall henceforth refer to $\pi$ as the genealogy of $[n]$.

Kingman's n-coalescent is simply a distribution over genealogies of $[n]$, or equivalently, over the space of partition-valued functions like $\pi$. More specifically, the n-coalescent is a continuous-time, partition-valued, Markov process which starts at $\{\{1\}, \ldots, \{n\}\}$ at present time $t = 0$ and evolves backwards in time, merging (coalescing) lineages until only one is left. To describe the Markov process in its entirety, it is sufficient to describe the jump process (i.e. the embedded, discrete-time, Markov chain over partitions) and the distribution over coalescent times. Both are straightforward and their simplicity is part of the appeal of Kingman's coalescent.

Let $l_i, r_i$ be the $i$th pair of lineages to coalesce, $t_{n-1} < \cdots < t_1 < t_0 = 0$ be the coalescent times, and $\delta_i = t_{i-1} - t_i > 0$ be the duration between adjacent events (see Figure 1a). Under the n-coalescent, every pair of lineages merges independently with rate 1. Thus the first pair amongst $m$ lineages merges with rate $\binom{m}{2} = \frac{m(m-1)}{2}$. Therefore $\delta_i \sim \mathrm{Exp}\!\left(\binom{n-i+1}{2}\right)$ independently, the pair $l_i, r_i$ is chosen uniformly from among the $\binom{n-i+1}{2}$ pairs of lineages present right after time $t_i$, and with probability one a random draw from the n-coalescent is a binary tree with a single root at $t = -\infty$ and the $n$ individuals at time $t = 0$. The genealogy is given by:

$$\pi(t) = \begin{cases} \{\{1\}, \ldots, \{n\}\} & \text{if } t = 0; \\ \pi(t_{i-1}) - l_i - r_i + (l_i \cup r_i) & \text{if } t = t_i; \\ \pi(t_i) & \text{if } t_{i+1} < t < t_i. \end{cases} \qquad (1)$$

Combining the probabilities of the durations and choices of lineages, the probability of $\pi$ is simply:

$$p(\pi) = \prod_{i=1}^{n-1} \binom{n-i+1}{2} \exp\!\left(-\tbinom{n-i+1}{2}\delta_i\right) \Big/ \binom{n-i+1}{2} = \prod_{i=1}^{n-1} \exp\!\left(-\tbinom{n-i+1}{2}\delta_i\right) \qquad (2)$$

The n-coalescent has some interesting statistical properties [8, 9]. The marginal distribution over tree topologies is uniform and independent of the coalescent times. Secondly, it is infinitely exchangeable: given a genealogy drawn from an n-coalescent, the genealogy of any m contemporary individuals alive at time $t \le 0$ embedded within the genealogy is a draw from the m-coalescent. Thus, taking $n \to \infty$, there is a distribution over genealogies of a countably infinite population for which the marginal distribution of the genealogy of any n individuals gives the n-coalescent. Kingman called this the coalescent.
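To make the sampling description concrete, the following minimal NumPy sketch (ours, not code from the paper; function and variable names are hypothetical) draws a genealogy from the n-coalescent exactly as described above: exponential durations with rate $\binom{m}{2}$ and a uniformly chosen pair of lineages at each event.

```python
import numpy as np

def sample_n_coalescent(n, seed=0):
    """Draw a genealogy pi from Kingman's n-coalescent.

    Returns a list of (t_i, l_i, r_i): the i-th coalescent time and the two
    lineages (represented as frozensets of leaf indices) that merge at it.
    """
    rng = np.random.default_rng(seed)
    lineages = [frozenset([k]) for k in range(n)]
    t, events = 0.0, []
    for i in range(1, n):
        m = len(lineages)                       # m = n - i + 1 lineages remain
        rate = m * (m - 1) / 2.0                # total coalescence rate C(m, 2)
        t -= rng.exponential(1.0 / rate)        # delta_i ~ Exp(C(n-i+1, 2)); t_i = t_{i-1} - delta_i
        a, b = rng.choice(m, size=2, replace=False)   # uniformly chosen pair l_i, r_i
        l, r = lineages[a], lineages[b]
        lineages = [s for j, s in enumerate(lineages) if j not in (a, b)]
        lineages.append(l | r)                  # replace l_i, r_i by their union
        events.append((t, l, r))
    return events

if __name__ == "__main__":
    for t_i, l, r in sample_n_coalescent(5):
        print(f"t = {t_i:+.3f}: merge {sorted(l)} and {sorted(r)}")
```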
3 Hierarchical clustering with coalescents

We take a Bayesian approach to hierarchical clustering, placing a coalescent prior on the latent tree and modeling observed data with a Markov process evolving forward in time along the tree. We will alter our terminology from genealogy to tree, from n individuals at present time to n observed data points, and from individuals on the genealogy to latent variables on the tree-structured distribution.

Let $x_1, \ldots, x_n$ be the $n$ observed data at the leaves of a tree $\pi$ drawn from the n-coalescent. $\pi$ has $n-1$ coalescent points, the $i$th occurring when $l_i$ and $r_i$ merge at time $t_i$ to form $\rho_i = l_i \cup r_i$. Let $t_{l_i}$ and $t_{r_i}$ be the times at which $l_i$ and $r_i$ are themselves formed. We construct a continuous-time Markov process evolving along the tree from the past to the present, branching independently at each coalescent point until we reach time 0, where the n Markov processes induce a distribution over the n data points. The joint distribution respects the conditional independences implied by the structure of the directed tree. Let $y_{\rho_i}$ be a latent variable that takes on the value of the Markov process at $\rho_i$ just before it branches (see Figure 1a). Let $y_{\{i\}} = x_i$ at leaf $i$. To complete the description of the likelihood model, let $q(z)$ be the initial distribution of the Markov process at time $t = -\infty$, and $k_{st}(x, y)$ be the transition probability from state $x$ at time $s$ to state $y$ at time $t$. This Markov process need be neither stationary nor ergodic. Marginalizing over paths of the Markov process, the joint probability over the latent variables and the observations is:

$$p(x, y, z \mid \pi) = q(z)\, k_{-\infty\, t_{n-1}}(z, y_{\rho_{n-1}}) \prod_{i=1}^{n-1} k_{t_i t_{l_i}}(y_{\rho_i}, y_{l_i})\, k_{t_i t_{r_i}}(y_{\rho_i}, y_{r_i}) \qquad (3)$$

Notice that the marginal distributions at each observation $p(x_i \mid \pi)$ are identical and given by the Markov process at time 0. However, they are not independent: they share the same sample path down the Markov process until they split. In fact the amount of dependence between two observations is a function of the time at which the observations coalesce in the past. A more recent coalescent time implies larger dependence. The overall distribution induced on the observations $p(x)$ inherits the infinite exchangeability of the n-coalescent. We considered a Brownian diffusion (see Figures 1(b,c)) and a simple independent-sites mutation process on multinomial vectors (Section 4.3).
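For concreteness, here is a minimal sketch (ours, not the authors' code) of the generative process just described with a Brownian diffusion likelihood, assuming $\Lambda = \sigma^2 I$: a genealogy is sampled from the n-coalescent as in the earlier sketch, and a Gaussian Markov process is then run forward from the root to the leaves, which is how figures like 1(b,c) can be produced. Because the process formally starts at $t = -\infty$, the sketch simply starts the root coalescent point at the origin; that choice, and all names, are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_coalescent_tree(n):
    """Sample a genealogy: node ids 0..n-1 are leaves at time 0 and node
    n+i-1 is the i-th coalescent point at time t_i. Returns (times, children)."""
    times, children, active, t = [0.0] * n, {}, list(range(n)), 0.0
    while len(active) > 1:
        m = len(active)
        t -= rng.exponential(2.0 / (m * (m - 1)))   # delta ~ Exp(C(m, 2))
        a, b = rng.choice(m, size=2, replace=False)
        new = len(times)
        times.append(t)
        children[new] = (active[a], active[b])
        active = [v for j, v in enumerate(active) if j not in (a, b)] + [new]
    return np.array(times), children

def brownian_sample(times, children, sigma2=1.0, dim=2):
    """Run the Brownian diffusion forward (root -> leaves); returns node values."""
    root = len(times) - 1
    y = {root: np.zeros(dim)}          # start the root at 0 (a visualization choice)
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, ()):
            dt = times[child] - times[parent]               # branch length > 0
            y[child] = y[parent] + rng.normal(scale=np.sqrt(sigma2 * dt), size=dim)
            stack.append(child)
    return y

times, children = sample_coalescent_tree(20)
values = brownian_sample(times, children)
x = np.stack([values[i] for i in range(20)])   # observed leaves: hierarchically
print(x.shape)                                 # clustered 2-D points, as in Figure 1(c)
```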
4 Agglomerative sequential Monte Carlo and greedy inference

We develop two classes of efficient and easily implementable inference algorithms for our hierarchical clustering model based on sequential Monte Carlo (SMC) and greedy schemes respectively. In both classes, the latent variables are integrated out, and the trees are constructed in a bottom-up fashion. The full tree can be expressed as a series of $n-1$ coalescent events, ordered backwards in time. The $i$th coalescent event involves the merging of the two subtrees with leaves $l_i$ and $r_i$ and occurs at a time $\delta_i$ before the previous coalescent event. Let $\theta_i = \{\delta_j, l_j, r_j \text{ for } j \le i\}$ denote the first $i$ coalescent events. $\theta_{n-1}$ is equivalent to $\pi$ and we shall use them interchangeably.

We assume that the form of the Markov process is such that the latent variables $\{y_{\rho_i}\}_{i=1}^{n-1}$ and $z$ can be efficiently integrated out using an upward pass of belief propagation on the tree. Let $M_{\rho_i}(y)$ be the message passed from $y_{\rho_i}$ to its parent; $M_{\{i\}}(y) = \delta_{x_i}(y)$ is a point mass at $x_i$ for leaf $i$. $M_{\rho_i}(y)$ is proportional to the likelihood of the observations at the leaves below coalescent event $i$, given that $y_{\rho_i} = y$. Belief propagation computes the messages recursively up the tree; for $i = 1, \ldots, n-1$:

$$M_{\rho_i}(y) = Z_{\rho_i}^{-1}(x, \theta_i) \prod_{b=l_i, r_i} \int k_{t_i t_b}(y, y_b)\, M_b(y_b)\, dy_b \qquad (4)$$

$Z_{\rho_i}(x, \theta_i)$ is a normalization constant introduced to avoid numerical problems. The choice of $Z_{\rho_i}$ does not affect the probability of $x$, but does impact the accuracy and efficiency of our inference algorithms. We found that $Z_{\rho_i}(x, \theta_i) = \int q(y)\, M_{\rho_i}(y)\, dy$ worked well. At the root, we have:

$$Z_{-\infty}(x, \theta_{n-1}) = \int\!\!\int q(z)\, k_{-\infty\, t_{n-1}}(z, y)\, M_{\rho_{n-1}}(y)\, dy\, dz \qquad (5)$$

The marginal probability $p(x \mid \pi)$ is now given by the product of normalization constants:

$$p(x \mid \pi) = Z_{-\infty}(x, \theta_{n-1}) \prod_{i=1}^{n-1} Z_{\rho_i}(x, \theta_i) \qquad (6)$$

Multiplying in the prior (2) over $\pi$, we get the joint probability for the tree $\pi$ and observations $x$:

$$p(x, \pi) = Z_{-\infty}(x, \theta_{n-1}) \prod_{i=1}^{n-1} \exp\!\left(-\tbinom{n-i+1}{2}\delta_i\right) Z_{\rho_i}(x, \theta_i) \qquad (7)$$

Our inference algorithms are based upon (7). Note that each term $Z_{\rho_i}(x, \theta_i)$ can be interpreted as a local likelihood term for coalescing the pair $l_i, r_i$.¹ In general, for each $i$, we choose a duration $\delta_i$ and a pair of subtrees $l_i, r_i$ to coalesce. This choice is based upon the $i$th term in (7), interpreted as the product of a local prior and a local likelihood for choosing $\delta_i$, $l_i$ and $r_i$ given $\theta_{i-1}$.

4.1 Sequential Monte Carlo algorithms

Sequential Monte Carlo algorithms (a.k.a. particle filters) approximate the posterior using a weighted sum of point masses [10]. These point masses are constructed iteratively. At iteration $i-1$, particle $s$ consists of $\theta^s_{i-1} = \{\delta^s_j, l^s_j, r^s_j \text{ for } j < i\}$, and has weight $w^s_{i-1}$. At iteration $i$, particle $s$ is extended by sampling $\delta^s_i$, $l^s_i$ and $r^s_i$ from a proposal distribution $f_i(\delta^s_i, l^s_i, r^s_i \mid \theta^s_{i-1})$, with weights:

$$w^s_i = w^s_{i-1} \exp\!\left(-\tbinom{n-i+1}{2}\delta^s_i\right) Z_{\rho_i}(x, \theta^s_i) \Big/ f_i(\delta^s_i, l^s_i, r^s_i \mid \theta^s_{i-1}) \qquad (8)$$

After $n-1$ iterations, we obtain a set of trees $\theta^s_{n-1}$ and weights $w^s_{n-1}$. The joint distribution is approximated by $p(\pi, x) \approx \sum_s w^s_{n-1} \delta_{\theta^s_{n-1}}(\pi)$, while the posterior is approximated with the weights normalized. An important aspect of SMC is resampling, which places more particles in high probability regions and prunes particles stuck in low probability regions. We resample as in Algorithm 5.1 of [11] when the effective sample size ratio as estimated in [12] falls below one half.
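The structure shared by the proposal variants described next can be summarized in a short sketch (ours, not the authors' code). The proposal and the local likelihood $Z_{\rho_i}$ are passed in as functions, with only trivial stand-ins given here; the weight update implements (8); and resampling is triggered when the effective sample size $1/\sum_s (\tilde w^s)^2$ (a common estimate; the paper cites [12]) drops below $S/2$. Plain multinomial resampling is used for simplicity, whereas the paper uses Algorithm 5.1 of [11].

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def smc_coalescent(x, S, propose, local_likelihood):
    """One SMC sweep over the n-1 coalescent events (structure of Section 4.1).

    propose(theta, i, n) -> (delta, l, r, proposal_density)
    local_likelihood(x, theta, delta, l, r) -> Z_rho_i
    Both are model-specific; only trivial placeholders are shown below.
    """
    n = len(x)
    particles = [[] for _ in range(S)]          # theta^s: list of (delta, l, r)
    logw = np.zeros(S)
    for i in range(1, n):                       # i-th coalescent event
        rate = comb(n - i + 1, 2)
        for s in range(S):
            delta, l, r, q = propose(particles[s], i, n)
            Z = local_likelihood(x, particles[s], delta, l, r)
            particles[s].append((delta, l, r))
            logw[s] += -rate * delta + np.log(Z) - np.log(q)   # weight update (8)
        w = np.exp(logw - logw.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < S / 2.0:      # effective sample size check
            idx = rng.choice(S, size=S, p=w)    # simple multinomial resampling
            particles = [list(particles[j]) for j in idx]
            logw = np.full(S, np.log(1.0 / S))
    return particles, logw

# Trivial placeholders, only to make the skeleton executable:
def propose_prior(theta, i, n):
    rate = comb(n - i + 1, 2)
    delta = rng.exponential(1.0 / rate)
    l, r = rng.choice(n - i + 1, size=2, replace=False)      # indices of remaining subtrees
    q = rate * np.exp(-rate * delta) / comb(n - i + 1, 2)    # Exp density x uniform pair
    return delta, int(l), int(r), q

dummy_Z = lambda x, theta, delta, l, r: 1.0                  # stand-in local likelihood
parts, logw = smc_coalescent(np.zeros((6, 2)), S=10,
                             propose=propose_prior, local_likelihood=dummy_Z)
print(len(parts), len(parts[0]))       # 10 particles, each with n-1 = 5 events
```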
SMC-PriorPrior. The simplest proposal distribution is to sample $\delta^s_i$, $l^s_i$ and $r^s_i$ from the local prior: $\delta^s_i$ is drawn from an exponential with rate $\binom{n-i+1}{2}$, and $l^s_i, r^s_i$ are drawn uniformly from all available pairs. The weight updates (8) reduce to multiplying by $Z_{\rho_i}(x, \theta^s_i)$. This approach is computationally very efficient, but performs badly with many objects due to the uniform draws over pairs.

SMC-PriorPost. The second approach addresses the suboptimal choice of pairs to coalesce. We first draw $\delta^s_i$ from its local prior, then draw $l^s_i, r^s_i$ from the local posterior:

$$f_i(l^s_i, r^s_i \mid \delta^s_i, \theta^s_{i-1}) \propto Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, l^s_i, r^s_i); \qquad w^s_i = w^s_{i-1} \sum_{l,r} Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, l, r) \qquad (9)$$

This approach is more computationally demanding since we need to evaluate the local likelihood of every pair. It also performs significantly better than SMC-PriorPrior. We have found that it works reasonably well for small data sets but fails in larger ones for which the local posterior for $\delta^s_i$ is highly peaked.

SMC-PostPost. The third approach is to draw all of $\delta^s_i$, $l^s_i$ and $r^s_i$ from their posterior:

$$f_i(\delta^s_i, l^s_i, r^s_i \mid \theta^s_{i-1}) \propto \exp\!\left(-\tbinom{n-i+1}{2}\delta^s_i\right) Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, l^s_i, r^s_i);$$
$$w^s_i = w^s_{i-1} \sum_{l,r} \int \exp\!\left(-\tbinom{n-i+1}{2}\delta\right) Z_{\rho_i}(x, \theta^s_{i-1}, \delta, l, r)\, d\delta \qquad (10)$$

This approach requires the fewest particles, but is the most computationally expensive due to the integral for each pair. Fortunately, for the case of the Brownian diffusion process described below, these integrals are tractable and related to generalized inverse Gaussian distributions.

4.2 Greedy algorithms

SMC algorithms are attractive because they produce an arbitrarily accurate approximation to the full posterior. However, in many applications a single good tree is often sufficient. We describe a few greedy algorithms to construct a good tree. Greedy-MaxProb: the obvious greedy algorithm is to pick $\delta_i$, $l_i$ and $r_i$ maximizing the $i$th term in (7). We do so by computing the optimal $\delta_i$ for each pair $l_i, r_i$, and then picking the pair maximizing the $i$th term at its optimal $\delta_i$. Greedy-MinDuration: simply pick the pair to coalesce whose optimal duration is minimum. Both algorithms require recomputing the optimal duration for each pair at each iteration, since the exponential rate $\binom{n-i+1}{2}$ on the duration varies with the iteration $i$. The total computational cost is thus $O(n^3)$. We can avoid this by using the alternative view of the n-coalescent as a Markov process where each pair of lineages coalesces at rate 1. Greedy-Rate1: for each pair $l_i$ and $r_i$ we determine the optimal $\delta_i$, but replacing the $\binom{n-i+1}{2}$ prior rate with 1. We coalesce the pair with most recent time (as in Greedy-MinDuration). This reduces the complexity to $O(n^2)$. We found that all three perform about equally well.

¹ If the Markov process is stationary with equilibrium $q(y)$, $Z_{\rho_i}(x, \theta_i)$ is a likelihood ratio between two models with observations $x_{\rho_i}$: (1) a single tree with leaves $\rho_i$; (2) two independent trees with leaves $l_i$ and $r_i$ respectively. This is similar to [6, 7] and is used later in our NIPS experiment to determine coherent clusters.
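To make the greedy scheme concrete, here is a minimal sketch in the spirit of Greedy-Rate1 for the Brownian-diffusion case of Section 4.3 below, assuming $\Lambda = \sigma^2 I$. It is our illustration, not the authors' implementation: the closed forms follow (11)-(13), the rate-1 optimal duration reuses the derivation behind (12) with the prior rate set to 1, and details such as clamping the merge time and the naive pair search are our own simplifications (caching pair scores, which do not change across iterations here, gives the $O(n^2)$ cost quoted above).

```python
import numpy as np

def greedy_rate1(x, sigma2=1.0):
    """Greedy agglomeration in the spirit of Greedy-Rate1 (Section 4.2) for
    Brownian diffusion with Lambda = sigma2 * I. Returns the merges as
    (time, left-id, right-id); ids 0..n-1 are leaves, n.. are merge nodes."""
    x = np.asarray(x, dtype=float)
    n, D = x.shape
    # per active subtree: (node id, message mean, message variance v, formation time)
    active = [(i, x[i], 0.0, 0.0) for i in range(n)]
    merges, t_prev, next_id = [], 0.0, n

    def optimal_time(a_node, b_node):
        _, ya, va, ta = a_node
        _, yb, vb, tb = b_node
        a = np.sum((ya - yb) ** 2) / sigma2           # Mahalanobis distance, Lambda = sigma2*I
        u = (-D + np.sqrt(D ** 2 + 4.0 * a)) / 2.0    # optimal v_l+v_r+t_l+t_r-2t under rate 1
        return (va + vb + ta + tb - u) / 2.0          # optimal merge time t*

    while len(active) > 1:
        # pick the pair whose optimal merge time is most recent (largest t*);
        # recomputing all pairs here is O(n^3) but keeps the sketch short
        pairs = [(optimal_time(active[i], active[j]), i, j)
                 for i in range(len(active)) for j in range(i + 1, len(active))]
        t_star, i, j = max(pairs)
        t_new = min(t_star, t_prev)                   # keep coalescent times decreasing (our choice)
        (ida, ya, va, ta), (idb, yb, vb, tb) = active[i], active[j]
        da, db = va + ta - t_new, vb + tb - t_new     # branch-adjusted variances
        v_new = 1.0 / (1.0 / da + 1.0 / db)           # message update, cf. (13)
        y_new = v_new * (ya / da + yb / db)
        active = [p for k, p in enumerate(active) if k not in (i, j)]
        active.append((next_id, y_new, v_new, t_new))
        merges.append((t_new, ida, idb))
        t_prev, next_id = t_new, next_id + 1
    return merges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-3, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
    for t, a, b in greedy_rate1(data):
        print(f"t = {t:+.3f}: merge node {a} with node {b}")
```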
4.3 Examples

Brownian diffusion. Consider the case of continuous data evolving via Brownian diffusion. The transition kernel $k_{st}(y, \cdot)$ is a Gaussian centred at $y$ with covariance $(t - s)\Lambda$, where $\Lambda$ is a symmetric positive definite covariance matrix. Because the joint distribution (3) over $x$, $y$ and $z$ is Gaussian, we can express each message $M_{\rho_i}(y)$ as a Gaussian with mean $\hat y_{\rho_i}$ and variance $v_{\rho_i}\Lambda$. The local likelihood is:

$$Z_{\rho_i}(x, \theta_i) = |2\pi \hat\Lambda_{\rho_i}|^{-\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}\|\hat y_{l_i} - \hat y_{r_i}\|^2_{\hat\Lambda_{\rho_i}}\right); \qquad \hat\Lambda_{\rho_i} = (v_{l_i} + v_{r_i} + t_{l_i} + t_{r_i} - 2t_i)\Lambda \qquad (11)$$

where $\|x\|^2_\Sigma = x^\top \Sigma^{-1} x$ is the Mahalanobis norm and $D$ is the dimensionality. The optimal duration $\delta_i$ can also be solved for:

$$\delta_i = \frac{\sqrt{4\binom{n-i+1}{2}\|\hat y_{l_i} - \hat y_{r_i}\|^2_\Lambda + D^2} - D}{4\binom{n-i+1}{2}} - \frac{v_{l_i} + v_{r_i} + t_{l_i} + t_{r_i} - 2t_{i-1}}{2} \qquad (12)$$

The message at the newly coalesced point has mean and variance:

$$v_{\rho_i} = \left[(v_{l_i} + t_{l_i} - t_i)^{-1} + (v_{r_i} + t_{r_i} - t_i)^{-1}\right]^{-1}; \qquad \hat y_{\rho_i} = v_{\rho_i}\left(\frac{\hat y_{l_i}}{v_{l_i} + t_{l_i} - t_i} + \frac{\hat y_{r_i}}{v_{r_i} + t_{r_i} - t_i}\right) \qquad (13)$$

Multinomial vectors. Consider a Markov process acting on multinomial vectors with each entry taking one of $K$ values and evolving independently. Entry $d$ evolves at rate $\lambda_d$ and has equilibrium distribution vector $q_d$. The transition rate matrix is $Q_d = \lambda_d(q_d \mathbf{1}_K^\top - I_K)$, where $\mathbf{1}_K$ is a vector of $K$ ones and $I_K$ is the identity matrix of size $K$, while the transition probability matrix for entry $d$ over a time interval of length $t$ is $e^{Q_d t} = e^{-\lambda_d t} I_K + (1 - e^{-\lambda_d t}) q_d \mathbf{1}_K^\top$. Representing the message for entry $d$ from $\rho_i$ to its parent as a vector $M^d_{\rho_i} = [M^d_{\rho_i 1}, \ldots, M^d_{\rho_i K}]^\top$, normalized so that $q_d \cdot M^d_{\rho_i} = 1$, the local likelihood terms and messages are computed, for each entry $d$, as:

$$Z^d_{\rho_i}(x, \theta_i) = 1 - e^{\lambda_d(2t_i - t_{l_i} - t_{r_i})}\left(1 - \sum_{k=1}^{K} q_{dk} M^d_{l_i k} M^d_{r_i k}\right) \qquad (14)$$
$$M^d_{\rho_i} = \left(1 - e^{\lambda_d(t_i - t_{l_i})}(1 - M^d_{l_i})\right)\left(1 - e^{\lambda_d(t_i - t_{r_i})}(1 - M^d_{r_i})\right)\Big/ Z^d_{\rho_i}(x, \theta_i) \qquad (15)$$

with the products in (15) taken elementwise and the overall local likelihood given by $Z_{\rho_i}(x, \theta_i) = \prod_d Z^d_{\rho_i}(x, \theta_i)$. Unfortunately the optimal $\delta_i$ cannot be solved for analytically and we use Newton steps to compute it.
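The following small NumPy check (ours, with hypothetical names and made-up inputs) evaluates (14) and (15) for a single coalescent event and a single entry $d$; child messages are normalized so that $q_d \cdot M = 1$, and the resulting parent message satisfies the same normalization by construction.

```python
import numpy as np

def multinomial_coalesce(M_l, M_r, t_l, t_r, t_i, q_d, lam_d):
    """Local likelihood (14) and parent message (15) for one entry d of the
    independent-sites mutation process (a sketch, not the authors' code)."""
    a = np.exp(lam_d * (t_i - t_l))          # e^{lambda_d (t_i - t_l)}; t_i < t_l, so a < 1
    b = np.exp(lam_d * (t_i - t_r))
    Z = 1.0 - a * b * (1.0 - np.sum(q_d * M_l * M_r))           # eq. (14)
    M = (1.0 - a * (1.0 - M_l)) * (1.0 - b * (1.0 - M_r)) / Z   # eq. (15), elementwise
    return Z, M

if __name__ == "__main__":
    K = 3
    q_d = np.array([0.5, 0.3, 0.2])          # equilibrium distribution of entry d
    # leaf messages: point masses on the observed values, scaled so q_d . M = 1
    M_l = np.eye(K)[0] / q_d[0]
    M_r = np.eye(K)[2] / q_d[2]
    Z, M = multinomial_coalesce(M_l, M_r, t_l=0.0, t_r=0.0, t_i=-0.7,
                                q_d=q_d, lam_d=1.0)
    print(round(Z, 4), np.round(M, 4), round(float(q_d @ M), 4))  # last value is 1.0
```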
4.4 Hyperparameter estimation and predictive density

We perform hyperparameter estimation by iterating between estimating a genealogy and re-estimating the hyperparameters conditioned on this tree. Space precludes a detailed discussion of the algorithms we use; they can be found in the supplemental material. In the Brownian case, we place an inverse Wishart prior on $\Lambda$ and the MAP posterior $\hat\Lambda$ is available in a standard closed form. In the multinomial case, the updates are not available analytically and must be solved iteratively.

Given a tree and a new individual $y$, we wish to know: (a) where $y$ might coalesce and (b) what the density is at $y$. In the supplemental material, we show that the probability that $y$ merges at time $t$ with a given sibling is available in closed form for the Brownian motion case. To obtain the density, we sum over all possible siblings and integrate out $t$ by drawing equally spaced samples.

5 Experiments

Synthetic Data Sets. In Figure 2 we compare the various SMC algorithms and Greedy-Rate1 on a range of synthetic data sets drawn from the Brownian diffusion coalescent process itself ($\Lambda = I_D$) to investigate the effects of various parameters on the efficacy of the algorithms. Generally SMC-PostPost performed best, followed by SMC-PriorPost, SMC-PriorPrior and Greedy-Rate1. With increasing $D$ the amount of data given to the algorithms increases and all algorithms do better, especially Greedy-Rate1. This is because the posterior becomes concentrated and the Greedy-Rate1 approximation corresponds well with the posterior. As $n$ increases, the amount of data increases as well and all algorithms perform better. However, the posterior space also increases and SMC-PriorPrior, which simply samples from the prior over genealogies, does not improve as much. We see this effect as well when $S$ is small. As $S$ increases all SMC algorithms improve. Finally, the algorithms were surprisingly robust when there is a mismatch between the generated data sets' $\Lambda$ and the $\Lambda$ used by the model. We expected all models to perform worse, with SMC-PostPost best able to maintain its performance (though this is possibly due to our experimental setup). We found in unreported experiments that the greedy algorithms worked about equally well. Each panel of Figure 2 was generated from independent runs; data set variance affected all algorithms, varying overall performance across panels, but trends within each panel are still valid as they are based on the same data.

Figure 2: Predictive performance of algorithms as we vary (a) the number of dimensions $D$, (b) the number of observations $n$, (c) the mutation rate (with $\Lambda$ proportional to $I_D$), and (d) the number of samples $S$. In each panel the other parameters are fixed to their middle values from the other panels (we used $S = 50$), and we report log predictive probabilities on one unobserved entry, averaged over 100 runs.

Table 1: Comparative results. Numbers are averages and standard errors over 50 and 20 repeats.

            MNIST                                       SPAMBASE
            Avg-link     BHC          Coalescent        Avg-link     BHC          Coalescent
  Purity    .363 ± .004  .392 ± .006  .412 ± .006       .616 ± .007  .711 ± .010  .689 ± .008
  Subtree   .581 ± .005  .579 ± .005  .610 ± .005       .607 ± .011  .549 ± .015  .661 ± .012
  LOO-acc   .755 ± .005  .763 ± .005  .773 ± .005       .846 ± .010  .832 ± .010  .861 ± .008

MNIST and SPAMBASE. We compare the performance of our approach (Greedy-Rate1 with 10 iterations of hyperparameter update) to two other hierarchical clustering algorithms: average-link agglomerative clustering and Bayesian hierarchical clustering [6]. In MNIST, we use 10 digits from the MNIST data set, 20 exemplars for each digit and 20 dimensions (reduced via PCA), repeating the experiment 50 times. In SPAMBASE, we use 100 examples of 57 attributes each from 2 classes, repeating 20 times. We present purity scores [6], subtree scores (#{interior nodes with all leaves of the same class}/(n - #classes)) and leave-one-out accuracies (all scores are between 0 and 1; higher is better). The results are in Table 1; as we can see, except for purity on SPAMBASE, ours gives the best performance. Experiments not presented here show that all greedy algorithms perform about the same and that performance improves with hyperparameter updates.

Phylolinguistics. We apply our approach (Greedy-Rate1) to a phylolinguistic problem: language evolution. Unlike previous research [13], which studies only phonological data, we use a full typological database of 139 binary features over 2150 languages: the World Atlas of Language Structures (henceforth, "WALS") [14]. The data is sparse: about 84% of the entries are unknown. We use the same version of the database as extracted by [15]. Based on the Indo-European subset of this data for which at most 30 features are unknown (48 languages total), we recover the coalescent tree shown in Figure 3(a). Each language is shown with its genus, allowing us to observe that the tree teases apart Germanic and Romance languages, but makes a few errors with respect to Iranian and Greek. (In the supplemental material, we report results applied to a wider range of languages.) Next, we compare predictive abilities to other algorithms. We take a subset of WALS and test on 5% of withheld entries, restoring these with various techniques: Greedy-Rate1; nearest neighbors (use the value from the nearest observed neighbor); average-linkage (nearest neighbor in the tree); and probabilistic PCA (latent dimensions in {5, 10, 20, 40}, chosen optimistically).
We use five subsets of the WALS database of varying size, obtained by sorting both the languages and the features of the database according to how many cells are observed, and then using a varying percentage (10%-50%) of the densest portion. The results are in Figure 3(b). The performance of PPCA is steady around 76%. The performance of the other algorithms degrades as the sparsity increases. Our approach performs at least as well as all the other techniques, except at the two extremes.

Figure 3: Results of the phylolinguistics experiments. (a) Coalescent for a subset of Indo-European languages from WALS; each leaf is labeled with its language and genus (Celtic, Germanic, Romance, Slavic, Baltic, Indic, Iranian, Greek, Albanian, Armenian). (b) Data restoration on WALS. Y-axis is accuracy; X-axis is the percentage of the data set used in the experiments. At 10%, there are N = 215 languages, H = 14 features and p = 94% observed data; at 20%, N = 430, H = 28 and p = 80%; at 30%, N = 645, H = 42 and p = 66%; at 40%, N = 860, H = 56 and p = 53%; at 50%, N = 1075, H = 70 and p = 43%. Results are averaged over five folds with a different 5% hidden each time. (We also tried a "mode" prediction, but its performance is in the 60% range in all cases, and is not depicted.)

NIPS. We applied Greedy-Rate1 to all NIPS abstracts through NIPS12 (1740 in total). The data was preprocessed so that only words occurring in at least 100 abstracts were retained. The word counts were then converted to binary. We performed one iteration of hyperparameter re-estimation. In the supplemental material, we depict the top levels of the coalescent tree. Here, we use the tree to generate a flat clustering. To do so, we use the log likelihood ratio at each branch in the coalescent to determine whether a split should occur: if the log likelihood ratio is greater than zero, we break the branch; otherwise, we recurse down.
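The following recursion sketches one possible reading of this cutting rule (it is our illustration, not the paper's code; the node ids, the llr table, and the example values are made up): a branch whose log likelihood ratio favours a single tree (LLR > 0, cf. footnote 1) is treated as a coherent cluster and cut there, and otherwise we recurse into its two subtrees.

```python
def flat_clusters(node, llr, children, leaves):
    """Cut a coalescent tree into a flat clustering.

    node: current node id; llr[node]: log likelihood ratio (e.g. log Z, footnote 1);
    children[node]: (left, right) for internal nodes; leaves[node]: list of leaf ids.
    """
    if node not in children:                 # a single leaf is trivially a cluster
        return [leaves[node]]
    if llr.get(node, 0.0) > 0.0:             # coherent: stop splitting here
        return [leaves[node]]
    left, right = children[node]
    return (flat_clusters(left, llr, children, leaves)
            + flat_clusters(right, llr, children, leaves))

# Tiny hand-made example (values are illustrative only):
children = {6: (4, 5), 4: (0, 1), 5: (2, 3)}
leaves = {0: [0], 1: [1], 2: [2], 3: [3], 4: [0, 1], 5: [2, 3], 6: [0, 1, 2, 3]}
llr = {6: -5.0, 4: 2.3, 5: 0.1}
print(flat_clusters(6, llr, children, leaves))   # -> [[0, 1], [2, 3]]
```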
On the NIPS abstracts, this leads to nine clusters, depicted in Table 2.

Table 2: Nine clusters discovered in NIPS abstracts data.

  LLR   (t)      Top Words                                       Top Authors
  32.7  (-2.71)  bifurcation attractors hopfield network saddle  Mjolsness (9) Saad (9) Ruppin (8) Coolen (7)
  0.106 (-3.77)  voltage model cells neurons neuron              Koch (30) Sejnowski (22) Bower (11) Dayan (10)
  83.8  (-2.02)  chip circuit voltage vlsi transistor            Koch (12) Alspector (6) Lazzaro (6) Murray (6)
  140.0 (-2.43)  spike ocular cells firing stimulus              Sejnowski (22) Koch (18) Bower (11) Dayan (10)
  2.48  (-3.66)  data model learning algorithm training          Jordan (17) Hinton (16) Williams (14) Tresp (13)
  31.3  (-2.76)  infomax image ica images kurtosis               Hinton (12) Sejnowski (10) Amari (7) Zemel (7)
  31.6  (-2.83)  data training regression learning model         Jordan (16) Tresp (13) Smola (11) Moody (10)
  39.5  (-2.46)  critic policy reinforcement agent controller    Singh (15) Barto (10) Sutton (8) Sanger (7)
  23.0  (-3.03)  network training units hidden input             Mozer (14) Lippmann (11) Giles (10) Bengio (9)

Note that clusters two and three are quite similar--had we used a slightly higher log likelihood ratio, they would have been merged (the LLR for cluster 2 was only 0.105). Note also that the clustering is able to tease apart Bayesian learning (cluster 5) and non-Bayesian learning (cluster 7)--both of which have Mike Jordan as their top author!

6 Discussion

We described a new model for Bayesian agglomerative clustering. We used Kingman's coalescent as our prior over trees, and derived efficient and easily implementable greedy and SMC inference algorithms for the model. We showed empirically that our model gives better performance than other agglomerative clustering algorithms, and gives good results on applications to document modeling and phylolinguistics.

Our model is most similar in spirit to the Dirichlet diffusion tree of [2]. Both use infinitely exchangeable priors over trees. While [2] uses a fragmentation process for trees, our prior uses the reverse--a coalescent process instead. This allows us to develop simpler inference algorithms than those in [2], though it will be interesting to consider the possibility of developing analogous algorithms for [2]. [3] also describes a hierarchical clustering model involving a prior over trees, but his prior is not infinitely exchangeable. [5] uses tree-consistent partitions to model relational data; it would be interesting to apply our approach to their setting. Another related work is the Bayesian hierarchical clustering of [6], which uses an agglomerative procedure returning a tree-structured approximate posterior for a Dirichlet process mixture model. As opposed to our work, [6] uses a flat mixture model and does not have a notion of distributions over trees.

There are a number of unresolved issues with our work. Firstly, our algorithms take $O(n^3)$ computation time, except for Greedy-Rate1 which takes $O(n^2)$ time. Among the greedy algorithms we see no discernible differences in quality of approximation, thus we recommend Greedy-Rate1. It would be interesting to develop SMC algorithms with $O(n^2)$ runtime. Secondly, there are unanswered statistical questions. For example, since our prior is infinitely exchangeable, by de Finetti's theorem there is an underlying random distribution for which our observations are i.i.d. draws. What is this underlying random distribution, and what do samples from it look like? We know the answer for at least a simple case: if the Markov process is a mutation process with mutation rate $\alpha/2$ and new states are drawn i.i.d. from a base distribution $H$, then the induced distribution is a Dirichlet process $\mathrm{DP}(\alpha, H)$ [8]. Another issue is that of consistency--does the posterior over random distributions converge to the true distribution as the number of observations grows? Finally, it would be interesting to generalize our approach to varying mutation rates, and to non-binary trees by using generalizations of Kingman's coalescent called $\Lambda$-coalescents [16].

References

[1] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons, New York, 1973.
[2] R. M. Neal. Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Department of Statistics, University of Toronto, 2001.
[3] C. K. I. Williams. A MCMC approach to hierarchical mixture modelling.
In Advances in Neural Information Processing Systems, volume 12, 2000.
[4] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum. Semi-supervised learning with trees. In Advances in Neural Information Processing Systems, volume 16, 2004.
[5] D. M. Roy, C. Kemp, V. Mansinghka, and J. B. Tenenbaum. Learning annotated hierarchies from relational data. In Advances in Neural Information Processing Systems, volume 19, 2007.
[6] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, volume 22, 2005.
[7] N. Friedman. Pcluster: Probabilistic agglomerative clustering of gene expression profiles. Technical Report 2003-80, Hebrew University, 2003.
[8] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. Essays in Statistical Science.
[9] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235–248, 1982.
[10] A. Doucet, N. de Freitas, and N. J. Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer-Verlag, May 2001.
[11] P. Fearnhead. Sequential Monte Carlo Method in Filter Theory. PhD thesis, Merton College, University of Oxford, 1998.
[12] R. M. Neal. Annealed importance sampling. Technical Report 9805, Department of Statistics, University of Toronto, 1998.
[13] A. McMahon and R. McMahon. Language Classification by Numbers. Oxford University Press, 2005.
[14] M. Haspelmath, M. Dryer, D. Gil, and B. Comrie, editors. The World Atlas of Language Structures. Oxford University Press, 2005.
[15] H. Daumé III and L. Campbell. A Bayesian model for discovering typological implications. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2007.
[16] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.