The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data

Julia E. Vogt (Computer Science Department, University of Basel, Basel, Switzerland)  JULIA.VOGT@UNIBAS.CH
Sandhya Prabhakaran (Computer Science Department, University of Basel, Basel, Switzerland)  SANDHYA.PRABHAKARAN@UNIBAS.CH
Thomas J. Fuchs (Department of Computer Science, ETH Zurich & Competence Center for Systems Physiology and Metabolic Diseases, Zurich, Switzerland)  THOMAS.FUCHS@INF.ETHZ.CH
Volker Roth (Computer Science Department, University of Basel, Basel, Switzerland)  VOLKER.ROTH@UNIBAS.CH

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We present a probabilistic model for clustering objects represented via pairwise dissimilarities. We propose that, even if an underlying vectorial representation exists, it is better to work directly with the dissimilarity matrix, thereby avoiding unnecessary bias and variance caused by embeddings. By using a Dirichlet process prior we are not obliged to fix the number of clusters in advance. Furthermore, our clustering model is permutation-, scale- and translation-invariant; we call it the Translation-invariant Wishart-Dirichlet (TIWD) process. A highly efficient MCMC sampling algorithm is presented. Experiments show that the TIWD process exhibits several advantages over competing approaches.

1. Introduction

The Bayesian clustering approach presented in this work aims at identifying subsets (or "clusters") of objects represented as columns/rows of a dissimilarity matrix. The underlying idea is that objects grouped together in such a cluster can be reasonably well described as a homogeneous sub-population. Our focus on dissimilarity matrices implies that we do not have access to a vectorial representation of the objects. Such an underlying vectorial representation may or may not exist, depending on whether the dissimilarity matrix can be embedded (without distortion) in a vector space. One way of dealing with such problems would be to explicitly construct a Euclidean embedding (or possibly a distorted embedding) and to apply a traditional clustering method in the Euclidean space. We argue, however, that even under the assumption that a Euclidean embedding exists it is better not to embed the data, since any such choice might induce unnecessary bias and variance in the clustering process. Technically speaking, such embeddings break the symmetry induced by translation- and rotation-invariance, a symmetry which reflects the information loss incurred when moving from vectors to pairwise dissimilarities.

We propose a clustering model which works directly on dissimilarity matrices. It is invariant against label and object permutations and against scale transformations. The model is fully probabilistic in nature, which means that on output we are given samples from a distribution over partitions. Further, the use of a Dirichlet process prior unburdens the user from explicitly fixing the number of clusters. We present a highly efficient sampling algorithm which avoids costly matrix operations by carefully exploiting the structure of the clustering problem. Invariance against label permutations is a common cause of the so-called "label switching" problem in mixture models (Jasra et al., 2005). By formulating the model as a partition process this switching problem is circumvented.
This paper is structured as follows: we start with a review of the Dirichlet cluster process for Gaussian mixtures in (McCullagh & Yang, 2008). This model is generalized to relational data by enforcing translation invariance. We call this new model the Translation-invariant Wishart-Dirichlet (WD) cluster process. We then develop an efficient sampling algorithm which makes it possible to apply the method to large-scale datasets.

2. Gauss-Dirichlet Cluster Process

Let [n] := {1, ..., n} denote an index set, and B_n the set of partitions of [n]. A partition B ∈ B_n is an equivalence relation B : [n] × [n] → {0, 1} that may be represented in matrix form as B(i, j) = 1 if y(i) = y(j) and B(i, j) = 0 otherwise, with y being a function that maps [n] to some label set L. Alternatively, B may be represented as a set of disjoint non-empty subsets called "blocks" b. A partition process is a series of distributions P_n on the sets B_n in which P_n is the marginal distribution of P_{n+1}. Such a process is called exchangeable if each P_n is invariant under permutations of object indices, see (Pitman, 2006). A Gauss-Dirichlet cluster process consists of an infinite sequence of points in R^d, together with a random partition of integers into k blocks. A sequence of length n can be sampled as follows (MacEachern, 1994; Dahl, 2005; McCullagh & Yang, 2008): fix the number of mixture modes k, generate mixing proportions π = (π_1, ..., π_k) from an exchangeable Dirichlet distribution Dir(ξ/k, ..., ξ/k), generate a label sequence {y(1), ..., y(n)} from a multinomial distribution, and forget the labels, introducing the random partition B of [n] induced by y. Integrating out π, one arrives at a Dirichlet-Multinomial prior over partitions

P_n(B \mid \xi, k) = \frac{k!}{(k - k_B)!} \cdot \frac{\Gamma(\xi) \prod_{b \in B} \Gamma(n_b + \xi/k)}{\Gamma(n + \xi)\,[\Gamma(\xi/k)]^{k_B}},    (1)

where k_B ≤ k denotes the number of blocks present in the partition B and n_b is the size of block b. The limit as k → ∞ is well defined and known as the Ewens process (a.k.a. Chinese Restaurant process), see for instance (Ewens, 1972; Neal, 2000; Blei & Jordan, 2006).

Given such a partition B, a sequence of n-dimensional observations x_i ∈ R^n, i = 1, ..., d, is arranged as columns of the (n × d) matrix X, and this X is generated from a zero-mean Gaussian distribution with covariance matrix

\Sigma_B = I_n \otimes \Sigma_0 + B \otimes \Sigma_1, \quad \text{with} \quad \mathrm{cov}(X_{ir}, X_{js} \mid B) = \delta_{ij}\,\Sigma_{0,rs} + B_{ij}\,\Sigma_{1,rs},    (2)

where Σ_0 is the usual (d × d) "pooled" within-class covariance matrix and Σ_1 the (d × d) between-class matrix, respectively, and δ_ij denotes the Kronecker symbol. Since the partition process is invariant under permutations, we can always think of B being block-diagonal. For spherical covariance matrices (i.e. scaled identity matrices), Σ_0 = αI_d and Σ_1 = βI_d, the covariance structure reduces to

\Sigma_B = I_n \otimes \alpha I_d + B \otimes \beta I_d = (\alpha I_n + \beta B) \otimes I_d =: \Sigma_B \otimes I_d, \quad \text{with} \quad \mathrm{cov}(X_{ir}, X_{js} \mid B) = (\alpha \delta_{ij} + \beta B_{ij})\,\delta_{rs}.    (3)

Thus, the columns of X are independent n-dimensional vectors x_i ∈ R^n distributed according to a normal distribution with covariance matrix Σ_B = αI_n + βB. Further, the distribution factorizes over the blocks b ∈ B. Introducing the symbol i_b := {i : i ∈ b} defining an index-vector of all objects assigned to block b, the joint distribution reads

p(X, B \mid \xi, \alpha, \beta, k) = P_n(B \mid \xi, k) \cdot \prod_{b \in B} \prod_{j=1}^{d} N\!\left(X_{i_b j} \,\middle|\, \alpha I_{n_b} + \beta\, 1_{n_b} 1_{n_b}^t\right),    (4)

where n_b is the size of block b and 1_{n_b} is an n_b-vector of ones. In the following we will use the abbreviations 1_b := 1_{n_b} and I_b := I_{n_b} to avoid double subscripts. Note that this distribution is expressed in terms of the partition without resorting to labels, so label switching cannot occur.
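To make the generative view concrete, the following minimal NumPy sketch samples a partition from the Dirichlet-Multinomial prior (1) and then a data matrix X block-wise according to (4), before forming the inner-product and distance matrices used later. It is an illustration under the symbol choices of this text (ξ, α, β), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_partition(n, k, xi):
    """Sample a partition of [n] from the Dirichlet-Multinomial prior (1):
    draw mixing proportions from Dir(xi/k, ..., xi/k), draw labels, then forget them."""
    pi = rng.dirichlet(np.full(k, xi / k))
    labels = rng.choice(k, size=n, p=pi)
    B = (labels[:, None] == labels[None, :]).astype(float)   # B(i,j) = 1 iff y(i) = y(j)
    blocks = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    return B, blocks

def sample_X(blocks, n, d, alpha, beta):
    """Sample X (n x d) with i.i.d. columns and cov = alpha*I_{n_b} + beta*1 1' per block, cf. (4)."""
    X = np.zeros((n, d))
    for b in blocks:                                         # the distribution factorizes over blocks
        nb = len(b)
        cov = alpha * np.eye(nb) + beta * np.ones((nb, nb))
        X[b, :] = rng.multivariate_normal(np.zeros(nb), cov, size=d).T
    return X

B, blocks = sample_partition(n=200, k=10, xi=5.0)
X = sample_X(blocks, n=200, d=50, alpha=1.0, beta=2.0)
S = X @ X.T / X.shape[1]                                     # inner-product matrix S = X X' / d
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S        # squared Euclidean distances
```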
3. Wishart-Dirichlet Cluster Process

We now extend the Gauss-Dirichlet cluster process to a sequence of inner-product and distance matrices. Assume that the random matrix X_{n×d} follows the zero-mean Gaussian distribution specified in (2), with Σ_0 = αI_d and Σ_1 = βI_d. Then, conditioned on the partition B, the inner-product matrix S = XX^t/d follows a (possibly singular) Wishart distribution in d degrees of freedom, S ~ W_d(Σ_B), see (Srivastava, 2003). If we directly observe the dot products S, it suffices to consider the conditional probability of partitions, P_n(B|S), which has the same functional form for ordinary and singular Wishart distributions:

P_n(B \mid S, \alpha, \beta, \xi, k) \;\propto\; W_d(S \mid \Sigma_B) \cdot P_n(B \mid \xi, k) \;\propto\; |\Sigma_B|^{-\frac{d}{2}} \exp\!\left(-\tfrac{d}{2}\,\mathrm{tr}(\Sigma_B^{-1} S)\right) \cdot P_n(B \mid \xi, k).    (5)

For the following derivation it is suitable to re-parametrize the model in terms of (θ, α) instead of (α, β), where θ := β/α, and in terms of W := Σ_B^{-1}. Due to the block structure of B, P_n(B|S, ·) factorizes over the blocks b ∈ B:

P_n(B \mid S, \theta, \alpha, \xi, k) \;\propto\; P_n(B \mid \xi, k) \cdot \prod_{b \in B} |W_b|^{\frac{d}{2}} \cdot \exp\!\left(-\tfrac{d}{2} \sum_{b \in B} \mathrm{tr}(W_b S_{bb})\right),    (6)

where W_b and S_bb denote the submatrices corresponding to the b-th diagonal block in Σ_B or W, see Figure 1.

Figure 1. Example of the block structure of Σ_B and W (left) and the definition of the sub-matrices in S and D (right) for k_B = 3. [Schematic showing the diagonal blocks of Σ_B and W and the sub-matrices S_ab and D_ab, a, b = 1, 2, 3.]

The above factorization property can be exploited to derive an efficient inference algorithm for this model. The key observation is that the inverse matrix W_b = Σ_b^{-1} can be computed analytically as

W_b = \big(\alpha (I_b + \theta\, 1_b 1_b^t)\big)^{-1} = \tfrac{1}{\alpha}\left(I_b - \tfrac{\theta}{1 + \theta n_b}\, 1_b 1_b^t\right).    (7)

Thus, the contribution of block b to the trace is

\mathrm{tr}(W_b S_{bb}) = \tfrac{1}{\alpha}\left(\mathrm{tr}(S_{bb}) - \tfrac{\theta}{1 + \theta n_b}\, \bar{S}_{bb}\right),    (8)

where S̄_bb = 1_b^t S_bb 1_b denotes the sum of the b-th diagonal block of S. A similar trick can be used for the determinant, which is the product of the eigenvalues: the k_B smallest eigenvalues of W are given by λ_b = α^{-1}(1 + θn_b)^{-1}, and the remaining n − k_B eigenvalues are equal to α^{-1}. Thus, the determinant reads

|W| = \alpha^{-(n - k_B)} \prod_{b \in B} \lambda_b = \alpha^{-n} \prod_{b \in B} (1 + \theta n_b)^{-1}.    (9)

3.1. Scale Invariance

Note that the re-parametrization using (θ, α) leads to a new semantics of (1/α) as a scale parameter: if we exclude α from the partition-dependent terms in the product over the blocks in (9), the conditional for the partition becomes

P_n(B \mid \cdot) \;\propto\; P_n(B \mid \xi, k) \cdot \Big[\prod_{b \in B} (1 + \theta n_b)^{-1}\Big]^{\frac{d}{2}} \cdot \exp\!\left(-\tfrac{d}{2} \cdot \tfrac{1}{\alpha} \sum_{b \in B} \mathrm{tr}\big(W_b^{(\alpha=1)} S_{bb}\big)\right).    (10)

Note that (1/α) simply rescales the observed matrix S, and we can make the model scale-invariant by introducing a prior distribution and integrating out α. The conditional posterior for α follows an inverse Gamma distribution,

p(\alpha \mid r, s) = \frac{s^r}{\Gamma(r)} \left(\tfrac{1}{\alpha}\right)^{r+1} \exp\!\left(-\tfrac{s}{\alpha}\right),    (11)

with shape parameter r = nd/2 − 1 and scale s = (d/2)(tr(S) − Σ_{b∈B} θ/(1+θn_b) S̄_bb), cf. eqs. (8) and (10). Using an inverse Gamma prior with parameters r_0, s_0, the posterior is of the same functional form with r_p = r + r_0 + 1 and s_p = s + s_0, and we can integrate out α analytically. Dropping all terms independent of the partition structure, we arrive at

P_n(B \mid \cdot) \;\propto\; P_n(B \mid \xi, k)\, |W|_{(\alpha=1)}^{\,d/2}\, (s + s_0)^{-(r + r_0 + 1)},    (12)

where |W|_{(α=1)} = Π_{b∈B} (1 + θn_b)^{-1} follows from (9).
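The closed forms (7)-(9) are what make inference cheap: the trace and the determinant reduce to per-block scalars. The following sketch checks this reduction against direct linear algebra on a toy partition (block sizes and parameter values are arbitrary and purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy partition of n = 6 objects into blocks of sizes 3, 2, 1.
blocks = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
n, alpha, theta = 6, 1.5, 2.0
B = np.zeros((n, n))
for b in blocks:
    B[np.ix_(b, b)] = 1.0
Sigma_B = alpha * (np.eye(n) + theta * B)                    # Sigma_B = alpha*I + beta*B, theta = beta/alpha
X = rng.multivariate_normal(np.zeros(n), Sigma_B, size=20).T
S = X @ X.T / 20.0

# Direct linear algebra ...
W = np.linalg.inv(Sigma_B)
trace_direct = np.trace(W @ S)
logdet_direct = np.linalg.slogdet(W)[1]

# ... versus the closed forms (7)-(9), block by block.
trace_blocks = 0.0
logdet_blocks = -n * np.log(alpha)                           # |W| = alpha^{-n} prod_b (1 + theta*n_b)^{-1}
for b in blocks:
    nb = len(b)
    Sbb = S[np.ix_(b, b)]
    Sbar = Sbb.sum()                                         # bar(S)_bb = 1' S_bb 1
    trace_blocks += (np.trace(Sbb) - theta / (1 + theta * nb) * Sbar) / alpha   # eq. (8)
    logdet_blocks -= np.log(1 + theta * nb)
print(np.allclose(trace_direct, trace_blocks), np.allclose(logdet_direct, logdet_blocks))
```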
3.2. The Centering Problem

In practice, however, there are two problems with the model described above: (i) we often do not observe S directly, but only a matrix of distances D. In the following we will assume that the (suitably pre-processed) matrix D contains squared Euclidean distances with components D_ij = S_ii + S_jj − 2S_ij. (ii) Even if we observe a dot-product matrix, we usually have no information about the mean vector µ. Note that we assumed that there exists a matrix X with XX^t = S such that the columns of X are independent copies drawn from a zero-mean Gaussian in R^n: x ~ N(µ = 0_n, Σ = Σ_B). This assumption is crucial, since general mean vectors correspond to a non-central Wishart model (Anderson, 1946), which imposes severe computational problems due to the appearance of the hypergeometric function. Both of the above problems are related in that they have to do with the lack of information about geometric transformations: assume we only observe S without access to the vectorial representations X_{n×d}. Then we have lost the information about orthogonal transformations X ← XO with OO^t = I_d, i.e. about rotations and reflections of the rows in X. If we only observe D, we have additionally lost the information about translations of the rows.

Our sampling model implies that the means in each row are expected to converge to zero as the number of replications d goes to infinity. Thus, if we had access to X and were not sure that the above zero-mean assumption holds, it might be a plausible strategy to subtract the empirical row means, X_{n×d} ← X_{n×d} − (1/d) X_{n×d} 1_d 1_d^t, and then to construct a candidate matrix S by computing the pairwise dot products. This procedure should be statistically robust if d ≫ n, since then the empirical means are probably close to their expected values. Such a matrix S fulfills two requirements for selecting candidate dot-product matrices: first, S should be "typical" with respect to the assumed Wishart model with µ = 0, thereby avoiding any bias introduced by a particular choice. Second, the choice should be robust in a statistical sense: if we are given a second observation from the same data source, the two selected prototypical matrices S_1 and S_2 should be similar. For small d, this procedure is dangerous, since it can introduce a strong bias even if the model is correct.

Consider now case (ii) where we observe S without access to X. Case (i) needs no special treatment, since it can be reduced to case (ii) by first constructing a positive semi-definite matrix S which fulfills D_ij = S_ii + S_jj − 2S_ij. For "correcting" the matrix S just as described above we would need a procedure which effectively subtracts the empirical row means from the rows of X. Unfortunately, there exists no such matrix transformation that operates directly on S without explicit construction of X. It is important to note that the "usual" centering transformation S ← QSQ with Q_ij = δ_ij − 1/n, as used in kernel PCA and related algorithms, does not work here: in kernel PCA the rows of X are assumed to be i.i.d. replications in R^d. Consequently, the centered matrix S_c is built by subtracting the column means, X_{n×d} ← X_{n×d} − (1/n) 1_n 1_n^t X_{n×d} and S_c = XX^t = QSQ. Here, we need to subtract the row means, and therefore it is inevitable to explicitly construct X, which implies that we have to choose a certain orthogonal transformation O. It might be reasonable to consider only rotations and to use the principal components as coordinate axes. This is essentially the kernel PCA embedding procedure: compute S_c = QSQ and its eigenvalue decomposition S_c = VΛV^t, and then project on the principal axes: X = VΛ^{1/2}.
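The difference between the two centerings can be seen directly in code: column-mean centering acts on S through QSQ, whereas the row-mean centering required by the zero-mean Wishart model needs an explicit X, and hence an arbitrary choice of rotation such as the kernel PCA embedding. A minimal illustrative sketch (not taken from the TIWD software):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 5
X = rng.normal(size=(n, d)) + 3.0                 # deliberately nonzero mean
S = X @ X.T
Q = np.eye(n) - np.ones((n, n)) / n

# Column-mean centering (the "usual" kernel-PCA centering): expressible on S alone.
X_col = X - X.mean(axis=0, keepdims=True)
print(np.allclose(Q @ S @ Q, X_col @ X_col.T))    # True: S_c = Q S Q

# Row-mean centering (what the zero-mean model would need): not expressible on S alone.
X_row = X - X.mean(axis=1, keepdims=True)
print(np.allclose(Q @ S @ Q, X_row @ X_row.T))    # False in general

# Recovering some X from S requires fixing a rotation, e.g. the kernel-PCA embedding X = V Lambda^{1/2}.
evals, V = np.linalg.eigh(Q @ S @ Q)
X_embed = V @ np.diag(np.sqrt(np.clip(evals, 0.0, None)))
print(np.allclose(X_embed @ X_embed.T, Q @ S @ Q))
```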
The problem with this vector-space embedding is that it is statistically robust in the above sense only if d is small, because otherwise the directions of the principal axes might be difficult to estimate, and the estimates for two replicated observations might fluctuate strongly, leading to different row-mean normalizations. Note that this condition for fixing the rotation contradicts the above condition d ≫ n that justifies the subtraction of the means. Further, row-mean normalization will change the pairwise dissimilarities D_ij (even if the model is correct!), and this change can be drastic if d is small. The cleanest solution might be to consider the dissimilarities D (which are observed in case (i) and computed as D_ij = S_ii + S_jj − 2S_ij in case (ii)) as the "reference" quantity, and to avoid an explicit choice of S and X altogether. Therefore, we propose to encode the translation invariance directly into the likelihood, which means that the latter becomes constant on all matrices S that fulfill D_ij = S_ii + S_jj − 2S_ij.

3.3. The Translation-invariant WD-Process

A squared Euclidean distance matrix D is characterized by the property of being of negative type, which means that x^t D x = −2 x^t S x ≤ 0 for any x with x^t 1 = 0. This condition is equivalent to the absence of negative eigenvalues in S_c = QSQ = −(1/2) QDQ. The distribution of D has been formally studied in (McCullagh, 2009), where it was shown that if S follows a standard Wishart generated from an underlying zero-mean Gaussian process, S ~ W_d(Σ_B), then −D follows a generalized Wishart distribution, −D ~ W(1, 2Σ_B) = W(1, −Δ), defined with respect to the transformation kernel K = 1, where Δ_ij = Σ_{B,ii} + Σ_{B,jj} − 2Σ_{B,ij}. To understand the role of the transformation kernel it is useful to introduce the notion of a generalized Gaussian distribution with kernel K = 1, X ~ N(1, µ, Σ): for any transformation L with L1 = 0, the meaning of this generalized Gaussian notation is

L X \sim N(L\mu,\; L \Sigma L^t).    (13)

It follows that under the kernel K = 1, two parameter settings (µ_1, Σ_1) and (µ_2, Σ_2) are equivalent if L(µ_1 − µ_2) = 0 and L(Σ_1 − Σ_2)L^t = 0, i.e. if µ_1 − µ_2 ∝ 1 and (Σ_1 − Σ_2) ∈ {1_n v^t + v 1_n^t : v ∈ R^n}, a space which is usually denoted by sym²(1 ⊗ R^n). It is also useful to introduce the distributional symbol S ~ W(K, Σ) for the generalized Wishart distribution of the random matrix S = XX^t when X ~ N(K, 0, Σ). The key observation in (McCullagh, 2009) is that D_ij = S_ii + S_jj − 2S_ij defines a linear transformation on symmetric matrices with kernel sym²(1 ⊗ R^n), which implies that the distances follow a generalized Wishart distribution with kernel 1: −D ~ W(1, 2Σ_B) = W(1, −Δ).

In the multi-dimensional case with spherical within- and between-class covariances we generalize the above model to Gaussian random matrices X ~ N(µ, Σ_B ⊗ I_d). Note that the d columns of this matrix are i.i.d. copies. The distribution of the matrix of squared Euclidean distances D then follows a generalized Wishart with d degrees of freedom, −D ~ W_d(1, −Δ). This distribution differs from a standard Wishart in that the inverse matrix W = Σ_B^{-1} is substituted by the matrix

\tilde{W} = W - (1^t W 1)^{-1}\, W 1 1^t W,

and the determinant |·| is substituted by a generalized det(·)-symbol which denotes the product of the nonzero eigenvalues of its matrix-valued argument (note that W̃ is rank-deficient). The conditional probability of a partition then reads

P(B \mid D, \cdot) \;\propto\; W_d(-D \mid 1, -\Delta) \cdot P_n(B \mid \xi, k) \;\propto\; \det(\tilde{W})^{\frac{d}{2}} \exp\!\left(\tfrac{d}{4}\,\mathrm{tr}(\tilde{W} D)\right) \cdot P_n(B \mid \xi, k).    (14)

Note that in spite of the fact that this probability is written as a function of W = Σ_B^{-1}, it is constant over all choices of Σ_B which lead to the same Δ, i.e. it is invariant under translations of the row vectors in X. For the purpose of inferring the partition B, this invariance property means that we can simply use our block-partition covariance model Σ_B and assume that the (unobserved) matrix S follows a standard Wishart distribution parametrized by Σ_B. We do not need to care about the exact form of S, since the conditional posterior for B depends only on D. Scale invariance can be built into the model with the same procedure as described above for the simple (i.e. not translation-invariant) WD-process.
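A short numerical check of this invariance, under the reconstruction of (14) given above: translating all rows of X changes S but leaves D, and hence the likelihood term, unchanged; the relation tr(W̃S) = −(1/2) tr(W̃D), which follows from W̃1 = 0, makes the two parametrizations interchangeable. This is an illustrative sketch only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha, theta = 6, 20, 1.0, 2.0
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]
B = np.zeros((n, n))
for b in blocks:
    B[np.ix_(b, b)] = 1.0
Sigma = alpha * (np.eye(n) + theta * B)
W = np.linalg.inv(Sigma)
one = np.ones(n)
W_tilde = W - np.outer(W @ one, W @ one) / (one @ W @ one)   # W~ = W - (1'W1)^{-1} W 1 1' W

evals = np.linalg.eigvalsh(W_tilde)
logdet_plus = np.sum(np.log(evals[evals > 1e-10]))           # generalized det: nonzero eigenvalues

X = rng.multivariate_normal(np.zeros(n), Sigma, size=d).T    # columns are i.i.d. N(0, Sigma)
X_shift = X + rng.normal(size=(1, d))                        # add the same translation to every row
S, S_shift = X @ X.T / d, X_shift @ X_shift.T / d
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S

# Likelihood term of (14), evaluated from S or equivalently from D (W~ 1 = 0 implies
# tr(W~ S) = -(1/2) tr(W~ D)); it does not change when the rows of X are translated.
ll_S       = 0.5 * d * logdet_plus - 0.5 * d * np.trace(W_tilde @ S)
ll_S_shift = 0.5 * d * logdet_plus - 0.5 * d * np.trace(W_tilde @ S_shift)
ll_D       = 0.5 * d * logdet_plus + 0.25 * d * np.trace(W_tilde @ D)
print(np.allclose(S, S_shift), np.isclose(ll_S, ll_S_shift), np.isclose(ll_S, ll_D))
# expected: False True True  (S changes under translation, the likelihood does not)
```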
The posterior of α again follows an inverse Gamma distribution, and after introducing a prior with parameters (s_0, r_0) and integrating out α we arrive at an expression analogous to (12), now with s = −(d/4) tr(W̃_{(α=1)} D):

P(B \mid \cdot) \;\propto\; P_n(B \mid \xi, k)\, \det\!\big(\tilde{W}_{(\alpha=1)}\big)^{\frac{d}{2}}\, (s + s_0)^{-\left(\frac{nd}{2} + r_0\right)}.    (15)

3.4. Efficient Inference via Gibbs Sampling

In Gibbs sampling one iteratively samples parameter values from the full conditionals. Our model includes the following parameters: the partition B, the scale α, the covariance parameter θ, the number k of clusters in the population, the Dirichlet rate ξ, and the degrees of freedom d. We propose to fix d, ξ and k: the degrees of freedom d might be estimated by the rank of S, which is often known from a pre-processing procedure. Note that d is not a very critical parameter, since all likelihood contributions are basically raised to the power of d. Thus, d might be used as an annealing-type parameter for "freezing" a representative partition in the limit d → ∞. Concerning the number k of clusters in the population, there are two possibilities. Either one assumes k = ∞, which results in the Ewens-process model, or one expects a finite k. Our framework is applicable to both scenarios. Estimation of k, however, is nontrivial if no precise knowledge about ξ is available. Unfortunately, this is usually the case, and k = ∞ might be a plausible assumption in many applications. Alternatively, one might fix k to a large constant which serves as an upper bound of the expected number, which can be viewed as truncating the Ewens process. The Dirichlet rate ξ is difficult to estimate, since it only weakly influences the likelihood. Consistent ML estimators only exist for k = ∞, namely ξ̂ = k_B / log n, and even in this case the variance only decays like 1/log(n), cf. (Ewens, 1972). In practice, we should not expect to be able to reliably estimate ξ. Rather, we should have some intuition about ξ, maybe guided by the observation that under the Ewens process model the probability of two objects belonging to the same cluster is 1/(1 + ξ). We can then either define an appropriate prior distribution, or we can fix ξ. Due to the weak effect of ξ on the conditionals, these approaches are usually very similar. The scale α can be integrated out analytically. The likelihood in θ is not of recognized form, and we propose to use a discretized prior set {p(θ_j)}, j = 1, ..., J, for which we compute the posteriors {p(θ_j | ·)}. A new value of θ is then sampled from the categorical distribution defined by {p(θ_j | ·)}. We define a sweep of the Gibbs sampler as one complete update of (B, θ).
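To illustrate the discretized θ update, the sketch below evaluates posterior weights on a grid {θ_j} and samples from the resulting categorical distribution. It follows the form of eq. (15) as reconstructed above; the prior parameters r0, s0, the grid values, and the helper names are hypothetical choices for this sketch, not part of the TIWD software.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_weight(theta, D, blocks, d, r0=1.0, s0=1.0):
    """Theta- and partition-dependent part of (15), with alpha integrated out (W taken at alpha = 1)."""
    n = D.shape[0]
    B = np.zeros((n, n))
    for b in blocks:
        B[np.ix_(b, b)] = 1.0
    W = np.linalg.inv(np.eye(n) + theta * B)
    one = np.ones(n)
    W_t = W - np.outer(W @ one, W @ one) / (one @ W @ one)   # translation-invariant W-tilde
    evals = np.linalg.eigvalsh(W_t)
    logdet_plus = np.sum(np.log(evals[evals > 1e-10]))       # generalized determinant
    s = -0.25 * d * np.trace(W_t @ D)                        # s = -(d/4) tr(W-tilde D) >= 0
    return 0.5 * d * logdet_plus - (0.5 * n * d + r0) * np.log(s + s0)

def resample_theta(theta_grid, log_prior, D, blocks, d):
    """Draw a new theta from the categorical distribution over the discretized prior set."""
    logp = np.array([lp + log_weight(t, D, blocks, d) for t, lp in zip(theta_grid, log_prior)])
    p = np.exp(logp - logp.max())
    return rng.choice(theta_grid, p=p / p.sum())

# Toy usage on a small synthetic distance matrix with two blocks.
n, d = 6, 30
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]
X = rng.normal(size=(n, d))
X[3:] += 4.0
S = X @ X.T / d
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S
theta_grid = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
theta_new = resample_theta(theta_grid, np.zeros(len(theta_grid)), D, blocks, d)
```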
The most time-consuming part of a sweep is the update of B by re-estimating the assignment to blocks of a single object (characterized by a row/column in D), given the partition of the remaining objects. To do so we have to compute the membership probabilities for all existing blocks (and for a new block) by evaluating equation (15), which looks formally similar to (12), but a factorization over blocks is no longer obvious. Every time a new partition is analyzed, a naive implementation requires O(n^3) operations for computing the determinant of W̃ and the product W̃D. In one sweep we need to compute k_B such probabilities for n objects, summing up to costs of O(n^4 k_B). However, a more efficient algorithm exists:

Theorem 1. Assuming k_B blocks in the actual partition and a fixed maximum iteration number in numerical root-finding, a sweep of the Gibbs sampler for the translation-invariant WD model can be computed in O(n^2 + n k_B^2) time.

Proof. Assume we want to compute the membership probabilities of the l-th object, given the partition of the remaining objects and all other parameter values. We first have to downdate all quantities which depend on object l and its current block, and then compute the assignment probabilities for all blocks (and a new one). From the resulting categorical distribution we sample a new assignment (say block c) and update all quantities depending on object l and block c. We repeat this procedure for all objects l = 1, ..., n. Since up- and down-dating are reverse to each other but otherwise identical operations, it suffices to consider the updating situation. To compute the membership probabilities we have to assign the new object to a block and evaluate (15) for the augmented matrix D_+, which has one additional row and column. For notational simplicity we will drop the subscript. Eq. (15) has two components: the prior P_n(B|ξ, k) and the likelihood term, which requires us to compute det(W̃_{(α=1)}) and tr(W̃D). With the identity Γ(x+1) = xΓ(x) in (1), the contribution of the prior is n_c + ξ/k for existing clusters and ξ(1 − k_B/k) for a new cluster (one simply sets k = ∞ for the Ewens process).

For the likelihood term, consider first the generalized determinant det(W̃) in (15). Since W̃ = W − (1^t W 1)^{-1} W11^tW, we have to compute κ := (1^t W 1)^{-1} for the augmented matrix W after assigning the new object l to block c. Analyzing (7) one derives κ^{-1} = Σ_{b∈B} n_b λ_b, where λ_b = (1 + θn_b)^{-1} are the k_B smallest eigenvalues of W_{(α=1)}, see eq. (9). Thus, we increase n_c, recompute λ_c and update κ. Given κ, we need to compute the eigenvalues of W − κ W11^tW =: W − vv^t, where the latter term defines a rank-one update of W. Analyzing the characteristic polynomial, it is easily seen that the (size-ordered) eigenvalues λ̃_i of W̃ fulfill three conditions, see (Golub & Van Loan, 1989): (i) the smallest eigenvalue is zero, λ̃_1 = 0; (ii) the largest n − k_B eigenvalues are identical to their counterparts in W, λ̃_i = λ_i for i = k_B + 1, ..., n; (iii) for the remaining eigenvalues with indices i = 2, ..., k_B it holds that if λ_i is a repeated eigenvalue of W, then λ̃_i = λ_i; otherwise, they are the simple roots of the secular equation

f(y) = \kappa^{-1} + \sum_{j=1}^{k_B} \frac{n_j \lambda_j^2}{y - \lambda_j},

fulfilling the interlacing relations λ_i < λ̃_{i+1} < λ_{i+1}. Note that f can be evaluated in O(k_B) time, and with a fixed maximum number of iterations in the root-finding procedure, det(W̃) can be computed in O(k_B). A sweep involves n "new" objects and k_B blocks. Thus, the costs for the determinant sum up to O(n k_B^2):

for i = 1 to n do
  for c = 1 to k_B do
    n_c ← n_c + 1, recompute λ_c and update κ        O(1)
    Find roots of the secular equation               O(k_B)
  end for
end for
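As an aside to the proof, the eigenvalue characterization and the secular equation can be checked numerically. The sketch below uses the reconstruction given here, with κ := (1^t W 1)^{-1} and α = 1; it is an illustration, not part of the sampler.

```python
import numpy as np

# Blocks of distinct sizes, so the k_B smallest eigenvalues of W are simple; alpha = 1.
block_sizes = [4, 3, 2]                                      # k_B = 3, n = 9
theta = 1.5
n, kB = sum(block_sizes), len(block_sizes)
lam = 1.0 / (1.0 + theta * np.array(block_sizes, dtype=float))   # block eigenvalues lambda_b

B = np.zeros((n, n))
start = 0
for nb in block_sizes:
    B[start:start + nb, start:start + nb] = 1.0
    start += nb
W = np.linalg.inv(np.eye(n) + theta * B)
one = np.ones(n)
kappa_inv = one @ W @ one                                    # kappa^{-1} = 1' W 1
W_tilde = W - np.outer(W @ one, W @ one) / kappa_inv

def f(y):
    """Secular equation f(y) = kappa^{-1} + sum_b n_b lambda_b^2 / (y - lambda_b)."""
    return kappa_inv + np.sum(np.array(block_sizes) * lam**2 / (y - lam))

evals = np.sort(np.linalg.eigvalsh(W_tilde))
print(np.isclose(kappa_inv, np.sum(np.array(block_sizes) * lam)))    # identity used in the proof
print(np.isclose(evals[0], 0.0))                                     # smallest eigenvalue is zero
print(np.allclose(evals[kB:], 1.0))                                  # largest n - k_B eigenvalues unchanged
print(np.allclose([f(y) for y in evals[1:kB]], 0.0, atol=1e-8))      # the rest are roots of f
```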
For the trace tr(W̃D) we have to compute

\mathrm{tr}(\tilde{W} D) = \mathrm{tr}(W D) - \kappa \cdot \mathrm{tr}(W 1 1^t W D) = \mathrm{tr}(W D) - \kappa \cdot 1^t W D W 1.    (16)

We first precompute, for all a ∈ B, D̄_ia = Σ_{j∈a} D_ij, which induces O(n) costs since there are n summations in total. The first term in (16) is tr(WD) = Σ_{b∈B} (tr(D_bb) − θ/(1+θn_b) D̄_bb). Defining D̄_ab := 1_a^t D_ab 1_b and ω_a := θn_a/(1+θn_a), we first update D̄ by recomputing its c-th row/column (update ω_c and, for all a ∈ B, D̄_ac ← D̄_ac + D̄_ia + D_ii δ_{a,c}) in O(k_B) time, and then update the c-th term of tr(WD) in constant time. The second term in (16) then reads

1^t W D W 1 = \sum_{a,b \in B} 1_a^t W_a D_{ab} W_b 1_b =: \sum_{a,b \in B} \gamma_{ab}, \qquad \gamma_{ab} = \bar{D}_{ab} - \omega_a \bar{D}_{ab} - \omega_b \bar{D}_{ab} + \omega_a \omega_b \bar{D}_{ab}.    (17)

Since we have already updated κ and D̄, it requires O(k_B) time to update the c-th row of (γ_ab). In a sweep, the costs for the trace sum up to O(n^2 + n k_B^2):

for i = 1 to n do
  for all a ∈ B: D̄_ia = Σ_{j∈a} D_ij                 O(n)
  for c = 1 to k_B do
    Update D̄                                         O(k_B)
    Recompute the c-th term in tr(WD)                 O(1)
    Compute, for all a ∈ B, γ_ac = γ_ca               O(k_B)
  end for
end for

The sweep is completed by resampling θ from a discrete set with J levels, which induces costs of O(k_B^2).

From the above theorem it follows that the worst-case complexity of one sweep is O(n^3) in the infinite mixture (i.e. Ewens process) model, since k_B ≤ n, and O(n^2) for the truncated Dirichlet process with k_B ≤ k < ∞. If the "true" k is finite but one still uses the infinite model, it is very unlikely to observe the worst-case O(n^3) behaviour in practice: if the sampler is initialized with a one-block partition (i.e. k_B = 1), the trace of k_B typically shows an "almost monotone" increase during burn-in, see Figure 3.

3.5. Model Extensions

One possible extension of the TIWD cluster process is to include a preprocessing step. From the model assumption S ~ W(Σ_B) it follows that if B contains k_B blocks and if the separation between the clusters (i.e. θ) is not too small, there will be only k_B dominating eigenvalues in S. Thus, one might safely apply kernel PCA to the centered matrix S_c = −(1/2)QDQ, i.e. compute S_c = VΛV^t, consider only the first k̃ "large" eigenvalues in Λ for computing a low-rank approximation S̃_c = VΛ̃V^t, and switch back to dissimilarities via D̃_ij = (S̃_c)_ii + (S̃_c)_jj − 2(S̃_c)_ij. Such preprocessing might be particularly helpful in cases where S_c = −(1/2)QDQ contains some negative eigenvalues of relatively small magnitude. Then, the low-rank approximation might be positive semi-definite, so that D̃ contains squared Euclidean distances. Such situations occur frequently if the dissimilarities stem from pairwise comparison procedures which can be interpreted as approximations to models which are guaranteed to produce Mercer kernels. A popular example are classical string alignments, which might be viewed as approximations of probabilistic alignments using pairwise hidden Markov models. We present such an example in section 4. The downside of kernel PCA is the added cost of O(n^3), but randomized approximation methods have been introduced which significantly reduce these costs. In our TIWD software we have implemented a "symmetrized" version of the random projection algorithm for low-rank matrix approximation proposed in (Vempala, 2004), which uses the idea proposed in (Belabbas & Wolfe, 2007).
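A direct (non-randomized) version of this preprocessing step is easy to state: the sketch below computes the centered matrix S_c = −(1/2)QDQ, truncates it to its k̃ leading eigenvalues, and maps back to dissimilarities. It is a simplified stand-in for the randomized procedure of (Vempala, 2004; Belabbas & Wolfe, 2007) used in the TIWD software, not that procedure itself.

```python
import numpy as np

def lowrank_denoise(D, k):
    """Truncate the centered matrix S_c = -(1/2) Q D Q to its k leading eigenvalues and
    map back to dissimilarities; a plain (non-randomized) stand-in for the preprocessing above."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    Sc = -0.5 * Q @ D @ Q
    evals, V = np.linalg.eigh(Sc)
    idx = np.argsort(evals)[::-1][:k]                        # k largest eigenvalues
    vals = np.clip(evals[idx], 0.0, None)                    # drop small negative eigenvalues
    Sc_k = (V[:, idx] * vals) @ V[:, idx].T
    diag = np.diag(Sc_k)
    return diag[:, None] + diag[None, :] - 2 * Sc_k          # squared Euclidean distances again

# Toy usage: perturb an exact distance matrix so that it is slightly "non-Euclidean".
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
D_noisy = D + 0.05 * np.abs(rng.normal(size=D.shape))
D_noisy = (D_noisy + D_noisy.T) / 2
np.fill_diagonal(D_noisy, 0.0)
D_tilde = lowrank_denoise(D_noisy, k=3)
Q = np.eye(30) - np.ones((30, 30)) / 30
print(np.linalg.eigvalsh(-0.5 * Q @ D_tilde @ Q).min() > -1e-8)   # D_tilde is of negative type
```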
Another extension of the model concerns semi-supervised situations in which class labels, i.e. assignments to k_m groups, are known for a subset of n_m observations. We denote this subset by the set of row indices A = {1, ..., n_m}. Traditional semi-supervised learning methods assume that at least one labeled object per class is observed, i.e. that the number of classes is known. This assumption, however, is questionable in many real-world examples. We overcome this limitation by simply fixing the assignment to blocks for objects in A during the sampling procedure, and re-estimating only the assignments for the unlabeled objects {n_m + 1, ..., n}. Using an Ewens process model with k = ∞ (or a truncated version thereof with k > k_m), the model has the freedom to introduce new classes if some objects do not resemble any labeled observation. We present such an example below, where we consider protein sequences with experimentally confirmed labels (the "true" labels) and others with only machine-predicted labels (which we treat as unlabeled objects).

4. Experiments

In a first experiment we compare the proposed TIWD cluster process with several hierarchical clustering methods on synthetic data, generated as follows: (i) a random block-partition matrix B of size n = 500 is sampled with k_B = 10; (ii) d = 100 samples from N(0_n, Σ) are drawn, with Σ = αI_n + θB, α = 2, and different θ-values; (iii) the squared Euclidean distances are stored in the matrix D_(n×n). A two-dimensional kernel PCA projection of an example distance matrix is shown in the left panels of Fig. 2 (large, clearly separated clusters in the upper panel, and small, highly overlapping clusters in the lower panel). 5000 Gibbs sweeps are computed for the TIWD cluster process (after a burn-in phase of 2000 sweeps), followed by an annealing procedure to "freeze" a certain partition, cf. section 3.4. For comparison, several hierarchical clustering methods are applied: "Ward's", "complete linkage", "single linkage" and "average linkage", see (Jain & Dubes, 1988), and the resulting trees are cut at the same number of clusters as found by TIWD. The right panels of Fig. 2 show the agreement of the inferred partitions with the true labels, measured in terms of the adjusted Rand index. If the clusters are well separated, all methods perform very well, but for highly overlapping clusters TIWD shows significant advantages over the hierarchical methods.
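Steps (i)-(iii) can be reproduced in a few lines; the sketch below follows the description above, with the covariance written in the (α, θ) parametrization used in this text. The exact sampling scheme for the random partition in the original experiments is not specified, so a uniform labeling is used here as one simple choice.

```python
import numpy as np

rng = np.random.default_rng(6)

def synthetic_distances(n=500, kB=10, d=100, alpha=2.0, theta=15.0 / 100):
    """Steps (i)-(iii) above. Parameter names follow the (alpha, theta) reconstruction used
    in this text; the partition is drawn by uniform labeling, one simple choice among many."""
    labels = rng.integers(0, kB, size=n)                     # (i) random partition with <= kB blocks
    B = (labels[:, None] == labels[None, :]).astype(float)
    Sigma = alpha * np.eye(n) + theta * B
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=d)  # (ii) d samples from N(0_n, Sigma)
    S = X.T @ X / d
    D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S    # (iii) squared Euclidean distances
    return D, labels

D, true_labels = synthetic_distances()
# A clustering of D can then be scored against true_labels with the adjusted Rand index.
```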
Figure 2. TIWD vs. hierarchical clustering ("Ward's", "complete linkage", "single linkage", "average linkage") on synthetic data (k = 10, n = 500, d = 100, repeated 20 times). [Box-plots of the adjusted Rand index for the panels θ = 100/dim and θ = 15/dim.]

In a second experiment we investigate the scalability of the algorithm. The "small θ" experiment above (lower panels in Fig. 2) is repeated for n = 8000. Figure 3 depicts the trace of the number of blocks k_B during sampling. The sampler stabilizes after roughly 500 sweeps. Note the remarkable stability of the sampler (compared to the usual situation in "traditional" mixture models), which follows from the fact that no label switching can appear. On a standard computer, this experiment took roughly two hours, which leads us to the conclusion that the proposed sampling algorithm is so efficient (at least for moderate k) that memory constraints are probably more severe than time constraints on standard hardware.

Figure 3. Trace plot of the number of blocks k_B during the Gibbs sweeps for a large synthetic dataset (10 clusters, n = 8000).

In a next experiment we analyze the influence of encoding the translation invariance into the likelihood (our TIWD model) versus the standard WD process and row-mean subtraction as described in section 3.2. A similar random procedure for generating distance matrices is used, but this time we vary the number of replications d and the mean vector µ. If µ = 0_n, both the standard WD process and the TIWD process are expected to perform well, which is confirmed in the 1st and 3rd panels of Fig. 4 (left and right box-plots). Row-mean subtraction, however, introduces significant bias and variance. For nonzero mean vectors (2nd and 4th panels), standard WD completely fails to detect the cluster structure, and row-mean subtraction can only partially overcome this problem. The TIWD process clearly outperforms the other models for nonzero mean vectors.

Figure 4. Comparison of the WD and TIWD cluster processes on synthetic data. "WD": standard WD, "WD R": WD with row-mean subtraction. Left to right: (i) d = 3, µ = 0; (ii) d = 3, µ_i ~ N(40, 0.1); (iii) and (iv): same for d = 100. [Box-plots of the adjusted Rand index.]

In a last experiment we consider a semi-supervised application example in which we study all globin-like protein sequences from the UniProtKB database (with experimentally confirmed annotations) and the TrEMBL database (with unconfirmed annotations). The former set consists of 1168 sequences which fall into 114 classes.
These sequences form the "supervised" subset, and their assignments to blocks in the partition matrix are "clamped" in the Gibbs sampler. The latter set contains 2603 sequences which are treated as the "unlabeled" observations. Pairwise local string alignment scores s_ij are computed between all sequences and transformed into dissimilarities using an exponential transform. The resulting dissimilarity matrix D is not guaranteed to be of negative type (and indeed, −QDQ has some small negative eigenvalues). We overcome this problem by using the randomized low-rank approximation technique according to (Vempala, 2004; Belabbas & Wolfe, 2007), cf. section 3.5, which effectively translates D into a matrix D̃ which is of negative type. The Ewens process model makes it possible to assign the unlabeled objects to existing classes or to new ones. In the end, almost all unlabeled objects are assigned to existing classes, with the exception of three new classes which have an interesting biological interpretation. Two of these classes contain globin-like bacterial sequences from Actinomycetales, a very special group of obligate aerobic bacteria which have to cope with oxidative stress. The latter might explain the existence of redox domains in the globin sequences, like the Ferredoxin reductase-type (FAD)-binding domain observed in all sequences in one of the clusters and the additional Nicotinamide adenine dinucleotide (NAD)-binding domain present in all sequences in the second new cluster, see Figure 5. Some of the latter sequences appear to be similar to another class that also contains Actinomycetales (see the large "off-diagonal" probabilities surrounded by the blue circle in Figure 5), which, however, share a different pattern around some heme binding sites. The third new class contains short sequence fragments which show a certain variant of the hemoglobin beta subunit. With the exception of the above-mentioned similarity of one of the Actino-bacterial classes to another one, the three new classes show no similarity to any of the other classes, which nicely demonstrates the advantage of a semi-supervised learning model that is flexible enough to allow the creation of new groups.

Figure 5. Co-membership probabilities of globin proteins. [Highlighted groups: "Actinomycetales (FAD/NAD-binding domain, different globin domain sub-structure)", "Actinomycetales (FAD)", "Actinomycetales (FAD/NAD-binding)", "Hemoglobin beta subunit variant".]

5. Conclusion

We introduced a flexible probabilistic model for clustering dissimilarity data. It contains an exchangeable partition process prior which avoids label-switching problems. The likelihood component follows a generalized Wishart model for squared Euclidean distance matrices which is invariant under translations and rotations, under permutations, and under scaling transformations. We call this clustering model the Translation-Invariant Wishart-Dirichlet (TIWD) cluster process. The main contributions of this work are threefold: (i) On the modelling side, we propose that it is better to work directly on the distances, without computing an explicit dot-product or vector-space representation, since such embeddings add unnecessary bias and variance to the inference process. Experiments on simulated data corroborate this proposition by showing that the TIWD model significantly outperforms alternative approaches. In particular, if the clusters are only poorly separated, the fully probabilistic nature of the TIWD model has clear advantages over hierarchical approaches. (ii) On the algorithmic side, we show that costly matrix operations can be avoided by carefully exploiting the inner structure of the likelihood term. We prove that a sweep of a Gibbs sampler can be computed in O(n^2 + n k_B^2) time, as opposed to O(n^4 k_B) for a naive implementation. Experiments show that these algorithmic improvements make it possible to apply the model to large-scale datasets. (iii) A semi-supervised experiment with globin proteins revealed the strength of our partition process model, which is flexible enough to introduce new classes for objects which are dissimilar to any labeled observation. We could identify an interesting class of bacterial sequences, and a subsequent analysis of their domain structure showed that these sequences indeed share some unusual structural elements.

We have implemented a software package for the TIWD model which links efficient C++ MCMC code to a user-friendly R interface. We will make this package (including the datasets used in this paper) available on mloss.org.

Acknowledgement. This work has been partially supported by the FP7 EU project SIMBAD.
References

Anderson, T.W. The non-central Wishart distribution and certain problems of multivariate statistics. Ann. Math. Statist., 17(4):409-431, 1946.

Belabbas, M.A. and Wolfe, P.J. Fast low-rank approximation for covariance matrices. In IEEE Workshop on Computational Advances in Multi-Sensor Processing, pp. 293-296, 2007.

Blei, D. and Jordan, M. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121-144, 2006.

Dahl, D.B. Sequentially-allocated merge-split sampler for conjugate and non-conjugate Dirichlet process mixture models. Technical report, Texas A&M University, 2005.

Ewens, W.J. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87-112, 1972.

Golub, G.H. and Van Loan, C.F. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, USA, 1989.

Jain, A. and Dubes, R. Algorithms for Clustering Data. Prentice Hall, 1988.

Jasra, A., Holmes, C.C., and Stephens, D.A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci., 20(1):50-67, 2005.

MacEachern, S.N. Estimating normal means with a conjugate-style Dirichlet process prior. Communications in Statistics: Simulation and Computation, 23:727-741, 1994.

McCullagh, P. Marginal likelihood for distance matrices. Statistica Sinica, 19:631-649, 2009.

McCullagh, P. and Yang, J. How many clusters? Bayesian Analysis, 3:101-120, 2008.

Neal, R.M. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.

Pitman, J. Combinatorial stochastic processes. In Picard, J. (ed.), Ecole d'Ete de Probabilites de Saint-Flour XXXII-2002. Springer, 2006.

Srivastava, M.S. Singular Wishart and multivariate beta distributions. Annals of Statistics, 31(2):1537-1560, 2003.

Vempala, S. The Random Projection Method. Series in Discrete Mathematics and Theoretical Computer Science. AMS, 2004.