Convex Clustering with Exemplar-Based Models

Anonymous Author(s)
Affiliation
Address
email

Abstract

Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering.

1 Introduction

Clustering is one of the most basic problems of unsupervised learning, with applications in a wide variety of fields. The input is either vectorial data, that is, vectors of data points in the feature space, or proximity data, the pairwise similarity or dissimilarity values between the data points. The choice of the clustering cost function and the optimization algorithm employed to solve the problem determines the resulting clustering [1]. Intuitively, most methods seek compact clusters of data points, namely, clusters with relatively small intra-cluster and high inter-cluster distances. Other approaches, such as Spectral Clustering [2], look for clusters of more complex shapes lying on some low-dimensional manifolds in the feature space. These methods typically transform the data such that the manifold structures get mapped to compact point clouds in a different space. Hence, they do not remove the need for efficient compact-cluster-finding techniques such as k-means.

The widely used soft k-means method is an instance of maximum likelihood fitting of a mixture model through the EM algorithm. Although this approach yields satisfactory results for problems with a small number of clusters and is relatively fast, its use of a gradient-descent algorithm for minimization of a cost function with many local optima makes it sensitive to initialization. As the search space grows, that is, as the number of data points or clusters increases, it becomes harder to find a good initialization. This problem often arises in emerging applications of clustering for large biological data sets such as gene-expression data. Typically, one runs the algorithm many times with different random initializations and selects the best solution. More sophisticated initialization methods have been proposed to improve the results, but the challenge of finding a good initialization for the EM algorithm remains [4].

We aim to circumvent the initialization procedure by designing a convex problem whose global optimum can be found with a simple algorithm. It has been shown that mixture modeling can be formulated as an instance of iterative distance minimization between two sets of probability distributions [3].
This formulation shows that the non-convexity of the mixture modeling cost function comes from the parametrization of the model components. More precisely, any mixture model is, by definition, a convex combination of some set of distributions. However, for a fixed number of mixture components, the set of all such mixture models is usually not convex when the distributions have, say, free mean parameters in the case of normal distributions.

Inspired by combinatorial, non-parametric methods such as k-medoids [5] and affinity propagation [6], our main idea is to employ the notion of exemplar finding, namely, finding the data points which could best describe the data set. We assume that the clusters are dense enough such that there is always a data point very close to the real cluster centroid and, thus, restrict the set of possible cluster means to the set of data points. Further, by taking all data points as exemplar candidates, the modeling cost function becomes convex. A variant of the EM algorithm finds the globally optimal solution. Convexity of the cost function means that the algorithm will unconditionally converge to the global minimum. Moreover, since the number of clusters is not specified a priori, the algorithm automatically finds the number of clusters depending only on one temperature-like parameter. This parameter, which is equivalent to a common fixed variance in the case of Gaussian models, defines the width scale of the desired clusters in the feature space. Our method works in exactly the same way with both proximity and vectorial data, unifying their treatment and providing insights into the modeling assumptions underlying the conversion of feature vectors into pairwise proximity data.

In the next section, we introduce our maximum likelihood function and the algorithm that maximizes it. In Section 3, we make a connection to rate-distortion theory as a way to build intuition about our objective function. Section 4 presents implementation details of our algorithm. Experimental results comparing our method with similar mixture model fitting methods are presented in Section 5, followed by a discussion of the algorithm and the related work in Section 6.

2 Convex Cost Function

Given a set of data points X = {x_1, ..., x_n} ⊂ R^d, mixture model clustering seeks to maximize the scaled log-likelihood function

l(\{q_j\}_{j=1}^{k}, \{m_j\}_{j=1}^{k} ; X) = \frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{k} q_j f(x_i; m_j),    (1)

where f(x; m) is an exponential family distribution on the random variable X. It has been shown that there is a bijection between regular exponential families and a broad family of divergences called Bregman divergences [7]. Most of the well-known distance measures, such as the squared Euclidean distance or the Kullback-Leibler divergence (KL-divergence), are included in this family. We employ this relationship and let our model be an exponential family distribution on X of the form f(x; m) = C(x) exp(−d(x, m)), where d is some Bregman divergence and C(x) is independent of m. Note that with this representation, m is the expected value of X under the distribution f(x; m). For instance, taking the squared Euclidean distance as the divergence, we obtain the normal distribution as our model f. In this work, we take models of the above form whose parameters m lie in the same space as the data vectors. Thus, we can restrict the set of mixture components to the distributions centered at the data points, i.e., m_j ∈ X.
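To make the correspondence between Bregman divergences and exponential family models concrete, the following sketch (our own illustration, not code from the paper) checks numerically that with the squared Euclidean divergence scaled by a common variance, the model f(x; m) = C(x) exp(−d(x, m)) is exactly an isotropic Gaussian; the base measure C(x) cancels whenever two models are compared at the same point.

```python
import numpy as np

def bregman_sq_euclidean(x, m, sigma2=1.0):
    """Squared Euclidean Bregman divergence scaled by a common variance sigma2."""
    return np.sum((x - m) ** 2) / (2.0 * sigma2)

def gaussian_density(x, m, sigma2=1.0):
    """Isotropic Gaussian density with mean m and variance sigma2 in each dimension."""
    d = x.shape[0]
    return (2.0 * np.pi * sigma2) ** (-d / 2.0) * np.exp(-np.sum((x - m) ** 2) / (2.0 * sigma2))

rng = np.random.default_rng(0)
x, m1, m2 = rng.normal(size=(3, 5))

# The ratio f(x; m1) / f(x; m2) depends only on the divergences, since C(x) cancels.
ratio_density = gaussian_density(x, m1) / gaussian_density(x, m2)
ratio_bregman = np.exp(bregman_sq_euclidean(x, m2) - bregman_sq_euclidean(x, m1))
print(np.isclose(ratio_density, ratio_bregman))  # True
```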
Yet, for a specified number of clusters k, the problem still has a combinatorial nature of choosing the right k cluster centers among n data points. To avoid this problem, we increase the number of possible components to n and represent all data points as cluster-center candidates. The new log-likelihood function is

l(\{q_j\}_{j=1}^{n} ; X) = \frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{n} q_j f_j(x_i) = \frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{n} q_j e^{-\beta d(x_i, x_j)} + \mathrm{const.},    (2)

where f_j(x) is an exponential family member with its expectation parameter equal to the j-th data vector and the constant denotes a term that does not depend on the unknown variables {q_j}_{j=1}^n. The constant scaling factor β in the exponent controls the sharpness of the mixture components. We maximize l(·; X) over the set of all mixture distributions Q = { Q(·) | Q(·) = Σ_{j=1}^n q_j f_j(·) }.

The log-likelihood function (2) can be expressed in terms of the KL-divergence by defining P̂(x) = 1/n, x ∈ X, to be the empirical distribution of the data on R^d and by noting that

D(\hat{P} \,\|\, Q) = -\sum_{x \in X} \hat{P}(x) \log Q(x) - H(\hat{P}) = -l(\{q_j\}_{j=1}^{n} ; X) + \mathrm{const.},    (3)

where H(P̂) is the entropy of the empirical distribution and does not depend on the unknown mixture coefficients {q_j}_{j=1}^n. Consequently, the maximum likelihood problem can be equivalently stated as the minimization of the KL-divergence between P̂ and the set of mixture distributions Q. It is easy to see that, unlike the unconstrained set of mixture densities considered by the likelihood function (1), the set Q is convex. Our formulation therefore leads to a convex minimization problem. Furthermore, it is proved in [3] that for such a problem, the sequence of distributions Q^{(t)} with corresponding weights {q_j^{(t)}}_{j=1}^n defined iteratively via

q_j^{(t+1)} = q_j^{(t)} \sum_{x \in X} \frac{\hat{P}(x) f_j(x)}{\sum_{j'=1}^{n} q_{j'}^{(t)} f_{j'}(x)}    (4)

is guaranteed to converge to the global optimum solution Q* if the support of the initial distribution is the entire index set, i.e., q_j^{(0)} > 0 for all j.

3 Connection to Rate-Distortion Problems

Now, we present an equivalent statement of our problem on the product set of exemplars and data points. This alternative formulation views our method as an instance of lossy data compression and directly implies the optimality of the algorithm (4). The following proposition is introduced and proved in [3]:

Proposition 1. Let Q° be the set of distributions of the complete data random variable (J, X) ∈ {1, ..., n} × R^d with elements Q°(j, x) = q_j f_j(x). Let P° be the set of all distributions on the same random variable (J, X) which have P̂ as their marginal on X. Then,

\min_{Q \in \mathcal{Q}} D(\hat{P} \,\|\, Q) = \min_{P^{\circ} \in \mathcal{P}^{\circ},\, Q^{\circ} \in \mathcal{Q}^{\circ}} D(P^{\circ} \,\|\, Q^{\circ}),    (5)

where Q is the set of all marginal distributions of elements of Q° on X. Furthermore, if Q* and (P°*, Q°*) are the corresponding optimal arguments, Q* is the marginal of Q°*.

This proposition implies that we can express our problem of minimizing (3) as minimization of D(P° ‖ Q°), where P° and Q° are distributions of the random variable (J, X). Specifically, we define

P°(j, x) = P̂(x) P(j|x) = (1/n) r_{ij} if x = x_i ∈ X, and 0 otherwise;    Q°(j, x) = q_j C(x) e^{−β d(x, x_j)},    (6)

where q_j and r_{ij} = P(j | x = x_i) are probability distributions over the index set {1, ..., n}. This formulation ensures that P° and Q° belong to the sets defined in Proposition 1 and that the objective function is expressed only in terms of the variables q_j and P(j|x) for x ∈ X. Our goal is then to solve the minimization problem in the space of distributions of the random variable (J, I) ∈ {1, ..., n} × {1, ..., n}, namely, in the product space of exemplar × data point indices.
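For intuition, the following sketch (our own numerical check, with variable names of our choosing) verifies that when r_{ij} is chosen proportional to q_j f_j(x_i) in (6), the joint divergence D(P° ‖ Q°) coincides, up to constants, with the mixture divergence D(P̂ ‖ Q), as Proposition 1 promises.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 6, 0.5
X = rng.normal(size=(n, 2))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise divergences d(x_i, x_j)
F = np.exp(-beta * D2)                      # F[i, j] = f_j(x_i), up to the factor C(x_i)
q = rng.dirichlet(np.ones(n))               # arbitrary mixture weights over exemplar candidates

# D(P_hat || Q), up to constants: the average negative log mixture likelihood.
kl_marginal = -np.mean(np.log(F @ q))

# Optimal P°: r_ij proportional to q_j f_j(x_i), i.e., the posterior over exemplars.
R = (q[None, :] * F) / (F @ q)[:, None]

# D(P° || Q°), up to the same constants, assembled directly from definition (6).
kl_joint = np.sum((R / n) * (np.log(R / q[None, :]) + beta * D2))

print(np.allclose(kl_marginal, kl_joint))   # True: the two minima coincide
```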
Substituting expressions (6) into the KL-divergence D(P° ‖ Q°), we obtain the equivalent cost function:

D(P^{\circ} \,\|\, Q^{\circ}) = \frac{1}{n} \sum_{i,j=1}^{n} r_{ij} \left[ \log \frac{r_{ij}}{q_j} + \beta d(x_i, x_j) \right] + \mathrm{const.}    (7)

It is straightforward to show that for any set of values r_{ij}, setting q_j = (1/n) Σ_i r_{ij} minimizes (7). Substituting this expression into the cost function, we obtain the final expression

D(P^{\circ} \,\|\, Q^{\circ}) = \frac{1}{n} \sum_{i,j=1}^{n} r_{ij} \left[ \log \frac{r_{ij}}{\frac{1}{n}\sum_{i'} r_{i'j}} + \beta d(x_i, x_j) \right] + \mathrm{const.} = I(I; J) + \beta\, \mathrm{E}_{I,J}\, d(x_I, x_J) + \mathrm{const.},    (8)

where the first term is the mutual information between the random variables I and J under the distribution P° and the second term is the expected value of the pairwise distances under the same distribution on indices. The n² unknown values of r_{ij} lie on n separate n-dimensional simplices. These parameters have the same role as cluster responsibilities in soft k-means: they stand for the probability of data point x_i choosing data point x_j as its cluster center. The algorithm defined in (4) is in fact the standard Arimoto-Blahut algorithm [10] commonly used for solving problems of the form (8).

We established that the problem of maximizing the log-likelihood function (2) is equivalent to minimizing the objective function (8). This helps us to interpret this problem in the framework of rate-distortion theory. The data set can be thought of as an information source with a uniform distribution on the alphabet X. Such a source has entropy log n, which means that any scheme for encoding an infinitely long i.i.d. sequence generated by this source requires on average this number of bits per symbol, i.e., has a rate of at least log n. We cannot compress the information source beyond this rate without tolerating some distortion, when the original data points are mapped to other points with nonzero distances between them. We can then consider the r_{ij}'s as a probabilistic encoding of our data set onto itself with the corresponding average distortion D = E_{I,J} d(x_I, x_J) and rate I(I; J). A solution r_{ij} that minimizes (8) for some β yields the least rate that can be achieved having no more than the corresponding average distortion D. This rate is usually denoted by R(D), a function of average distortion, and is called the rate-distortion function [8]. Note that we have ∂R/∂D = −β, 0 < β < ∞, at any point on the graph of the rate-distortion function. The weight q_j for the data point x_j is a measure of how likely this point is to appear in the compressed representation of the data set, i.e., to be an exemplar. Here, we can rigorously quantify our intuitive idea that a higher number of clusters (corresponding to higher rates) is the inherent cost of attaining lower average distortion. We will see an instance of this rate-distortion trade-off in Section 5.

4 Implementation

The implementation of our algorithm costs two matrix-vector multiplications per iteration, that is, it has a complexity of order n² per iteration if solved with no approximations. Letting s_{ij} = exp(−β d(x_i, x_j)) and using two auxiliary vectors z and η, we obtain the simple update rules

z_i^{(t)} = \sum_{j=1}^{n} s_{ij} q_j^{(t)}, \qquad \eta_j^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \frac{s_{ij}}{z_i^{(t)}}, \qquad q_j^{(t+1)} = \eta_j^{(t)} q_j^{(t)},    (9)

where the initialization q_j^{(0)} is nonzero for all the data points we want to consider as possible exemplars. At the fixed point, the values of η_j are equal to 1 for all data points in the support of q and are less than 1 otherwise [10]. In practice, we compute the gap between max_j (log η_j) and Σ_j q_j log η_j in each iteration and stop the algorithm when this gap becomes less than a small threshold.
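As a concrete illustration, here is a minimal sketch of the updates in (9) together with the stopping rule just described. The function name, default tolerances, and the use of squared Euclidean divergences are our own choices; the pruning and sparsification refinements discussed below are omitted for clarity.

```python
import numpy as np

def exemplar_weights(D2, beta, gap_tol=1e-8, max_iter=10000):
    """Run the updates in (9) on a matrix of pairwise divergences D2[i, j] = d(x_i, x_j)."""
    n = D2.shape[0]
    S = np.exp(-beta * D2)                        # s_ij = exp(-beta d(x_i, x_j))
    q = np.full(n, 1.0 / n)                       # q_j^{(0)} > 0 for every candidate exemplar
    for _ in range(max_iter):
        z = S @ q                                 # z_i = sum_j s_ij q_j
        eta = (S / z[:, None]).mean(axis=0)       # eta_j = (1/n) sum_i s_ij / z_i
        q = eta * q                               # q_j <- eta_j q_j
        # Stop when the gap between max_j log eta_j and sum_j q_j log eta_j is small.
        if np.max(np.log(eta)) - np.dot(q, np.log(eta)) < gap_tol:
            break
    r = (q[None, :] * S) / (S @ q)[:, None]       # soft assignments, computed once at the end
    return q, r

# Example usage on vectorial data with squared Euclidean divergences:
# D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1); q, r = exemplar_weights(D2, beta)
```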
Note that the soft assignments r_{ij} = q_j s_{ij} / z_i need to be computed only once, after the algorithm has converged. Similar to other interior-point methods, the convergence of our algorithm becomes slow as we move close to the vertices of the probability simplex, where some q_j's are very small. In order to improve the convergence rate, after each iteration we identify all q_j's that are below a certain threshold (10^{-3}/n in our experiments), set them to zero, and re-normalize the distribution over the remaining indices. This effectively excludes the corresponding points as possible exemplars and reduces the cost of the following iterations.

Any value of β ∈ [0, ∞) yields a different solution to (8). Smaller values of β correspond to having wider clusters and greater values correspond to narrower clusters. Neither extreme, one assigning all data points to a single central exemplar and the other taking all data points as exemplars, is interesting. Thus, we should have a reference value β_o that identifies a reasonable range of β values. We chose an empirical value β_o = n² log n / Σ_{i,j} d(x_i, x_j) as an a priori estimate of the derivative of the rate-distortion function at a reasonable point on its graph. This value is the slope of a line connecting the following two (R, D) points: the case of sending any point to itself, with R(0) = log n, and the case of a completely random code r_{ij} = 1/n with zero rate and average distortion Σ_{i,j} d(x_i, x_j)/n². The corresponding line is illustrated in Figure 1 (left) for the specific example described in Section 5.

In order to further speed up the algorithm for larger data sets, we can search over the values of s_{ij} for each i and keep only the n_o largest values in each row, turning the proximity matrix into a sparse one. The reasoning is simply that we expect any point to be represented in the final solution by exemplars relatively close to it. We observed that as long as n_o is a few times greater than the expected number of data points in each cluster, the final results remain almost the same with or without this preprocessing. This approximation decreases the running time of the algorithm by a factor of n/n_o.

Figure 1: (Left) Rate-distortion function for the example described in the text. The line with slope −β_o is also illustrated for comparison (dotted line), as well as the point corresponding to β = β_o (cross) and the line tangent to the graph at that point. (Right) The exponential of the rate (dotted line) and the number of hard clusters for different values of β (solid line). The rate is bounded above by the logarithm of the number of clusters.

5 Experimental Results

To illustrate some general properties of our method, we apply it to the set of 400 random data points in R² shown in Figure 2. We use the squared Euclidean distance and run the algorithm for different values of β. Figure 1 (left) shows the resulting rate-distortion function for this example. As we expect, the estimated rate-distortion function is smooth, monotonically decreasing, and convex. To visualize the clustering results, we turn the soft responsibilities into hard assignments. Here, we first choose the set of exemplars to be the set of all indices j that are MAP-estimate exemplars for some data point i under P(j|x_i). Then, any point is assigned to its closest exemplar. The resulting numbers of hard clusters for different values of β are shown in Figure 1 (right).
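The rate-distortion curve in Figure 1 (left) is traced by sweeping β and evaluating the two terms of (8) at each converged solution. A small sketch of that evaluation (our own code, reusing the responsibilities r returned by the routine above):

```python
import numpy as np

def rate_and_distortion(R, D2):
    """Evaluate the two terms of objective (8) for responsibilities R[i, j] = r_ij.

    R:  (n, n) row-stochastic matrix of soft assignments.
    D2: (n, n) matrix of pairwise divergences d(x_i, x_j).
    Returns (rate in nats, average distortion).
    """
    n = R.shape[0]
    q = R.mean(axis=0)                  # marginal q_j = (1/n) sum_i r_ij
    joint = R / n                       # P(i, j) = (1/n) r_ij, with P(i) = 1/n uniform
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / ((1.0 / n) * q[None, :]), 1.0)
    rate = np.sum(joint * np.log(ratio))          # mutual information I(I; J)
    distortion = np.sum(joint * D2)               # E_{I,J} d(x_I, x_J)
    return rate, distortion
```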
Figure 1 (right) indicates two regions of β with relatively stable numbers of clusters, namely 4 and 10, while other cluster numbers have a more transitory nature with varying β. The distribution of data points in Figure 2 shows that these are reasonable choices of the number of clusters for this data set. However, we also observe some fluctuations in the number of clusters even in the more stable regimes of β. Comparing this behavior with the monotonicity of the rate shows how, by turning the soft assignments into hard ones, we lose the strong optimality guarantees we have for the original soft solution. Nevertheless, since our global optimum is the minimum of a well-justified cost function, we expect to obtain relatively good hard assignments. We further discuss this aspect of the formulation in Section 6. Figure 2 illustrates the shapes of the resulting hard clusters for different values of β. We can see how clusters split when we increase β. Such cluster-splitting behavior also occurs in the case of a Gaussian mixture model with unconstrained cluster centers and has been studied as phase transitions of a corresponding statistical system [9]. The nature of this connection remains to be further investigated.

The main motivation for developing a convex formulation of clustering is to avoid the well-known problem of local optima and sensitivity to initialization. We compare our method with a soft k-means algorithm, namely, a mixture model of the form (1) where f(x; m) is a Gaussian distribution and the problem is solved using the EM algorithm. We generate the data sets by drawing points from normal distributions centered around a set of vectors in R^50. The latter set is also generated with a Gaussian distribution around zero with standard deviation 1.5 or 2 in each dimension, depending on the experiment. First, we consider the case of mean-parameterized Gaussian models with a common variance σ² = 1/(2β), where β has the same value as the one we use in our method. In addition to the initial cluster centers and their corresponding weights, the number of clusters must be specified for the mixture model. In contrast, our method does not require a specified number of clusters and is independent of initialization.

Figure 2: The clusters found for different values of β: (a) 0.1β_o, (b) 0.5β_o, (c) β_o, (d) 1.2β_o, (e) 1.6β_o, (f) 1.7β_o. The exemplar data point of each cluster is denoted by a cross. The range of the normal distributions of the mixture model is illustrated by circles around these exemplar points with radius equal to the square root of the variance corresponding to the value of β used by the algorithm (σ = (2β)^{-1/2}). Shapes and colors denote cluster labels.

We use the following approach to compare the two methods. We run soft k-means 50 times with different random initializations and uniform weights, where k is set equal to the true number of clusters. Then we apply our algorithm to the same data set and choose the k highest resulting exemplar probabilities and the corresponding data points to initialize another round of soft k-means.
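For completeness, a sketch (ours) of the two post-processing steps used in these experiments: extracting hard clusters from the soft responsibilities via MAP exemplars followed by nearest-exemplar assignment, and selecting the k most probable exemplars to initialize soft k-means.

```python
import numpy as np

def hard_assignments(R, D2):
    """Hard clustering from soft responsibilities, as described above.

    R:  (n, n) responsibilities r_ij = P(j | x_i).
    D2: (n, n) pairwise divergences d(x_i, x_j).
    Returns (exemplar indices, cluster label of each point).
    """
    exemplars = np.unique(np.argmax(R, axis=1))              # j's that are MAP exemplars for some i
    labels = exemplars[np.argmin(D2[:, exemplars], axis=1)]  # assign each point to its closest exemplar
    return exemplars, labels

def top_k_exemplars(q, k):
    """Indices of the k highest exemplar probabilities, used to initialize soft k-means."""
    return np.argsort(q)[::-1][:k]
```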
Note that the global minimum of our problem is also a local optimum of a similar unconstrained mean-parameterized cost function as long as all of our resulting exemplars are considered. Since we ignore the real support of q_j and select only some subset of exemplars, the solution can be improved in practice with a few extra steps of the EM algorithm. Figure 3 compares the final log-likelihood values for the Gaussian mixture model found with the above procedure for varying values of the parameter σ² (i.e., 1/(2β)). The two data sets are generated around 50 mean vectors (i.e., k = 50) with 250 or 300 data points in each cluster. The single run of soft k-means initialized with our exemplars yields results almost as good as, or even better than, the best of the other runs for this large data set. This example demonstrates how our simplification of the objective function helps us move from a random-search approach to a more deterministic one with satisfactory final results.

Figure 3: Resulting log-likelihood values of all fifty runs of soft k-means with random initializations (scattered points) and of the run initialized with the highest exemplar probabilities of our method (solid line) for (Left) 12,500 and (Right) 15,000 data points in 50 clusters generated as described in the text.

Next, we compare our method with a fully parameterized Gaussian model where the variance σ_k² of each Gaussian component is also estimated in the maximization step of EM. We set the number of clusters to k, the number of actual generating Gaussians, and run the EM algorithm 50 times with random initializations. We select the result with maximum log-likelihood as the final solution of k-means. The resulting soft assignments are then turned into hard clusters by taking the highest responsibility for any data point as its corresponding cluster assignment. We run our algorithm only once with parameter β = β_o, take the k indices j with the highest q_j, and assign each data point to its closest exemplar. As the measure of clustering quality for these hard assignments, we define the distortion as Σ_i ||x_i − m_{c_i}||², where m_l is the mean of all data points assigned to cluster l and c_i is the cluster assignment of data point i. Again, we generate 50 points in R^50 to be the centers of the normal distributions generating the actual data sets. Different data sets are then formed by randomly drawing different numbers of data points in each cluster. Figure 4 (left) shows the resulting distortions of our method and of the full mixture model with variance estimation, where our method outperforms the best runs of the mixture model. In another experiment, we change the number of clusters (each containing 100 data points) and repeat the previous procedure, comparing our method with the best of 200 runs of the mean- and variance-parameterized Gaussian mixture model. Figure 4 (right) clearly indicates that with an increasing number of clusters, our performance significantly improves compared to the regular mixture model method. This is further evidence that for larger data sets, the less precise nature of our constrained search, compared to the full mixture models, is well compensated by its ability to always find its global optimum. In general, the value of β should be tuned around β_o to find the desired solution. We plan to develop a more systematic way of choosing β.
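A small sketch (our own) of the distortion measure used above to compare hard clusterings:

```python
import numpy as np

def clustering_distortion(X, labels):
    """Distortion sum_i ||x_i - m_{c_i}||^2, where m_l is the mean of the points in cluster l."""
    total = 0.0
    for l in np.unique(labels):
        members = X[labels == l]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total
```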
6 Discussion and Related Work

Since only the distances take part in our formulation and the values of the data point vectors are not required, we can extend this method to any proximity data. Given a matrix D_{n×n} = [d_{ij}] that describes the pairwise symmetric or asymmetric dissimilarities between data points, we can replace the d(x_i, x_j)'s in (8) with the d_{ij}'s and solve the same minimization problem, whose convexity can be directly verified. The algorithm works in exactly the same way, and all the aforementioned properties carry over to this case as well.

A previous application of rate-distortion theoretic ideas in clustering led to deterministic annealing (DA). In order to avoid local optima, DA gradually decreases an annealing parameter, tightening the bound on the average distortion [9]. However, at each temperature the same standard EM updates are used. Consequently, the method does not provide strong guarantees on the global optimality of the resulting solution.

Affinity propagation is another recent exemplar-based clustering algorithm. It finds the exemplars by forming a factor graph and running a message-passing algorithm on the graph as a way to minimize the clustering cost function [6]. If the data point i is represented by the data point c_i, then, assuming a common preference parameter value λ for all data points, the objective function of affinity propagation can be stated as Σ_i d_{i c_i} + λk, where k is the number of found clusters. This last term is needed to put some cost on picking any point as an exemplar, to prevent the trivial case of sending every point to itself. Outstanding results have been reported for affinity propagation [6], but theoretical guarantees on its convergence or optimality are yet to be established.

Figure 4: Comparison of the resulting distortion from the hard clustering assignments found by our method (solid line) and the best result of the Gaussian mixture model (dotted line) for (Left) different numbers of data points per cluster and (Right) different numbers of clusters.

We can interpret our algorithm as a relaxation of this combinatorial problem to the soft-assignment case, allowing probabilities P(c_i = j) = r_{ij} of associating point i with an exemplar j. The marginal distribution q_j = (1/n) Σ_i r_{ij} is the probability that point j is an exemplar. In order to use analytical tools for solving this problem, we have to turn the regularization term λk into a continuous function of the assignments. A possible choice might be H(q), the entropy of the distribution q_j, which is bounded above by log k. However, the entropy function is concave, and any local or global minimum of a concave minimization problem over a simplex occurs at an extreme point of the feasible domain, which in our case corresponds to the original combinatorial hard assignments [11]. In contrast, using the mutual information I(I; J) induced by r_{ij} as the regularizing term turns the problem into a convex problem. Mutual information is convex and serves as a lower bound on H(q) since it is always less than the entropy of either of its random variables. Now, by letting λ = 1/β, we arrive at our cost function in (8). We can therefore see that our problem is a convex relaxation of the original combinatorial problem.
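To summarize the relaxation argument in one place, here is a sketch of the chain of objectives (our own rendering of the discussion above; λ denotes the common preference cost, and the last step simply rescales by β = 1/λ):

```latex
\begin{align*}
&\text{combinatorial objective:}\quad
  \min_{c_1,\dots,c_n}\ \frac{1}{n}\sum_{i=1}^{n} d(x_i, x_{c_i}) + \lambda\, k,
  \qquad k = \bigl|\{c_1,\dots,c_n\}\bigr|, \\
&\text{soft relaxation:}\quad
  r_{ij} = P(c_i = j),\qquad
  q_j = \frac{1}{n}\sum_{i} r_{ij},\qquad
  I(I;J) \;\le\; H(q) \;\le\; \log k, \\
&\text{convex surrogate:}\quad
  \min_{r}\ \mathbb{E}_{I,J}\, d(x_I, x_J) + \lambda\, I(I;J)
  \;\;\Longleftrightarrow\;\;
  \min_{r}\ I(I;J) + \beta\, \mathbb{E}_{I,J}\, d(x_I, x_J),
  \qquad \lambda = 1/\beta .
\end{align*}
```

The right-hand problem in the last line is exactly objective (8).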
In conclusion, we proposed a framework for constraining the search space of general mixture models to achieve global optimality of the solution. In particular, our method promises to be useful in problems with large data sets where regular mixture models fail to yield consistent results due to their sensitivity to initialization. We also plan to further investigate generalizations of this idea to models with more elaborate parameterizations.

References

[1] J. Puzicha, T. Hofmann, and J. M. Buhmann, "Theory of proximity based clustering: Structure detection by optimization," Pattern Recognition, Vol. 33, No. 4, pp. 617–634, 2000.
[2] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, Vol. 14, 2001.
[3] I. Csiszár and P. Shields, "Information Theory and Statistics: A Tutorial," Foundations and Trends in Communications and Information Theory, Vol. 1, No. 4, pp. 417–528, 2004.
[4] M. Meila and D. Heckerman, "An Experimental Comparison of Model-Based Clustering Methods," Machine Learning, Vol. 42, No. 1-2, pp. 9–29, 2001.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[6] B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science, Vol. 315, No. 5814, pp. 972–976, 2007.
[7] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," Journal of Machine Learning Research, Vol. 6, pp. 1705–1749, 2005.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York, Wiley, 1991.
[9] K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, Vol. 86, No. 11, pp. 2210–2239, 1998.
[10] R. E. Blahut, "Computation of Channel Capacity and Rate-Distortion Functions," IEEE Transactions on Information Theory, Vol. IT-18, No. 4, pp. 460–473, 1972.
[11] P. M. Pardalos and J. B. Rosen, "Methods for Global Concave Minimization: A Bibliographic Survey," SIAM Review, Vol. 28, No. 3, pp. 367–379, 1986.