Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?

Nguyen Xuan Vinh    n.x.vinh@unsw.edu.au
Julien Epps    j.epps@unsw.edu.au
The University of New South Wales, Sydney, Australia & ATP Laboratory, National ICT Australia (NICTA)

James Bailey    jbailey@csse.unimelb.edu.au
The University of Melbourne, Vic. 3010, Australia & Victoria Research Laboratory, National ICT Australia

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of a correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to show larger variation when the ratio between the number of data points and the number of clusters is small. A similar effect is present in some other, non-information theoretic based measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Some examples are given to demonstrate the need for, and usefulness of, the adjusted measures.

1. Introduction

Clustering is the "art" of dividing data points in a data set into meaningful groups. The usefulness of this technique has been proven through its widespread application in virtually all fields of science. Over the past few decades, there have been thousands of papers proposing hundreds of clustering algorithms. The endeavor to find better clustering methods will hardly be feasible without the development of effective measures for clusterings comparison, an open research area which has also received much attention. Various clustering comparison measures have been proposed: besides the class of pair-counting based measures, including the well-known Adjusted Rand Index (Hubert & Arabie, 1985), and the class of set-matching based measures, such as the H criterion (Meilă, 2005), information theoretic based measures, such as the Mutual Information (Strehl & Ghosh, 2002) and the Variation of Information (Meilă, 2005), form another fundamental class of clustering comparison measures.

In this paper, we aim to improve the usability of the class of information theoretic based measures for comparing clusterings. We first observe that such measures either do not have a fixed bound, or do not have a constant baseline value, i.e. average value between random partitions of a data set. Since a measure is meant to provide a comparison mechanism, it is generally preferable that it lie within a predetermined range and have a constant baseline value, so as to facilitate comparison and enhance intuitiveness. For information theoretic based measures, the former has often been accomplished through a normalization scheme, e.g. division by the maximum value of the index, while the latter, i.e. baseline adjustment, has to our knowledge not been addressed. As will be seen shortly, unadjusted information theoretic based measures have a considerable inherent bias attributable solely to chance, which potentially reduces their usefulness in a number of common situations, such as measuring the distance from a set of clusterings with different numbers of clusters to a "true" clustering. In this paper, by assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. While the assumption of a well-specified randomness model is needed for the theoretical analysis, our experimental results suggest that the adjusted measures also work well in more realistic scenarios where such an assumption is usually violated.

The paper begins by reviewing some well-known clustering comparison measures in Section 2, and discussing the need for baseline adjustment of the class of information theoretic based measures in Section 3. The derivation of the adjusted measures is presented in Section 4. Selected demonstrations of the new measures are given in Section 5, while Section 6 concludes the paper.
2. Background and Related Work

Let S be a set of N data points {s_1, s_2, ..., s_N}. We consider the case of hard clustering. Given two clusterings of S, namely U = {U_1, U_2, ..., U_R} with R clusters and V = {V_1, V_2, ..., V_C} with C clusters (\bigcap_{i=1}^{R} U_i = \bigcap_{j=1}^{C} V_j = \emptyset and \bigcup_{i=1}^{R} U_i = \bigcup_{j=1}^{C} V_j = S), the information on cluster overlap between U and V can be summarized in the form of an R x C contingency table M = [n_{ij}], i = 1...R, j = 1...C, as illustrated in Table 1, where n_{ij} denotes the number of objects that are common to clusters U_i and V_j. Based on this contingency table, various cluster similarity indices can be built.

Table 1. The contingency table, n_{ij} = |U_i \cap V_j|

U \ V    V_1      V_2      ...    V_C     | Sums
U_1      n_11     n_12     ...    n_1C    | a_1
U_2      n_21     n_22     ...    n_2C    | a_2
...      ...      ...      ...    ...     | ...
U_R      n_R1     n_R2     ...    n_RC    | a_R
Sums     b_1      b_2      ...    b_C     | \sum_{ij} n_{ij} = N

2.1. Indices based on Pair Counting

An important class of criteria for comparing clusterings is based upon counting the pairs of points on which two clusterings agree or disagree. Any pair of data points, out of the \binom{N}{2} distinct pairs in S, falls into one of the following four categories: (1) N_{11}, the number of pairs that are in the same cluster in both U and V; (2) N_{00}, the number of pairs that are in different clusters in both U and V; (3) N_{01}, the number of pairs that are in the same cluster in U but in different clusters in V; and (4) N_{10}, the number of pairs that are in different clusters in U but in the same cluster in V. Explicit formulae for the four counts can be constructed from the entries of the contingency table (Hubert & Arabie, 1985), e.g. N_{11} = \frac{1}{2}\sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}(n_{ij} - 1). Intuitively, N_{11} and N_{00} can be used as indicators of agreement between U and V, while N_{01} and N_{10} can be used as disagreement indicators. A well-known index of this class is the Rand Index (Rand, 1971), defined straightforwardly as:

RI(U, V) = (N_{00} + N_{11}) / \binom{N}{2}    (1)

The Rand Index lies between 0 and 1. It takes the value 1 when the two clusterings are identical, and 0 when no pair of points appears either in the same cluster or in different clusters in both clusterings, i.e. N_{00} = N_{11} = 0. This happens only when one clustering consists of a single cluster while the other consists only of singleton clusters; such a scenario is quite extreme and has little practical value. In fact, it is desirable for the similarity index between two random partitions to take values close to zero, or at least a constant value. The problem with the Rand Index is that its expected value between two random partitions does not even take a constant value. Hubert and Arabie (1985), taking the generalized hypergeometric distribution as the model of randomness, i.e. the two partitions are picked at random subject to having the original number of classes and objects in each, found the expected value of (N_{00} + N_{11}). They suggested using a corrected version of the Rand Index of the form:

Adjusted Index = (Index - Expected Index) / (Max Index - Expected Index)    (2)

thus giving birth to the (Hubert and Arabie) Adjusted Rand Index (ARI):

ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]/\binom{N}{2}}    (3)

where the n_{ij} are entries of the contingency table and the a_i, b_j are its marginal sums. The ARI is bounded above by 1 and takes the value 0 when the index equals its expected value (under the generalized hypergeometric distribution assumption for randomness). Besides the Adjusted Rand Index, there are many other, possibly less popular, measures in this class. Albatineh et al. (2006) discussed the correction for chance for a comprehensive list of 28 different indices in this class, a number large enough to make the task of choosing an appropriate measure difficult and confusing. Their work, and the subsequent extension by Warrens (2008), however, showed that after correction for chance many of these measures become equivalent, which facilitates the task of choosing a measure.
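To make the pair-counting formulae concrete, the following is a minimal Python sketch (our illustration, not the authors' code) that evaluates the Rand Index of Eq. (1) and the Adjusted Rand Index of Eqs. (2)-(3) directly from a contingency table such as Table 1; the function name and the small example table are assumptions made purely for illustration.

```python
from math import comb

def pair_counting_indices(table):
    """Rand Index (Eq. 1) and Adjusted Rand Index (Eq. 3) from a contingency table
    with table[i][j] = n_ij = |U_i intersect V_j| (Table 1)."""
    a = [sum(row) for row in table]              # row sums a_i
    b = [sum(col) for col in zip(*table)]        # column sums b_j
    n = sum(a)                                   # total number of points N
    total_pairs = comb(n, 2)

    n11 = sum(comb(nij, 2) for row in table for nij in row)   # pairs together in both clusterings
    same_u = sum(comb(ai, 2) for ai in a)                      # pairs together in U
    same_v = sum(comb(bj, 2) for bj in b)                      # pairs together in V
    n00 = total_pairs - same_u - same_v + n11                  # pairs separated in both clusterings
    ri = (n00 + n11) / total_pairs                             # Eq. (1)

    expected = same_u * same_v / total_pairs     # expected pair agreement under the hypergeometric model
    max_index = 0.5 * (same_u + same_v)
    ari = (n11 - expected) / (max_index - expected)            # Eqs. (2)-(3)
    return ri, ari

print(pair_counting_indices([[2, 1], [0, 3]]))   # two clusterings of 6 points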
2.2. Information Theoretic based Indices

Another class of clustering comparison measures, which are information theoretic based, has also been employed more recently in the clustering literature (Banerjee et al., 2005; Strehl & Ghosh, 2002; Meilă, 2005). Although there is currently no consensus on which is the best measure, information theoretic based measures have received increasing attention for their strong theoretical background. Let us first review some of the fundamental concepts of information theory (Cover & Thomas, 1991) and then see how those concepts might be used toward assessing the agreement between clusterings.

Definition 2.1 The information entropy of a discrete random variable X, which can take on possible values in its domain \mathcal{X} = {x_1, x_2, ..., x_n}, is defined by:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)    (4)

Definition 2.2 The mutual information between two random variables X and Y with respective domains \mathcal{X} and \mathcal{Y} is defined by:

I(Y, X) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(y, x) \log \frac{p(y, x)}{p(x) p(y)}    (5)

The mutual information is a symmetric measure that quantifies the mutual dependence between two random variables, or the information that X and Y share. It measures how much knowing one of these variables reduces our uncertainty about the other. This property suggests that the mutual information can be used to measure the information shared by two clusterings and thus to assess their similarity. For this purpose, we need to put the clusterings in a statistical context.

Let us first define the entropy of a clustering U. Suppose that we pick an object at random from S; the probability that the object falls into cluster U_i is then:

P(i) = |U_i| / N    (6)

We define the entropy associated with the clustering U as:

H(U) = -\sum_{i=1}^{R} P(i) \log P(i)    (7)

H(U) is non-negative and takes the value 0 only when there is no uncertainty in determining an object's cluster membership, i.e. when there is only one cluster. Similarly, the entropy of the clustering V can be calculated as H(V) = -\sum_{j=1}^{C} P(j) \log P(j), where P(j) = |V_j| / N. We now arrive at the Mutual Information (MI) between two clusterings:

I(U, V) = \sum_{i=1}^{R} \sum_{j=1}^{C} P(i, j) \log \frac{P(i, j)}{P(i) P(j)}    (8)

where P(i, j) denotes the probability that a point belongs to cluster U_i in U and to cluster V_j in V:

P(i, j) = |U_i \cap V_j| / N    (9)

MI is a non-negative quantity upper bounded by both entropies H(U) and H(V), i.e. I(U, V) \le \min\{H(U), H(V)\}. It quantifies the information shared by the two clusterings and can thus be employed as a clustering similarity measure, as in (Banerjee et al., 2005).
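As a companion to Eqs. (6)-(9), the short sketch below (again our illustration, with natural logarithms assumed since the text does not fix a base) computes the two cluster entropies and the mutual information from the same contingency table representation.

```python
import numpy as np

def clustering_entropy(p):
    p = p[p > 0]                          # 0 log 0 is taken as 0
    return float(-np.sum(p * np.log(p)))

def mutual_information(table):
    n_ij = np.asarray(table, dtype=float)
    N = n_ij.sum()
    p_ij = n_ij / N                       # P(i, j), Eq. (9)
    p_i = p_ij.sum(axis=1)                # P(i), Eq. (6): marginal of U
    p_j = p_ij.sum(axis=0)                # marginal of V
    mask = p_ij > 0
    mi = float(np.sum(p_ij[mask] * np.log(p_ij[mask] / np.outer(p_i, p_j)[mask])))  # Eq. (8)
    return mi, clustering_entropy(p_i), clustering_entropy(p_j)   # I(U,V), H(U) (Eq. 7), H(V)

mi, h_u, h_v = mutual_information([[2, 1], [0, 3]])
print(mi, h_u, h_v, mi <= min(h_u, h_v))   # MI is bounded above by both entropies
```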
Meilă (2005) suggested using the so-called Variation of Information (VI), which she proved to be a true metric on the space of clusterings:

VI(U, V) = H(U) + H(V) - 2 I(U, V)    (10)

Strehl and Ghosh (2002), on the other hand, employed a normalized version of the Mutual Information, defined as:

NMI(U, V) = \frac{I(U, V)}{\sqrt{H(U) H(V)}}    (11)

The VI is lower bounded by 0 (attained when the two clusterings are identical) and is always upper bounded by log(N), though tighter bounds are achievable depending on the number of clusters (Meilă, 2005). The Normalized Mutual Information (NMI), on the other hand, has a fixed lower bound of 0 and upper bound of 1. It takes the value 1 when the two clusterings are identical and 0 when the two clusterings are independent, i.e. share no information about each other. In the latter case, the contingency table takes the form of the so-called "independence table", in which n_{ij} = |U_i||V_j|/N for all i, j. This scenario is quite intuitive (larger clusters are expected to share more data points), and is less extreme than the one in which the Rand Index takes a zero value, described earlier, where one clustering contains only a single cluster while the other contains only singleton clusters.
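A hedged sketch of Eqs. (10)-(11) follows, computing the VI and the geometric-mean NMI from two label vectors; it reuses scikit-learn's mutual_info_score (which works in nats) purely for brevity, and the helper names are ours, not the paper's.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def label_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def vi_and_nmi(u, v):
    h_u, h_v = label_entropy(u), label_entropy(v)
    mi = mutual_info_score(u, v)           # I(U, V) in nats
    vi = h_u + h_v - 2.0 * mi              # Variation of Information, Eq. (10)
    nmi = mi / np.sqrt(h_u * h_v)          # Normalized Mutual Information, Eq. (11)
    return vi, nmi

u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(vi_and_nmi(u, v))
print(vi_and_nmi(u, u))   # identical clusterings: VI = 0, NMI = 1
```

The sketch assumes both clusterings contain at least two clusters; if either entropy is zero, the denominator of Eq. (11) vanishes and the NMI is undefined.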
3. Information Theoretic Measures: is a Correction for Chance Necessary?

Depending upon the specific type of application, it might be preferable for a clustering comparison measure to have fixed bounds. This is accomplished with a normalization scheme such as (11) for the Normalized Mutual Information. The baseline value of such a measure, attributable solely to chance agreement, is also of interest. For information theoretic based measures, the latter has received less attention. Let us consider the following two motivating examples.

1) Example 1 - Distance to a "true" clustering: given a set of N data points, let a clustering U with R clusters be the "true" clustering. Suppose one algorithm generates a clustering V with C clusters, and another generates a clustering V' with C' clusters. We need to assess the goodness of the two clustering algorithms, that is, to find out whether V or V' is closer to the true clustering U. If C = C', the situation would be quite simple: since the setting is essentially the same for both V and V', we expect the comparison to be "fair" under any particular measure. However, if C ≠ C', the situation would be more complicated. If a measure without fixed bounds, such as the Variation of Information (VI), were employed, its upper bound would not be the same for VI(U, V) and VI(U, V'), and the conclusion that a VI value of, say, 0.3 is better than a VI value of 0.5 might be misleading without knowing the respective upper bounds. In this context, it is preferable to employ a normalized measure such as the Normalized Mutual Information (NMI), with fixed bounds 0 and 1.

The NMI, however, is not totally problem-free. We construct a small experiment as follows: consider a set of N data points, let the number of clusters K vary from 2 to K_{max}, and suppose that the true clustering has K_{true} = [K_{max}/2] clusters. Now, for each value of K, generate 10,000 random, independent clustering solutions (by assigning each data point to a random cluster with equal probability), and calculate the average NMI (of the form given by (11)), Rand Index (RI) and Adjusted Rand Index (ARI) between those clusterings and the true clustering. The results for various combinations of (N, K_{true}) are given in Figure 1. It can be observed that the unadjusted measures, the RI and the NMI, exhibit a monotonically increasing pattern as K increases. For the NMI, the variation is markedly more visible at smaller values of the ratio N/K. Thus, even when clusterings are selected totally at random, a 7-cluster solution would have a greater chance of defeating a 3-cluster solution, although there is no difference in the clustering generation methodology. A measure which has been corrected for chance, such as the Adjusted Rand Index, on the other hand, has a baseline value always close to zero and appears not to be biased in favor of any particular value of K. Thus, for this example, an adjusted version of the Mutual Information with a correction for chance will be necessary for our purpose.

Figure 1. Average distance between sets of random clusterings and a "true" clustering: average NMI, ARI and RI versus the number of clusters K, for (N, K_{true}) = (30, 5), (100, 5), (500, 10) and (2000, 25).

Figure 2. Average pairwise distance within a set of random clusterings, each with the same number of clusters K: average NMI, ARI and RI versus K, for N = 30, 100, 500 and 2000 data points.
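The baseline effect of Figure 1 is easy to reproduce in miniature. The sketch below is our illustration, not the paper's experimental code: it uses only 200 random clusterings per K (the paper uses 10,000), draws the "true" labels uniformly at random as a stand-in for an unspecified true clustering, and relies on scikit-learn's NMI (geometric normalization, matching Eq. (11)) and ARI.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

rng = np.random.default_rng(0)
N, K_true, K_max, trials = 100, 5, 10, 200
true = rng.integers(K_true, size=N)            # a "true" clustering with (up to) K_true clusters

for K in range(2, K_max + 1):
    nmi, ari = [], []
    for _ in range(trials):
        rand = rng.integers(K, size=N)         # assign each point to one of K clusters uniformly
        nmi.append(normalized_mutual_info_score(true, rand, average_method='geometric'))
        ari.append(adjusted_rand_score(true, rand))
    print(f"K={K:2d}  mean NMI={np.mean(nmi):.3f}  mean ARI={np.mean(ari):+.3f}")
```

A run of this kind should reproduce the qualitative pattern of Figure 1: the average NMI drifts upward as K grows, while the average ARI stays near zero.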
2) Example 2 - Determining the number of clusters via Consensus Clustering: we start by providing some background on Consensus Clustering. In an era where a huge number of clustering algorithms exist, the Consensus Clustering idea (Monti et al., 2003; Strehl & Ghosh, 2002; Yu et al., 2007) has recently received increasing interest. Consensus Clustering is not just another clustering algorithm: rather, it provides a framework for unifying the knowledge obtained from the other algorithms. Given a data set and a single clustering algorithm or a set of clustering algorithms, Consensus Clustering employs the clustering algorithm(s) to generate a set of clustering solutions on either the original data set or perturbed versions of it. From those clustering solutions, Consensus Clustering aims to choose a robust, high quality representative clustering. Although the main objective of Consensus Clustering is to discover a high quality cluster structure in a data set, closer inspection of the set of clusterings obtained can often give valuable information about the appropriate number of clusters present. More specifically, we empirically observe that the set of clusterings obtained when the specified number of clusters coincides with the true number of clusters tends to be less diverse, an indication of the robustness of the obtained cluster structure. To quantify this diversity we have recently developed a novel index (Vinh & Epps, 2009), namely the Consensus Index (CI), which is built upon a suitable clustering similarity measure. Given a value of K, suppose we have generated a set of B clustering solutions U_K = {U_1, U_2, ..., U_B}, each with K clusters. We define the Consensus Index (CI) of U_K as:

CI(U_K) = ...
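The definition of CI is cut off at this point in the excerpt, so the sketch below is only a guess at its spirit rather than the paper's formula: it averages a pairwise similarity (here scikit-learn's ARI, chosen purely as a placeholder) over all B(B-1)/2 pairs of clusterings in U_K, so that a less diverse set of clusterings scores higher. Both the averaging scheme and the choice of similarity measure are assumptions.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def average_pairwise_agreement(clusterings):
    """Average a pairwise similarity over all pairs in a set of clusterings (hypothetical helper)."""
    pairs = list(combinations(clusterings, 2))
    return sum(adjusted_rand_score(u, v) for u, v in pairs) / len(pairs)

U_K = [[0, 0, 1, 1, 2, 2],
       [0, 0, 1, 1, 2, 2],
       [0, 1, 1, 2, 2, 0]]
print(average_pairwise_agreement(U_K))   # higher values indicate a less diverse (more consensual) set
```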