OMES: A New Evaluation Strategy Using Optimal Matching for Document Clustering

Xiaojun Wan
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
wanxiaojun@icst.pku.edu.cn

ABSTRACT
Existing measures for evaluating clustering results (e.g. F-measure) tend to overestimate cluster quality because they adopt a greedy matching between classes (reference clusters) and clusters (system clusters) that allows multiple classes to correspond to the same cluster, which is in fact only a locally optimal solution. This paper proposes a new evaluation strategy that overcomes this limitation by using optimal matching from graph theory. A weighted bipartite graph is built with classes and clusters as the two disjoint sets of vertices, and the edge weight between any class and any cluster is computed with a basic metric. The total weight of the optimal matching in this graph is then used to evaluate the quality of the clusters. The optimal matching allows only one-to-one correspondences between classes and clusters, so a globally optimal solution can be achieved. A preliminary study demonstrates the effectiveness of the proposed evaluation strategy.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - clustering
General Terms: Measurement, Performance, Design
Keywords: OMES, Evaluation Measure, Document Clustering, Optimal Matching

Document clustering has been extensively investigated in a number of different areas of text mining and information retrieval. The aim of document clustering is to group documents into clusters such that the documents within a cluster are similar and share the same characteristics. A variety of clustering algorithms have been proposed to achieve this goal, such as k-means, agglomerative clustering, divisive clustering, and normalized cut [2, 5]. Different clustering algorithms lead to different clustering results, and specific measures (e.g. F-measure, entropy) [2, 5] are designed to evaluate the quality of the produced clusters. Usually, the automatically produced clusters (denoted "clusters") are compared with the manually labeled clusters (denoted "classes"); the more similar the clusters are to the classes, the better the quality of the clusters. In this study, we take F-measure as an illustrative example. We first discuss its limitations and then propose a novel evaluation strategy using optimal matching to address them. A preliminary study demonstrates the effectiveness of the proposed strategy.

1. F-MEASURE AND ITS LIMITATIONS
F-measure is one of the most popular measures for evaluating clustering results; it combines the precision and recall ideas from information retrieval [1], and we therefore use it as the illustrative example in this study. Assume we have p classes and q clusters, where q is usually equal to p. We treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query, and we then calculate the recall and precision of that cluster for each given class. More specifically, for class i (1 ≤ i ≤ p) and cluster j (1 ≤ j ≤ q):

Precision(i,j) = n_ij / n_j
Recall(i,j) = n_ij / n_i

where n_ij is the number of members of class i in cluster j, n_j is the number of members of cluster j, and n_i is the number of members of class i. The F-measure of class i and cluster j is then given by

F(i,j) = 2 * Recall(i,j) * Precision(i,j) / (Precision(i,j) + Recall(i,j))

For an entire clustering, the F-measure of a class is the maximum value it attains at any cluster, and an overall F-measure is computed as the weighted average of these per-class values:

F = \sum_{i=1}^{p} \frac{n_i}{n} \max_{j} F(i,j)    (1)

where the max is taken over all clusters and n is the total number of documents in the set. The larger the F-measure is, the better the cluster quality is.
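As a concrete illustration (this sketch is ours, not part of the original poster), the conventional greedy F-measure can be computed from two parallel label lists as follows; the function name and the input format are our own assumptions.

```python
from collections import Counter

def greedy_f_measure(classes, clusters):
    """Conventional clustering F-measure: each class is independently
    matched to the cluster that maximizes F(i, j)."""
    n = len(classes)
    class_sizes = Counter(classes)              # n_i
    cluster_sizes = Counter(clusters)           # n_j
    overlap = Counter(zip(classes, clusters))   # n_ij

    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)                 # greedy maximum over clusters
        total += (n_i / n) * best               # weighted average, Eq. (1)
    return total

# Toy example: cluster "a" gives the highest F value for both classes,
# so the same cluster is counted twice.
classes  = ["c1", "c1", "c1", "c2", "c2"]
clusters = ["a",  "a",  "b",  "a",  "a"]
print(round(greedy_f_measure(classes, clusters), 3))   # -> 0.61
```

Note that in this toy example both classes are matched to cluster "a"; this multiple-to-one behaviour is exactly the issue discussed next.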
Though the above F-measure has been widely used for evaluating data clustering in much previous work, it has the following limitations. The first limitation is that it finds the cluster with the maximum F-measure value for each class in a greedy way, which may let the same cluster correspond to more than one class, thus allowing a multiple-to-one matching between classes and clusters. Our intuition is that one cluster should be matched to only one class when comparing classes and clusters, which is what we actually expect from a clustering result. The second limitation is that it finds the cluster with the maximum F-measure value for each class without considering the other possible matchings between classes and clusters, and may therefore overestimate cluster quality by using a locally optimal solution to evaluate the whole clustering. Our intuition is that the comparison between classes and clusters should be based on a globally optimal solution over the whole clustering.

2. THE PROPOSED OMES
To address the limitations of F-measure, we propose a novel evaluation strategy that formalizes the mapping between classes and clusters as the optimal matching problem in graph theory and allows only one-to-one matching between classes and clusters. A globally optimal solution can then be achieved by solving the optimal matching problem.

Given p classes X = {x_1, x_2, ..., x_p} and q clusters Y = {y_1, y_2, ..., y_q}, the pairwise F-measure value F(i,j) is computed between every class x_i in X and every cluster y_j in Y. If classes and clusters are treated as nodes, we can build a bipartite graph G = {X, Y, E}, where V = X ∪ Y is the vertex set and E = {e_ij} is the edge set, with each edge e_ij linking a class x_i and a cluster y_j. The weight w_ij associated with edge e_ij is computed using the basic F-measure:

w_{ij} = F(i,j)    (2)

Note that, in addition to F-measure, other basic measures (e.g. entropy) can also be used to compute the edge weights.

Given the weighted bipartite graph G, the Kuhn-Munkres algorithm (also known as the Hungarian method) [3, 4] is employed to solve the optimal matching problem. Optimal matching (OM) and maximal matching (MM) are classical problems in graph theory. A matching M of G is a subset of the edges with the property that no two edges of M share the same node. Given the graph G, MM finds a matching M that has as many edges as possible; OM is an extension of MM and finds the matching M with the largest total weight (i.e. the sum of the weights of all edges in M). The computational complexity of the Kuhn-Munkres algorithm is O((p+q)^3). Finally, the optimal matching \tilde{M} in the graph G is acquired, and its normalized total weight is used as the performance value for the clusters:

OMES = \sum_{e_{ij} \in \tilde{M}} \frac{n_i}{n} w_{ij}    (3)

Figure 1 gives an example to illustrate the class-to-cluster matchings in F-measure and OMES, where x1, x2, x3 are three classes, y1, y2, y3 are three clusters, and w_ij denotes the basic F-measure value between class x_i and cluster y_j:

w11 = 0.5; w12 = 0.4; w13 = 0.3
w21 = 0.7; w22 = 0.2; w23 = 0.4
w31 = 0.3; w32 = 0.8; w33 = 0.5

Figures 1(b) and 1(c) show the matchings between classes and clusters under F-measure and OMES, respectively.

[Figure 1. An illustration example for F-measure and OMES: (a) the weighted bipartite graph; (b) the greedy matching in F-measure; (c) the optimal matching in OMES.]
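A minimal sketch of the OMES computation is given below (ours, not from the poster). It uses SciPy's linear_sum_assignment as an off-the-shelf Kuhn-Munkres solver; the function name, the use of the Figure 1 weights, the equal class sizes, and the choice to find the matching on the raw weights w_ij before applying the n_i/n normalization of Eq. (3) are our own assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def omes(weights, class_sizes):
    """OMES sketch: find the maximum-weight one-to-one matching between
    classes (rows) and clusters (columns), then return the size-weighted
    total weight of the matched edges, as in Eq. (3)."""
    weights = np.asarray(weights, dtype=float)
    n = float(sum(class_sizes))
    # linear_sum_assignment minimizes cost, so negate the weights to maximize
    rows, cols = linear_sum_assignment(-weights)
    return sum(class_sizes[i] / n * weights[i, j] for i, j in zip(rows, cols))

# Weights from Figure 1; the three classes are assumed to have equal size.
W = [[0.5, 0.4, 0.3],
     [0.7, 0.2, 0.4],
     [0.3, 0.8, 0.5]]
print(round(omes(W, [10, 10, 10]), 3))   # -> 0.6 (x1-y3, x2-y1, x3-y2)
```

With these weights, the greedy F-measure of Section 1 would match both x1 and x2 to y1 and report (0.5 + 0.7 + 0.8)/3 ≈ 0.67, higher than the one-to-one optimum of 0.6; this is the overestimation that OMES is designed to avoid.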
3. PRELIMINARY STUDY
The evaluation of the proposed OMES is based on the assumption that an automatic evaluation measure should correlate highly, positively, and consistently with human assessments, so that whenever a human recognizes a good clustering result, the automatic measure will, with high probability, do the same. In the preliminary study, we randomly chose 10 topics from the TDT3 corpus [6], and for each topic (class) we randomly selected 10 documents, giving 100 documents in total for evaluation. Various clustering algorithms (e.g. k-means, agglomerative clustering, divisive clustering) with different parameter settings were used to group the documents into 10 clusters. The clustering result of each algorithm was compared with the ground-truth classes using both F-measure and OMES, and was also assessed by human subjects.

The clustering results of the different algorithms were ranked according to the overall evaluation values, and the ranked lists based on F-measure and OMES were compared with the list based on human judgment. To quantify the agreement, we computed the Spearman rank-order correlation coefficient ρ between each automatic ranked list and the human-labeled ranked list:

\rho = 1 - \frac{6 \sum D^2}{N (N^2 - 1)}    (4)

where D is the difference between the ranks of a clustering result in the two lists and N is the number of clustering algorithms. Table 1 shows the results: the OMES-based ranking of the clustering results correlates more highly with human judgment than the F-measure-based ranking.

Table 1. Results of the preliminary study
F-Measure vs. Subject: 0.90
OMES vs. Subject: 0.95

The preliminary study demonstrates the effectiveness of the proposed OMES evaluation strategy for document clustering. In future work we will perform a more thorough user study to make the conclusion more convincing, and we will investigate how to appropriately balance the sizes of classes and clusters when q is unequal to p.

4. REFERENCES
[1] Baeza-Yates, R., and Ribeiro-Neto, B. Modern Information Retrieval. ACM Press and Addison Wesley, 1999.
[2] Jain, A. K., Murty, M. N., and Flynn, P. J. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[3] Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
[4] Munkres, J. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5:32-38, 1957.
[5] Steinbach, M., Karypis, G., and Kumar, V. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[6] TDT3. http://projects.ldc.upenn.edu/TDT3/