Document Clustering - an Optimization Problem

Ao Feng
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
aofeng@cs.umass.edu

ABSTRACT
Clustering algorithms have been widely used in information retrieval applications. However, it is difficult to define an objective "best" result. This article analyzes some document clustering algorithms and illustrates that they are equivalent to the optimization of global objective functions. Experiments show their good performance, but there are still counter-examples where they fail to return the optimal solution. We argue that Monte-Carlo algorithms in the global optimization framework have the potential to find better solutions than traditional clustering, and that they are able to handle more complex structures.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering
General Terms: Algorithms, Performance, Theory
Keywords: Clustering algorithm, Global optimization, Simulated annealing

1. INTRODUCTION
For a text collection with a large number of unlabeled documents, a common analysis is to run a clustering process based on the proximity among the elements. Documents in a cluster are very likely to match the same information need [2]. There are many clustering algorithms [1], and most evaluation methods involve manual annotation of "truth" data. How to define "optimal" clusters without any subjective relevance judgment is still an open problem.

Some clustering algorithms have been identified as equivalent to the optimization of their cost functions [1]. In this article, we show that this holds for most (if not all) clustering methods. Section 2 introduces a few algorithms and defines their corresponding functions. Section 3 compares the performance of clustering to simulated annealing and shows when it fails. Conclusions and further extensions of the framework are given in Section 4.

2. CLUSTERING ALGORITHMS
There are mainly two types of clustering algorithms, hierarchical and partitional. In this section, we select one representative method from each and show that it corresponds to a global optimization problem. We believe that other algorithms have their own global functions as well.

2.1 Hierarchical Clustering
With a pair-wise similarity matrix of n documents, hierarchical agglomerative clustering (HAC) starts from n singleton clusters (each containing exactly one document). In each round, the most similar cluster pair is merged, and the process continues until the highest similarity falls below the preset clustering threshold.

Since only the similarity matrix is available, the goal of clustering is to group together documents that have high similarities and keep non-similar documents apart. If we formulate a relation matrix R, where R_{ij} = 0 when documents d_i and d_j are in the same cluster and R_{ij} = -1 otherwise, a global score function can be defined as

    S = \sum_{i \neq j} score(i, j, R_{ij})    (1)

    score(i, j, R_{ij}) = \begin{cases} c & \text{if } R_{ij} = -1 \\ sim(d_i, d_j) & \text{if } R_{ij} = 0 \end{cases}    (2)

where d_i, d_j are documents and c is a constant. If c is set to the clustering threshold, the global score S always increases during HAC. Suppose that C_k and C_l are the most similar clusters, with

    sim(C_k, C_l) = \frac{\sum_{d_i \in C_k, d_j \in C_l} sim(d_i, d_j)}{|C_k| |C_l|} > c    (3)

After merging them, the change in S is

    S' - S = \sum_{d_i \in C_k, d_j \in C_l} sim(d_i, d_j) - c |C_k| |C_l| > 0    (4)

In a divisive clustering algorithm, the basic operation is to split a cluster at an edge with low across-edge similarity. If the average similarity across the cut is smaller than c, the split also raises S.
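To make Equations 1-4 concrete, the following is a minimal sketch of how the global score S and the average-link merge criterion could be computed from a pair-wise similarity matrix. The function names, the list-of-lists cluster representation, and the brute-force greedy loop are illustrative assumptions; the paper does not specify its implementation.

```python
import numpy as np

def global_score(sim, clusters, c):
    """Global score S from Equations 1-2: a pair in the same cluster
    contributes its similarity sim(d_i, d_j); a pair in different
    clusters contributes the constant c. `clusters` is a partition
    given as a list of lists of document indices."""
    n = sim.shape[0]
    label = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        label[members] = k
    S = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                S += sim[i, j] if label[i] == label[j] else c
    return S

def hac(sim, c):
    """Greedy average-link HAC: repeatedly merge the cluster pair with the
    highest average inter-cluster similarity (Equation 3), stopping once
    that similarity no longer exceeds the threshold c."""
    clusters = [[i] for i in range(sim.shape[0])]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = sim[np.ix_(clusters[a], clusters[b])].mean()
                if avg > best:
                    best, pair = avg, (a, b)
        if best <= c:                 # highest similarity fell below threshold
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]   # this merge raises S by Equation 4
        del clusters[b]
    return clusters
```

Under this formulation, every merge performed by hac() increases global_score() by exactly the quantity in Equation 4, which is the sense in which HAC hill-climbs on S.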
2.2 Partitional Clustering
K-means is an algorithm that clusters objects in a vector space into k partitions. It starts with k initial cluster centroids and assigns each object to its closest one. In each round, the centroids are recalculated and the objects are reassigned. It keeps running until the clusters converge. The K-means process tries to minimize the intra-cluster variance

    V = \sum_{i=1}^{k} \sum_{x_j \in C_i} |x_j - \mu_i|^2    (5)

where x_j is an object, C_i is a cluster and \mu_i is its centroid. It is easy to prove that V decreases in each step.

3. ANALYSIS
In the previous section, we described two clustering algorithms and showed their global functions. The clustering process corresponds to the optimization (in a hill-climbing manner) of the function. A natural question is, are they guaranteed to return the best solution?

The result of K-means depends greatly on the initial selection of centroids. Even with a deterministic process, different starting states lead to different results. Sometimes the solution after convergence is much worse than the global optimum, indicating that there are local minima in the topology of the intra-cluster variance V.

3.1 Evaluation
There are two ways to evaluate the clustering output. With the global functions defined above, we can calculate the objective: values closer to the optimum mean better results. The other is to compare system-generated clusters to some "truth" data, where a better match gets a higher score. The latter depends on the quality of the manual annotation.

Assume that we randomly select two objects x_i and x_j; each of them is assigned to some cluster in the system output and in the truth data, respectively. If they have the same membership status in both cases, it is regarded as a success for the system.

    precision = P(C(x_i) = C(x_j) \mid C'(x_i) = C'(x_j))
    recall = P(C'(x_i) = C'(x_j) \mid C(x_i) = C(x_j))
    F1 = \frac{2 \times precision \times recall}{precision + recall}    (6)

where C(x_i) is the cluster x_i belongs to in the truth data, and C'(x_i) is its cluster in the system output.
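As an illustration of the pair-counting metrics in Equation 6, the sketch below counts, over all object pairs, how often the truth data and the system output agree on co-membership. The function name pairwise_prf and the label-list input format are assumptions for this example, not the paper's code.

```python
from itertools import combinations

def pairwise_prf(truth, system):
    """Pairwise precision, recall and F1 in the sense of Equation 6.
    truth[i] and system[i] are the cluster ids of object i in the
    manual annotation and in the system output, respectively."""
    n = len(truth)
    both = truth_only = system_only = 0
    for i, j in combinations(range(n), 2):
        same_truth = truth[i] == truth[j]
        same_system = system[i] == system[j]
        if same_truth and same_system:
            both += 1          # pair grouped together in both
        elif same_truth:
            truth_only += 1    # grouped in truth, split by the system
        elif same_system:
            system_only += 1   # grouped by the system, split in truth
    precision = both / (both + system_only) if both + system_only else 0.0
    recall = both / (both + truth_only) if both + truth_only else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: truth has clusters {0,1} and {2,3}; the system merged everything.
print(pairwise_prf([0, 0, 1, 1], [0, 0, 0, 0]))  # roughly (0.333, 1.0, 0.5)
```

Putting every document in one cluster drives recall to 1 at the cost of precision, which is why the harmonic mean F1 is used for tuning the clustering threshold.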
3.2 Implementation
The collection used in the experiments is part of TDT3 (available from the Linguistic Data Consortium (LDC), catalog number LDC2001T58). Six topics are selected from the same scenario (science/discovery), with a total of 280 news stories. They are divided into about 30 manually generated clusters, each describing a news event. The similarity matrix is calculated with tf-idf, where the term vectors are built from the body text of the stories. The only parameter in HAC is the clustering threshold, and we tune it to maximize the F1 defined in Equation 6; the optimal threshold in the experiment is 0.09.

Simulated annealing (SA) is implemented to optimize the global score S in Equation 1. It also starts with a list of singleton clusters (R is all -1 except for the diagonal elements, which are 0). In each round, a story d_i is randomly picked, and another story d_j is selected based on the probability distribution of R_i. If R_{ij} = 0, the cluster they are in is broken into two parts, with d_i and d_j in different parts. If R_{ij} = -1, the clusters containing d_i and d_j are merged. Whenever S increases after a round, the change is kept; otherwise, it is kept with some probability. Since SA does not have deterministic output, we run it 20 times, and the best runs are shown in Table 1.

Algorithm     precision   recall   F1      S
HAC           0.330       0.540    0.410   3787.6
SA-best S     0.267       0.535    0.356   3779.1
SA-best F1    0.307       0.555    0.395   3773.1

Table 1: Performance of HAC and SA

HAC achieves better performance on both F1 and S, but the SA runs are very close to it. In the experiment, HAC gets a higher S than all SA runs. Does that mean HAC always finds the optimal solution? Figure 1 shows a counter-example. HAC combines nodes 1 and 2 first since they are the most similar pair; after that, no other nodes can be merged because the average similarity cannot exceed 0.4. However, the best clusters are {1,3} and {2,4}, which score 2.6 instead of 2.55 for S. Monte-Carlo algorithms can easily find the ideal solution with their ability to get out of local maxima.

[Figure 1: Counter-example. HAC result shown with solid lines, the optimal solution with dashed lines.]

4. CONCLUSIONS
With two candidate algorithms, we have shown that many clustering methods correspond to the optimization of objective global functions, and that they often fail to return the optimal solution. The comparison with HAC shows that SA in the global optimization framework achieves performance at least close to that of the deterministic algorithm, and it has the potential to find better results. However, its computational complexity limits its application to large collections.

Another advantage of the global optimization framework is that it can model more complex relations. For example, news stories contain contextual information, which encodes logical, causal, temporal or spatial links among reports, in addition to the term similarity that is used to cluster documents. Optimizing multiple relations together is likely to yield better results than a traditional clustering algorithm for such applications.

Acknowledgments
This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

5. REFERENCES
[1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[2] C. J. van Rijsbergen and K. S. Jones. A test for the separation of relevant and non-relevant documents in experimental retrieval collections. Journal of Documentation, 29(3):251-257, 1973.