WWW 2007 / Poster Paper (Topic: Search)

A Clustering Method For Web Data With Multi-Type Interrelated Components

Levent Bolelli (1), Seyda Ertekin (1), Ding Zhou (1), C. Lee Giles (1,2)
(1) Department of Computer Science and Engineering
(2) School of Information Sciences and Technology
The Pennsylvania State University, University Park, PA 16802
{bolelli, sertekin, dzhou}@cse.psu.edu, giles@ist.psu.edu

ABSTRACT

Traditional clustering algorithms work on "flat" data, making the assumption that the data instances can only be represented by a set of homogeneous and uniform features. Much real-world data, however, is heterogeneous in nature, comprising multiple types of interrelated components. We present a clustering algorithm, K-SVMeans, that integrates the well-known K-Means clustering with the highly popular Support Vector Machines (SVM) in order to utilize the richness of the data. Our experimental results on authorship analysis of scientific publications show that K-SVMeans achieves better clustering performance than homogeneous data clustering.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering, Algorithms

General Terms
Algorithms, Experimentation

Keywords
K-SVMeans, Multi-Type Data Clustering, Online SVM, K-Means

1. INTRODUCTION

Discovery of latent semantic groupings and identification of intrinsic structures in datasets is a crucial task for many data analysis needs. Most real-world data, especially data available on the web, possesses rich structural relationships, such as web images and surrounding texts, web pages and hyperlinks, scientific publications and authors. In these examples, the secondary data types are often either neglected by traditional clustering algorithms, or individual clusterings on each dimension are mapped onto a combined clustering solution. The former approach under-utilizes the information available to the clusterer, whereas the latter neglects the structural relationships between the individual data types.

We present K-SVMeans clustering, which integrates two sources of information into a single clustering framework. The clustering along the main data type of interest is performed using the popular K-Means algorithm, and relational similarity in the additional dimension is learned through Online Support Vector Machines [1]. The most significant advantage of online SVMs is that they are not batch learners and can therefore handle streaming data. The ability to use SVMs in an online setting enables us to integrate them efficiently with unsupervised learning algorithms, and, as will be shown later, this combination does not require manually labeled data for SVM training.

2. K-SVMEANS CLUSTERING

The original formulation of the K-Means algorithm first initializes p clusters with data objects and then assigns each object x_i, 1 <= i <= N, to a cluster c_j, 1 <= j <= p, where x_i's distance to the representative of its assigned cluster c_j is minimum. Variants of K-Means differ in the initialization of clusters (e.g. random or maximum cluster distance initialization), the definition of similarity (e.g. Euclidean distance or Kullback-Leibler divergence), or the definition of cluster representativeness (e.g. mean, median or weighted centroid vector). The K-SVMeans algorithm is independent of these variations, but for brevity we describe it for Spherical K-Means with random initialization, where each cluster is represented by its centroid vector; a minimal sketch of one Spherical K-Means pass is given below.
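The following sketch shows the assignment and centroid-update pass of Spherical K-Means, assuming L2-normalized document vectors so that cosine similarity reduces to a dot product. It is illustrative only; the experiments in Section 3 use the Gmeans toolkit [2] for this component.

    import numpy as np

    def spherical_kmeans_step(X, C):
        """One Spherical K-Means pass.

        X -- (n_docs, n_features) array, rows L2-normalized
        C -- (k, n_features) array of centroid vectors, rows L2-normalized
        """
        sims = X @ C.T                  # cosine similarity via dot product
        labels = sims.argmax(axis=1)    # nearest (most similar) centroid
        # Recompute each centroid as the normalized mean of its members.
        for j in range(C.shape[0]):
            members = X[labels == j]
            if len(members) > 0:
                c = members.mean(axis=0)
                C[j] = c / np.linalg.norm(c)
        return labels, C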
During the K-SVMeans clustering process, an SVM is trained for each cluster on the additional (secondary) dimension of the data. For instance, in document clustering, the documents are clustered using K-Means, and an SVM for each cluster is trained on the authors of the documents that belong to that cluster.

The clustering decisions in K-SVMeans can be represented as follows. Let us denote the objects in the primary data type as X = (x_1, x_2, ..., x_n) and the secondary data type as U = (u_1, u_2, ..., u_m). Let u_ij represent the relationship between x_i and u_j, and let u_i denote the set of u's connected to x_i. Intermediate cluster assignment decisions in K-SVMeans are determined by two conditions (a sketch of this rule is given at the end of this section). A data object x_i is moved from cluster c_j to c_k when 1) x_i is closer to c_k's centroid, c_j's SVM classifies u_i as negative, and c_k's SVM classifies u_i as positive (both K-Means and the SVM have to agree on the cluster assignment change); or 2) in case the SVM learner of x_i's candidate cluster c_k decides that u_i do not belong to that cluster (i.e. the decision values of the u_i are negative), we apply a penalty term to the distance function of K-Means, so that the similarity between x_i and the candidate cluster centroid must be strong enough to warrant a cluster assignment change of x_i. The penalty term also ensures that the SVM learners are not adversely affected by incorrect clustering decisions of K-Means that would result in mislabeling of the u_i: only highly similar x_i are allowed a cluster change when the SVM classification decision is not trusted. If x_i moves from c_j to c_k, the u_i are added to the SVM of c_k as positive observations, and to the SVMs of c_p, p != k, as negative observations.

K-SVMeans can be run in multiple iterations, where the initialization of the SVM learners is performed using the clustering solution generated in the previous run. In the first iteration, we run the standard K-Means algorithm to yield a clustering based on the primary data type X. This iteration has two purposes. First, we use the clustering result from this step as a baseline for comparison. Second, and more importantly, it generates the labeled initialization set for the SVM learners of K-SVMeans. At the beginning of an iteration t + 1, K-SVMeans looks at each cluster c_i^t generated in the previous run, selects the m objects closest to the centroid of c_i^t, and uses their associated u's for the SVM initialization of c_i. We use one-against-rest classification in the SVMs, so the u's become positive observations for their respective clusters and negative observations for the rest of the clusters.
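The sketch below summarizes the cluster-change rule described above. It is one plausible reading rather than the paper's exact implementation: the precise form of the penalty is not specified, and it is assumed here to act as a multiplier on the similarity the candidate cluster must reach (the experiments in Section 3 set it to 1.5). The SVM decision values are taken as a single aggregate score (e.g. the mean over x's u objects).

    def should_move(sim_cur, sim_cand, dec_cur, dec_cand, penalty=1.5):
        """Decide whether object x moves from its current cluster to a candidate.

        sim_cur  -- cosine similarity of x to its current centroid
        sim_cand -- cosine similarity of x to the candidate centroid
        dec_cur  -- current cluster's SVM decision value on x's u objects
        dec_cand -- candidate cluster's SVM decision value on x's u objects
        """
        if sim_cand <= sim_cur:
            return False              # K-Means itself must prefer the move
        if dec_cur < 0 and dec_cand > 0:
            return True               # condition 1: both learners agree
        # Condition 2: the candidate SVM does not endorse the move, so the
        # centroid similarity must clear a penalized threshold instead.
        return sim_cand > penalty * sim_cur

After an accepted move, x's u objects are fed to the online SVM of the new cluster as positive observations and to the SVMs of all other clusters as negative observations, as described above.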
3. EXPERIMENTS

We conducted experiments on a subset of CiteSeer's (http://citeseer.ist.psu.edu) repository of scientific literature to evaluate the clustering performance of K-SVMeans, comparing the predicted cluster of each document with the categorical labels from the document corpus. The CiteSeer dataset we used contains 7623 papers from 16 conferences, authored by 5623 distinct authors. The papers are grouped into 5 topical categories based on their publication venues. Each author a_i is represented as a collection of the words in the documents that a_i has (co)authored. Since each document can potentially have multiple authors, each author is represented as

    f_j^{a_i} = \sum_{d_k \in a_i} \frac{1}{\mathrm{Rank}(a_i, d_k)} \cdot w(f_j, d_k)    (1)

where Rank(a_i, d_k) is the rank of authorship of author a_i in document d_k and w(f_j, d_k) is the TF-IDF score of feature f_j in d_k. The author vectors are L2-normalized to eliminate the effects of different document lengths and different numbers of authored documents; a sketch of this construction follows.
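This short sketch illustrates the author representation in Eq. (1): an author's vector is the rank-discounted sum of the TF-IDF vectors of the papers they (co)authored, L2-normalized afterwards. The data structures are illustrative, not taken from the paper.

    import numpy as np

    def author_vector(doc_tfidf, authored):
        """doc_tfidf -- dict mapping doc_id to its TF-IDF vector (np.ndarray)
        authored  -- list of (doc_id, rank) pairs, where rank is the author's
                     position in that document's author list (1 = first author)
        """
        v = np.zeros_like(next(iter(doc_tfidf.values())))
        for doc_id, rank in authored:
            v += doc_tfidf[doc_id] / rank     # 1/Rank(a_i, d_k) weighting
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v    # L2 normalization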
We initialize K-SVMeans(x1) with 50 authors from the clustering solution obtained from the K-Means iteration, and increase the number of initialization authors by 50% at each successive iteration of K-SVMeans. The penalty term that accounts for SVM misclassification of authors in the clustering distance function of the documents is set to 1.5 empirically. As the evaluation metric, we used the standard F1 measure, the harmonic mean of precision (p) and recall (r). Our reported results are micro-averaged F1 scores, which give equal weight to each document and are independent of cluster sizes. For the K-Means clustering component of the K-SVMeans algorithm, we used the Gmeans clustering toolkit [2], which we integrated with the LASVM package [1]. We report results for two clustering criterion functions of K-Means, averaged over ten runs. The first is Euclidean K-Means, which makes clustering decisions based on the Euclidean distances between the document vectors. The second is Spherical K-Means, which uses the cosine distances between documents as the similarity metric. For both clusterings, we experimented with two initialization schemes. In the first scheme, each document is initially assigned a random cluster ID. The second scheme chooses one cluster centroid as the farthest point from the center of the whole data set, so that all cluster centroids are well separated.

Table 1: Experimental results based on the F1 scores of the clustering solutions.

Distance / Cluster Init.    K-Means    K-SVMeans(x1)    K-SVMeans(x2)    K-SVMeans(x3)
Spherical / Random          68.418     73.318           76.102           76.194
Spherical / Well Sep.       69.306     75.243           77.713           80.596
Euclidean / Random          55.945     60.284           61.575           62.082
Euclidean / Well Sep.       58.712     64.392           65.941           66.746

Our experimental results (Table 1) show that K-SVMeans outperforms K-Means significantly, regardless of the clustering criterion function or the initialization scheme of K-Means. K-SVMeans(x2) and K-SVMeans(x3) are the second and third iterations of the clustering, respectively. The inclusion of more authors in the SVM initialization set at each successive iteration enables the learners to build accurate models earlier in the clustering process, and thus increases the clustering accuracies.

4. CONCLUSIONS

Traditional clustering algorithms are not sufficient to deal with existing (and emerging) data that is heterogeneous in nature, where relationships between objects can be represented through multiple layers of connectivity. We presented a novel clustering algorithm, K-SVMeans, designed to perform clustering on richly structured multivariate datasets. Our experimental results on the integration of authorship analysis with topical clustering of documents show significant improvements over traditional K-Means, and confirm that there is great benefit in incorporating additional dimensions of similarity into a unified clustering solution.

5. REFERENCES

[1] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579-1619, September 2005.

[2] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, January 2001.