Document Clustering with Prior Knowledge

Xiang Ji
Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089, U.S.A.
xiangji@yahoo-inc.com

Wei Xu, Shenghuo Zhu
NEC Labs America, Inc., 10080 North Wolfe Road, Cupertino, CA 95014, U.S.A.
{xw,zsh}@sv.nec-labs.com

(This work was done while at NEC Labs America.)

ABSTRACT
Document clustering is an important tool for text analysis and is used in many different applications. We propose to incorporate prior knowledge of cluster membership into document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents as a weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity value of the two corresponding documents. The prior knowledge indicates pairs of documents that are known to belong to the same cluster, and it is transformed into a set of constraints. The document clustering task is then accomplished by finding the best cuts of the graph under these constraints. We apply the model to the Normalized Cut method to demonstrate the idea and concept. Our experimental evaluations show that the proposed document clustering model yields remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; I.5.3 [Computing Methodologies]: Clustering

General Terms
Algorithms, Experimentation, Theory

Keywords
Semi-supervised learning, Spectral clustering, Clustering with prior knowledge

1. INTRODUCTION
Document clustering is the unsupervised classification of a given document set into clusters such that documents within each cluster are more similar to each other than to those in different clusters. As an important tool for exploratory data analysis, document clustering is an enabling technique for a wide range of information retrieval tasks such as efficient organization [16, 24], browsing [3, 6] and summarization [14] of large volumes of text documents. Document clustering generally assumes no training samples from the user, and aims to partition an unlabeled sample set into a user-specified number of disjoint clusters through an unsupervised, exploratory process. Most clustering techniques define loss functions over all possible cluster assignments. The loss functions measure the degree to which the clustering goal is met, and by optimizing them the optimal assignment is obtained. The optimization problem is typically solved without considering any prior knowledge about the cluster assignment of the data set. In practice, however, there is usually some prior knowledge available for many data clustering problems. This prior knowledge includes, but is not limited to, the cluster assignment of some data instances and the sizes of clusters.
It comes either from the nature of the data sets or from clues derived through deliberate exploration of the data sets. The prior knowledge can be used to form constraints on the objective functions of cluster analysis. By applying the constraints to the optimization problem, prior knowledge about the data clustering is incorporated into clustering methods to improve the clustering performance.

In this paper, we present a novel semi-supervised clustering model that incorporates prior knowledge about documents' membership into document cluster analysis. It can also be viewed as a technique that enables the user to control the document clustering process by providing his/her prior knowledge specific to the target data set to be clustered. The prior knowledge is provided by indicating pairs of documents that are known to belong to the same cluster, and the number of such training samples can be arbitrary. The user's prior knowledge is then enforced as a set of constraints, and the document clustering task is accomplished by finding the cluster set that globally optimizes the constrained cost function. The proposed clustering model can be applied to any spectral clustering method; in this paper, we apply it to the Normalized Cut method, which has been shown to be one of the best spectral clustering methods [17], to demonstrate the idea and concept.

The clustering model proposed in this paper can be considered as a mechanism for incorporating the user's supervision into the unsupervised document clustering process, which results in a manageable mixture of supervised classification and unsupervised clustering. When there is no prior knowledge about the target data set, the model is equivalent to the ordinary data clustering method. When the user provides prior knowledge, the model starts to show the characteristics of supervised classification, namely the attempt to generate the cluster set that minimizes the errors on the training samples. The more training samples the user provides, the more resemblance the model bears to supervised data classification. Our experimental evaluations have shown that this data clustering model is very effective for clustering document corpora with extreme distributions, and for performing data clustering with special user needs.

The remainder of the paper is organized as follows. We review previous work on various data clustering methods in Section 2. Section 3 presents our proposed model, which contains the weighted graph model, the constraints, and the constrained spectral clustering model. In Section 4, we discuss details of our experiments. We conclude the paper in Section 5 and discuss several possible future research topics.

2. TRADITIONAL VS. SEMI-SUPERVISED CLUSTERING
In this section, we first review problems associated with traditional data clustering methods, and then review related work on semi-supervised clustering and classification.

2.1 Problems of Traditional Clustering Methods
To date, almost every clustering method employs a certain cost function to reflect the designer's general knowledge about the internal structures and distributions of target data corpora. The data clustering task is accomplished by finding the cluster set that optimizes the predefined cost function.
For example, K-means [8] uses the sum of squared distances between the data points and their corresponding cluster centers as the cost function, and partitions a data set into K clusters that locally minimize this cost function. Spectral clustering models the given data set using an affinity graph, and the data clustering task is accomplished by finding the best cuts of the graph that optimize certain predefined cost functions. Many cost functions, such as the Ratio Cut [4], Average Association [17], K-means [25], Normalized Cut [17], Min-Max Cut [5], etc., have been proposed, and the optimization of these cost functions usually leads to the computation of the top eigenvectors of certain graph affinity matrices. As the data clustering results are derived by solving certain eigen-systems, they are guaranteed to be globally optimal with respect to the predefined cost functions.

The above data clustering methods, as well as most other methods in the literature, all make use of cost functions that enforce the concept of maximizing the intra-cluster similarity and/or minimizing the inter-cluster similarity [9]. This approach works well in general cases, but fails drastically when target data sets possess complex, extreme data distributions, or when the user has special needs for the data clustering task.

Figure 1(a) shows an artificially generated data set with an extreme distribution. The data set consists of a total of 100 data points which form three natural clusters: a circular ring with 80 data points, a group of 10 data points centered at (0, 0), and another group of 10 data points centered at (25, 0). Assume that the user wants to group the data set into two clusters: the circular ring together with the group of 10 data points inside it as the first cluster, and the remaining group of 10 data points outside the circular ring as the second cluster. If we apply the ordinary Normalized Cut data clustering method to this data set, we obtain the two data clusters shown in Figure 1(b), where data points belonging to the same cluster are depicted using the same symbol. Obviously, the Normalized Cut method fails drastically by generating a completely unexpected clustering result.

[Figure 1: An artificial data set and its clustering using Normalized Cut. (a) An artificial data set. (b) Clustering into two clusters using Normalized Cut.]
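A data set with this structure is easy to reproduce. The sketch below generates the three natural clusters described above; the ring radius and noise levels are assumed values, since the text only specifies the cluster sizes and centers.

```python
import numpy as np

def make_ring_and_blobs(seed=0):
    """Three natural clusters: a ring of 80 points, 10 points around (0, 0),
    and 10 points around (25, 0). Radius and noise scales are assumptions."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, 80)
    ring = np.column_stack([15.0 * np.cos(angles), 15.0 * np.sin(angles)])
    ring += rng.normal(scale=0.5, size=ring.shape)       # thin noisy ring
    inner = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(10, 2))
    outer = rng.normal(loc=(25.0, 0.0), scale=1.0, size=(10, 2))
    X = np.vstack([ring, inner, outer])
    labels = np.array([0] * 80 + [1] * 10 + [2] * 10)    # natural cluster ids
    return X, labels
```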
The above artificial example brings about a very important question: for each given data set, there are always many possible ways of partitioning it. Which partition is the most appropriate one? The answer depends on both the internal structure and distribution of the data set, and on the user's data clustering needs. With traditional data clustering methods, certain cost functions are employed to specify the designer's general knowledge about the internal structures and distributions of the target data sets, while no mechanisms are provided to reflect the user's prior knowledge or special data clustering needs for particular data sets. This is one of the main reasons that traditional methods are prone to generate unsatisfactory results when target data sets possess complex distributions, or when users have special data clustering needs [12].

2.2 Related Work on Semi-Supervised Learning
There have been research studies that modify traditional K-means clustering to incorporate labeled data. Wagstaff et al. introduced two types of constraints, the "must-link" and the "cannot-link" constraints [19, 20], and their semi-supervised K-means produces data partitions by ensuring that none of the user-specified constraints are violated. Basu et al. also developed a semi-supervised K-means that makes use of labeled data to generate initial seed clusters and to guide the clustering process [1].

Another work closely related to our proposed data clustering model is the one from Yu and Shi, which performs image region segmentation using the Normalized Cut method with partial grouping constraints [23, 22]. Their method projects the original data onto a feasible solution space based on the constraints, and then generalizes the Rayleigh-Ritz theorem [7] to obtain the optimal solution in a relaxed continuous domain. In this new solution space, the constraints are strictly enforced with no exceptions (i.e., "hard" constraints). The performance evaluations were conducted by showing region segmentation results for a few images, and there are no quantitative measures of the performance improvement. Compared with their method, our proposed data clustering model introduces a penalty term with a penalty coefficient matrix to incorporate the user's prior knowledge about the data set. This approach essentially imposes a set of "soft" constraints on the data clustering process, which enables the user to control the degree of enforcement of the prior knowledge. With the penalty coefficient matrix, the user is even able to enforce prior knowledge from multiple sources with different strengths. Another advantage of our method is that it tolerates erroneous constraints because of its soft-constraint nature.

Besides semi-supervised clustering, there are also many research works on semi-supervised classification in the literature, such as semi-supervised EM [15], co-training [13], the spectral graph transducer [11], Gaussian random field based semi-supervised learning [27], the transductive SVM [2, 10], etc. Among these works, the transductive SVM extends the standard SVM by attempting to find a boundary that has the maximum margin on both the labeled and the unlabeled data. It is reported that the transductive SVM outperforms the standard SVM by a large margin, and it has become a benchmark for semi-supervised learning methods.

3. CONSTRAINED SPECTRAL CLUSTERING
We present a document clustering model that enables the user to supervise and control the clustering process by providing his/her prior knowledge specific to the target data set to be clustered. The user provides the prior knowledge as an arbitrary number of training samples, each of which indicates a pair of documents that the user wishes to be grouped into the same cluster. Our clustering model encodes the user's prior knowledge as a set of constraints on the cost function, and the document clustering task is carried out by solving a constrained optimization problem. The model can be applied to any spectral clustering method, and in this paper we apply it to the Normalized Cut method to demonstrate the basic idea and concept.

3.1 Normalized Cut
The Normalized Cut method, like other spectral clustering methods, models the given document set using an undirected graph G(V, E, W), where V, E, W denote the vertex set, the edge set, and the graph affinity matrix, respectively. In the graph G, each vertex V_i \in V represents a document vector, and each edge (i, j) \in E is assigned a weight w_{ij} \in W to reflect the similarity between documents i and j. Let S_i, S_j denote two document clusters of the given data set V, and let W(S_i, S_j) denote the sum of similarities between the two clusters S_i and S_j:

W(S_i, S_j) = \sum_{u \in S_i, v \in S_j} w_{uv}    (1)

The cost function employed by Normalized Cut is defined as:

F_{NC} = \sum_{i=1}^{K} \frac{W(S_i, \bar{S}_i)}{W(S_i, V)}    (2)

where K is the total number of data clusters. In Eq. (2), the numerator W(S_i, \bar{S}_i) measures how tightly the cluster S_i is connected with the rest of the data set, while the denominator W(S_i, V) measures how compact the entire data set is. Given the target data set, the Normalized Cut method attempts to partition the data into the clusters that minimize this cost function.

The strategy for finding the optimal solution to this cost function is as follows. Let N be the total number of data points in the data set V, and let X_i = [x_{1i}, x_{2i}, \ldots, x_{Ni}]^T be the indicator vector of the cluster S_i, in which each element x_{ki} takes a binary value in {1, 0} to indicate whether or not the k'th data point in the data set V belongs to S_i.
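As a concrete illustration of the graph model, the sketch below builds a cosine-similarity affinity matrix from row-wise document vectors and evaluates W(S_i, S_j) of Eq. (1). The choice of cosine similarity is an assumption for illustration; the text does not prescribe a particular w_{ij}.

```python
import numpy as np

def affinity_matrix(X):
    """Cosine-similarity affinity W for row-wise document vectors X.
    (One possible choice of w_ij; the paper does not fix a kernel.)"""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W

def cut_weight(W, Si, Sj):
    """W(S_i, S_j) = sum of w_uv over u in S_i, v in S_j, as in Eq. (1)."""
    return W[np.ix_(Si, Sj)].sum()
```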
It is easy to prove that the following identities hold:

|S_i| = X_i^T X_i    (3)
W(S_i, S_j) = X_i^T W X_j    (4)
W(S_i, V) = X_i^T D X_i    (5)
W(S_i, \bar{S}_i) = X_i^T (D - W) X_i    (6)

where D is the diagonal matrix such that D \cdot e = W \cdot e, with e = [1, 1, \ldots, 1]^T. Using these identities, the cost function F_{NC} can be rewritten as:

F_{NC} = \sum_{i=1}^{K} \frac{X_i^T (D - W) X_i}{X_i^T D X_i}
       = K - \sum_{i=1}^{K} \frac{X_i^T W X_i}{X_i^T D X_i}
       = K - \sum_{i=1}^{K} \frac{X_i^T D^{1/2} D^{-1/2} W D^{-1/2} D^{1/2} X_i}{X_i^T D^{1/2} D^{1/2} X_i}
       = K - \sum_{i=1}^{K} Y_i^T D^{-1/2} W D^{-1/2} Y_i    (7)

where Y_i = D^{1/2} X_i / \|D^{1/2} X_i\|. Let Y = [Y_1, Y_2, \ldots, Y_K]. Then we have Y^T Y = I.

By introducing the above cluster indicator vectors X_i, we can turn the problem of finding the optimal clustering result into the following optimization problem: find the set of K indicator vectors [X_1, X_2, \ldots, X_K] with binary-valued elements that minimizes Eq. (7). Solving this optimization problem exactly has been proven to be NP-hard. However, if we relax all the elements of each indicator vector X_i from binary values to real values, the optimization problem can be easily solved under the constraint Y^T Y = I. As in [7],

F_{NC} \geq K - (\lambda_1 + \lambda_2 + \cdots + \lambda_K)    (8)

where \lambda_1, \ldots, \lambda_K are the K largest eigenvalues of the matrix D^{-1/2} W D^{-1/2}, and F_{NC} reaches the minimum when Y_1, \ldots, Y_K are the eigenvectors associated with the K largest eigenvalues.

The K eigenvectors Y_1, \ldots, Y_K obtained above encode the cluster membership information for the given data set V. However, since these eigenvectors take real values for their elements, they do not directly indicate the cluster membership of each data point. A common approach for deriving the final cluster set is to project each data point into the eigen-space spanned by the above K eigenvectors (i.e., V_i is represented by the i'th row vector of the matrix Y = [Y_1, Y_2, \ldots, Y_K]), and apply the K-means algorithm within this eigen-space.
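A minimal sketch of this unconstrained pipeline (affinity matrix in, cluster labels out), assuming dense matrices and using SciPy for the eigen-decomposition and scikit-learn for the final K-means step in the eigen-space:

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def normalized_cut(W, k, random_state=0):
    """Plain Normalized Cut as described above: take the k largest
    eigenvectors of D^{-1/2} W D^{-1/2}, represent each document by the
    corresponding row of Y, and run k-means in that eigen-space."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    M = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    vals, Y = eigsh(M, k=k, which='LA')                   # k largest eigenvalues
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(Y)
    return labels
```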
3.2 Incorporating Prior Knowledge
The proposed document clustering model enables the user to supervise and control the clustering process by providing his/her prior knowledge specific to the target data set to be clustered. The prior knowledge is provided by indicating several pairs of documents that the user wishes to be grouped into the same cluster.

Assume that the user has indicated that data points V_i, V_j \in V belong to the same cluster. To encode this prior knowledge, we introduce the constraint vector U_c = [u_{1c}, u_{2c}, \ldots, u_{Nc}]^T of dimension N, in which u_{ic} = 1, u_{jc} = -1, and the rest of the elements all equal zero. The user can specify an arbitrary number of training pairs, and for each training pair a constraint vector U_c is introduced to encode the knowledge. Let U^T = [U_1, U_2, \ldots, U_C] be the constraint matrix in which each column U_c is a constraint vector, and let X = [X_1, X_2, \ldots, X_K] be the indicator matrix in which each column X_i is the indicator vector of the cluster S_i. Obviously,

U X = 0    (9)

To enforce the user's prior knowledge, we add a penalty term to the original Normalized Cut cost function as follows:

F_{CNC} = \sum_{i=1}^{K} \frac{X_i^T (D - W) X_i}{X_i^T D X_i} + \beta \| U X \|^2    (10)

where \beta \geq 0 is a control parameter that controls the degree of enforcement of the user's prior knowledge. The larger the value of \beta, the stronger the enforcement of the user's prior knowledge, and the more strictly the data clustering result will obey the user's training samples; and vice versa. In order to apply a different control value to each constraint, \beta can also be defined as a diagonal matrix, \beta = diag(\beta_1, \beta_2, \ldots, \beta_C).

Similar to the Normalized Cut method, we try to find the best data clustering result by computing the indicator matrix X = [X_1, X_2, \ldots, X_K] that minimizes the cost function F_{CNC}. Using the symbols Y_i and Y defined in Section 3.1, we can rewrite Eq. (9) as:

U D^{-1/2} Y \approx 0    (11)

By replacing X_i in Eq. (10) with Y_i, we have

F_{CNC} \approx \sum_{i=1}^{K} Y_i^T D^{-1/2} (D - W) D^{-1/2} Y_i + \beta \| U D^{-1/2} Y \|^2
        = trace(Y^T D^{-1/2} (D - W) D^{-1/2} Y) + \beta \cdot trace(Y^T D^{-1/2} U^T U D^{-1/2} Y)
        = trace(Y^T D^{-1/2} (D - W + \beta U^T U) D^{-1/2} Y)    (12)

Similar to Eq. (7), if we relax Y_i to take real values, the above cost function F_{CNC} can be minimized with a globally optimal solution. According to [7],

F_{CNC} \geq \lambda_N + \lambda_{N-1} + \cdots + \lambda_{N-K+1}    (13)

where \lambda_N, \ldots, \lambda_{N-K+1} are the K smallest eigenvalues of the matrix D^{-1/2} (D - W + \beta U^T U) D^{-1/2}, and the minimum is achieved when Y_1, \ldots, Y_K are the eigenvectors associated with these K smallest eigenvalues. Once the K eigenvectors are obtained, we project all the data points in the data set V into the eigen-space spanned by these K eigenvectors, and apply the K-means algorithm within this eigen-space.

The algorithm of applying our proposed data clustering model to the Normalized Cut method is summarized as follows:

1. Create the graph affinity matrix W = [w_{ij}], in which each element w_{ij} represents the similarity between the two documents V_i and V_j.
2. Compute the diagonal matrix D such that D \cdot e = W \cdot e, e = [1, 1, \ldots, 1]^T.
3. Form the constraint matrix U with U^T = [U_1, U_2, \ldots, U_C], where each constraint vector U_c encodes a particular training pair provided by the user.
4. Form the matrix \hat{W} = D^{-1/2} (D - W + \beta U^T U) D^{-1/2}, and compute its K smallest eigenvalues \lambda_N, \ldots, \lambda_{N-K+1} and the corresponding eigenvectors Y_1, \ldots, Y_K.
5. Project each document V_i into the eigen-space spanned by the above K eigenvectors (i.e., V_i is represented by the i'th row vector of the matrix Y = [Y_1, Y_2, \ldots, Y_K]). Apply the K-means algorithm to find the K document clusters within this eigen-space.

The computational cost of the above algorithm depends mainly on the total cost of computing the eigenvectors. The eigenvectors can be obtained by the Lanczos method, whose computation cost is proportional to the number of nonzero elements of the matrix \hat{W}, nnz(\hat{W}). Thus the cost of the algorithm is O(k N_{Lanczos} nnz(\hat{W})), where k is the number of eigenvectors desired and N_{Lanczos} denotes the number of Lanczos iteration steps.
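The five steps above map directly onto a short implementation. The sketch below is a minimal version of the constrained algorithm under the same assumptions as the earlier Normalized Cut sketch (dense matrices, a scalar \beta); it is an illustration, not the authors' original code.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def constraint_matrix(n_docs, must_link_pairs):
    """One row per training pair (i, j): u_i = 1, u_j = -1, all other entries 0."""
    U = np.zeros((len(must_link_pairs), n_docs))
    for c, (i, j) in enumerate(must_link_pairs):
        U[c, i], U[c, j] = 1.0, -1.0
    return U

def constrained_ncut(W, must_link_pairs, k, beta=20.0, random_state=0):
    """Sketch of the five-step CNC algorithm (beta = 20 as in the experiments)."""
    U = constraint_matrix(W.shape[0], must_link_pairs)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    A = np.diag(d) - W + beta * U.T @ U                    # D - W + beta * U^T U
    W_hat = (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    vals, Y = eigsh(W_hat, k=k, which='SA')                # k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10,
                  random_state=random_state).fit_predict(Y)
```

With an empty constraint list (or \beta = 0) the penalty term vanishes and this reduces to ordinary Normalized Cut, mirroring the observation that CNC is equivalent to NC when no prior knowledge is given.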
4. PERFORMANCE EVALUATIONS
In this section, we present the details of our performance evaluations, including the descriptions of the data sets, the evaluation metrics, and the performance comparisons.

4.1 Data Description
We have evaluated the performance of our document clustering model using two data sets: the Reuters-21578 and the 20 Newsgroups document corpora.

The Reuters-21578 document corpus has been one of the most widely used data sets for document clustering and classification purposes. This corpus contains 21578 documents which have been manually grouped into 135 clusters, and each document has been assigned one or more labels indicating which topic(s) it belongs to. In our tests, we discarded documents with multiple category labels and removed the clusters with fewer than 40 documents. This leads to a data set that consists of 21 clusters with a total of 8631 documents. Table 1 provides the statistics of the Reuters document corpus.

Table 1: Statistics of the Reuters-21578 document corpus.
                      Reuters-21578
No. documents             21578
No. docs. used             8631
No. clusters                135
No. clusters used            21
Max. cluster size          3923
Min. cluster size            40
Med. cluster size            90
Avg. cluster size           411

The 20 Newsgroups data set contains about 20000 documents that were collected from 20 newsgroups in the public domain. The topics of these newsgroups are very diverse, ranging from computer graphics and automobiles to religion and politics. The number of news articles in each newsgroup is roughly the same.

We pre-processed each document by tokenizing the text into a bag of words. Then we applied down-casing, stop-word removal, and stemming to the words. Based on the processed words, a feature vector is constructed for each document with TF-IDF weighting and normalization.
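A sketch of this preprocessing step using scikit-learn: the vectorizer shown handles lower-casing, stop-word removal, TF-IDF weighting, and L2 normalization, while the stemming step used in the paper would require a custom analyzer and is omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_features(docs):
    """Bag-of-words -> lowercase -> stop-word removal -> TF-IDF -> L2 norm.
    (Stemming, which the paper also applies, is not included in this sketch.)"""
    vec = TfidfVectorizer(lowercase=True, stop_words='english', norm='l2')
    X = vec.fit_transform(docs)      # sparse document-term matrix
    return X, vec
```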
4.2 Evaluation Metrics
To study the performance of our proposed constrained Normalized Cut model (abbreviated as CNC), we compared it against two popular unsupervised clustering methods, K-means and Normalized Cut (NC), as well as two representative semi-supervised methods, semi-supervised K-means (SK-means) [19] and the Transductive SVM (TSVM) [10], on the Reuters-21578 and 20 Newsgroups data sets. The goals of the empirical evaluation include (1) testing whether the constrained Normalized Cut model can utilize the transductive setting, by comparing it against unsupervised clustering methods, and (2) comparing it against existing semi-supervised classification methods.

For comparisons against the unsupervised clustering methods, we use the normalized mutual information metric MI, which measures the similarity between two sets of data partitions (the generated partition and the actual classes given by the document labels). Given two sets of document clusters C and C', their mutual information MI(C, C') is defined as:

MI(C, C') = \sum_{c_i \in C, c'_j \in C'} p(c_i, c'_j) \cdot \log_2 \frac{p(c_i, c'_j)}{p(c_i) \cdot p(c'_j)}    (14)

where p(c_i) and p(c'_j) denote the probabilities that a document randomly selected from the data set belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) denotes the joint probability that this randomly selected document belongs to both clusters c_i and c'_j at the same time. MI(C, C') takes values between zero and max(H(C), H(C')), where H(C) and H(C') are the entropies of C and C', respectively. It reaches the maximum max(H(C), H(C')) when the two sets of document clusters are identical, whereas it becomes zero when the two sets are completely independent. Another important characteristic of MI(C, C') is that, for each c_i \in C, it does not need to find the corresponding counterpart in C', and the value stays the same under any permutation of the clusters. To simplify comparisons between different pairs of cluster sets, instead of using MI(C, C') we use the following normalized metric, which takes values between zero and one:

\widehat{MI}(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}    (15)

For comparisons with the semi-supervised classification methods, we use the accuracy metric AC as the performance measure. AC measures how accurately a learning method assigns labels to a data set compared with the ground truth, and is therefore more appropriate for evaluating classification tasks. The accuracy denotes the percentage of correctly classified test data out of the total number of test data, which is defined as:

AC = \frac{\sum_{i=1}^{n} \delta(y_i, \hat{y}_i)}{n}    (16)

where \delta(y_i, \hat{y}_i) equals 1 if the predicted label \hat{y}_i matches the ground-truth label y_i and 0 otherwise.
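Both metrics are straightforward to compute from two label assignments. The sketch below follows Eqs. (14)-(16) directly; the accuracy helper assumes the predicted labels are already aligned with the ground-truth classes.

```python
import numpy as np
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """MI(C, C') of Eq. (14), divided by max(H(C), H(C')) as in Eq. (15)."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = sum((nab / n) * np.log2((nab / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), nab in pab.items())
    entropy = lambda counts: -sum((c / n) * np.log2(c / n) for c in counts.values())
    return mi / max(entropy(pa), entropy(pb))

def accuracy(y_true, y_pred):
    """AC of Eq. (16): fraction of documents whose predicted label matches the truth."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```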
4.3 Experimental Results
We have tested the CNC model on both the Reuters and the Newsgroups document corpora, and compared its results with the following four methods: (1) K-means, (2) Normalized Cut, (3) semi-supervised K-means, and (4) Transductive SVM. K-means and NC are two of the most popular unsupervised data clustering methods, whereas SK-means and TSVM are representative semi-supervised data clustering and classification methods, respectively. Through these comparison studies, we reveal the relative position of our CNC model with respect to representative clustering, semi-supervised clustering, and classification methods.

For comparisons with K-means and NC, we conducted evaluations using different numbers of clusters (k) ranging from 2 to 6, without loss of generality. For each given cluster number k, 10 test runs were conducted, each on a test set created by mixing k topics randomly selected from the Reuters corpus. The final performance score was obtained by averaging the scores over the 10 test runs. We also tested the CNC model with different amounts of constraints. For each test run, the constraints were generated by pairing randomly selected documents that belong to the same topic, and the percentages of the randomly selected documents range from 2.5% to 7.5% of the total documents within the test set. A similar setting was also applied to the Newsgroups corpus.
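One plausible reading of this constraint-generation procedure, as a sketch: the exact pairing scheme is not specified in the text, so the chaining of same-topic documents below is an assumption made for illustration.

```python
import numpy as np

def sample_must_link_pairs(labels, fraction, seed=0):
    """Randomly select `fraction` of the documents and pair documents that
    share a topic, yielding must-link training pairs for the CNC model."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    pairs, last_seen = [], {}
    for i in idx:
        topic = labels[i]
        if topic in last_seen:               # pair with a previously drawn same-topic doc
            pairs.append((last_seen[topic], i))
        last_seen[topic] = i
    return pairs
```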
Tables 2 and 3 show the normalized mutual information (MI) values on Reuters and Newsgroups, respectively. In each table, the first row (labeled "K-Means") and the second row (labeled "NC") show the results generated by the traditional K-means and NC methods, respectively. The remaining rows show the results generated by the CNC model under different percentages of training samples.

Table 2: MI of K-Means, NC, and CNC on the Reuters corpus, beta = 20.
# of clusters             2       3       4       5       6      Avg.
K-Means                0.6646  0.6998  0.6367  0.6377  0.6711  0.6620
NC                     0.7348  0.8218  0.7660  0.6965  0.7066  0.7451
CNC on all data  2.5%  0.7579  0.8512  0.8063  0.7121  0.7206  0.7696
                 5.0%  0.7598  0.8574  0.7805  0.7382  0.7338  0.7739
                 7.5%  0.7707  0.8629  0.8207  0.7407  0.7622  0.7914
CNC on test data 2.5%  0.7553  0.8499  0.8036  0.7094  0.7192  0.7675
                 5.0%  0.7560  0.8553  0.7748  0.7321  0.7277  0.7692
                 7.5%  0.7643  0.8589  0.8134  0.7473  0.7530  0.7874

Table 3: MI of K-Means, NC, and CNC on the Newsgroups corpus, beta = 20.
# of clusters             2       3       4       5       6      Avg.
K-Means                0.6326  0.4071  0.3590  0.3362  0.3219  0.4114
NC                     0.7037  0.5279  0.4104  0.3892  0.3532  0.4769
CNC on all data  2.5%  0.7240  0.5628  0.4639  0.4059  0.3725  0.5058
                 5.0%  0.7358  0.5839  0.4756  0.4268  0.3782  0.5201
                 7.5%  0.7502  0.5947  0.4870  0.4416  0.3691  0.5285
CNC on test data 2.5%  0.7192  0.5589  0.4602  0.3922  0.3647  0.4990
                 5.0%  0.7308  0.5803  0.4708  0.4219  0.3758  0.5159
                 7.5%  0.7462  0.5809  0.4843  0.4357  0.3632  0.5221

It is clear from the tables that NC always outperforms K-means, and that as long as training samples are provided, CNC always outperforms NC by a large margin. The more constraints are provided to the clustering process, the more the clustering accuracy improves. If no constraint is available, CNC becomes equivalent to NC and generates the same document clustering accuracies as shown in the second row of each table.

To further reveal how the constraints affect the document clustering performance, we computed MI in two different ways. The first batch of MI values (rows 3 to 5) was computed by treating the constrained documents as part of the entire test set, while the second batch (rows 6 to 8) was computed by excluding the constrained documents from the test set. In other words, rows 3 to 5 indicate the document clustering accuracies of the CNC model on both the training and the test data, while rows 6 to 8 show the accuracies on the test data only. As expected, the MI values computed from the test data only are slightly lower than those computed from both the training and the test data. Considering the small difference between the two sets of MI values, the CNC model has a very good generalization ability.

[Figure 2: The performance changes when beta varies (normalized mutual information plotted against beta from 5 to 30).]

The coefficient beta for the penalty term in Eq. (12) regulates the degree of enforcement of the user's prior knowledge. The clustering results for 2 clusters as beta changes from 5 to 30 are plotted in Figure 2. As beta increases, the prior knowledge is enforced more and more strongly; since all constraints are correct, the clustering performance keeps improving. Without loss of generality, we set beta to 20 in our experiments.

To reveal the effectiveness of the CNC model as a semi-supervised classification method, we also conducted performance comparisons with the SK-means and TSVM classifiers. We adopted the linear SVM here because many previous research studies suggest that the linear SVM generates excellent document classification results on Reuters as well as many other document corpora [21]. In most semi-supervised classification settings, the classification process usually starts with a handful of training samples, such as 2 or 9 [19, 10]. In our experiments, 2.5%, 5.0%, and 7.5% of the test data, which amount to roughly 2 to 45 documents in most cases, were selected as training samples. All three methods were then tested using the remaining documents in the test run. The results are listed in Table 4. We also calculated the p-values of the one-sided Wilcoxon signed-rank test [18] between CNC and TSVM, and between CNC and SK-means. The one-sided Wilcoxon signed-rank test is a nonparametric paired test that makes no assumption about the underlying distribution of the tested values. Results with a p-value smaller than 0.05 are considered statistically significant.

Table 4: AC of CNC, TSVM, and SK-means on the Reuters document corpus.
Percentage of labelled data   2.5%    5.0%    7.5%   10.0%
CNC                          0.9632  0.9750  0.9771  0.9792
TSVM                         0.9428  0.9673  0.9752  0.9835
p-value with TSVM            0.0041  0.0031  0.0074  0.5892
SK-means                     0.9320  0.9461  0.9559  0.9617
p-value with SK-means        0.0012  0.0062  0.0020  0.0035

For the Newsgroups data set, we randomly selected 100 data samples from each newsgroup and paired the newsgroups to form ten data sets, each with 200 data samples. A certain percentage (2.5%, 5.0%, and 7.5%) of the data, which equals 5, 10, and 15 documents respectively, was selected as training samples. Table 5 presents the experimental results of the three semi-supervised learning methods based on 10 test runs. The Newsgroups data set contains more outliers than the Reuters data, so clustering the Newsgroups data is a more challenging task. CNC outperforms TSVM with fewer than 15 training samples and outperforms SK-means under all three settings; the results are statistically significant.

Table 5: AC of CNC, TSVM, and SK-means on the Newsgroups document corpus.
Percentage of labelled data   2.5%    5.0%    7.5%   10.0%
CNC                          0.9514  0.9618  0.9623  0.9708
TSVM                         0.9483  0.9560  0.9674  0.9813
p-value with TSVM            0.0025  0.0036  0.2850  0.4290
SK-means                     0.6878  0.6893  0.7234  0.7328
p-value with SK-means        0.0009  0.0005  0.0011  0.0018

Our findings can be summarized as follows: (1) as long as constraints are provided, the CNC model always outperforms the traditional K-means and NC methods; (2) the CNC model always outperforms SK-means by a large margin; (3) when the number of training samples is very small, the CNC model significantly outperforms the TSVM (p-value < 0.05), and when the number of training samples grows (7.5% for Newsgroups, 10% for Reuters), CNC performs comparably with the TSVM.

5. CONCLUSION
In this paper, we proposed a constrained spectral clustering (CNC) method to incorporate the user's prior knowledge into document cluster analysis. This method enables users to control the document clustering process to either achieve desired cluster structures or obtain accurate cluster results. We tested our algorithm on both the Reuters-21578 and 20 Newsgroups document corpora, and compared the results with both unsupervised clustering and semi-supervised clustering/classification methods. The experimental evaluations reveal that our proposed CNC model is a very effective semi-supervised document clustering tool, especially when the amount of training samples is very small.

This work also suggests several directions for future research. We would like to form constraints for prior knowledge related to cannot-link relations. We hope to explore new approaches that utilize the prior knowledge more efficiently, to achieve even higher performance improvements with very few constraints. It is also interesting to investigate different objective functions for incorporating the user's constraints.

6. ACKNOWLEDGMENT
The authors want to thank Mei Han for her valuable comments, Yihong Gong for his help with editing, and the reviewers for their constructive comments and suggestions.

7. REFERENCES
[1] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. Proceedings of the 19th International Conference on Machine Learning (ICML-2002), pp. 19-26, Sydney, 2002.
[2] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, pp. 368-374, 1998.
[3] D. Cutting, D. Karger, J. Pederson, and J. Tukey. A cluster-based approach to browsing large document collections. Proceedings of ACM SIGIR, 1992.
[4] P. K. Chan, D. F. Schlag, and J. Y. Zien. Spectral k-way ratio-cut partitioning and clustering. IEEE Transactions on Computer-Aided Design, 13:1088-1096, September 1994.
[5] C. Ding, X. He, H. Zha, M. Gu, and H. Simon. Spectral min-max cut for graph partitioning and data clustering. Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, CA, pp. 107-114, 2001.
[6] D. Eichmann, M. Ruiz, and P. Srinivasan. Cluster-based adaptive and batch filtering. Proceedings of the 7th Text Retrieval Conference, NIST, 2000.
[7] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Press, 1999.
[8] J. Hartigan and M. Wong. A k-means clustering algorithm. Applied Statistics, 28:100-108, 1979.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.
[10] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, pp. 200-209, Morgan Kaufmann, San Francisco, CA.
[11] T. Joachims. Transductive learning via spectral graph partitioning. Proceedings of the International Conference on Machine Learning, pp. 290-297, 2003.
[12] R. S. Michalski and R. E. Stepp. Learning from observation: conceptual clustering. In Machine Learning: An Artificial Intelligence Approach, pp. 331-363, Tioga Publishing Co., Palo Alto, CA, 1983.
[13] T. Mitchell. The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, 1999.
[14] J. L. Neto, A. D. Santos, C. A. A. Kaestner, and A. A. Freitas. Document clustering and text summarization. Proceedings of the 4th International Conference on Practical Applications of Knowledge Discovery and Data Mining, London, 2000.
[15] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86-93, 2000.
[16] S. Siersdorfer and S. Sizov. Restrictive clustering and metaclustering for self-organizing document collections. Proceedings of ACM SIGIR, 2004.
[17] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[18] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80-93.
[19] K. Wagstaff, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. Proceedings of the 18th International Conference on Machine Learning, pp. 577-584, 2001.
[20] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. Proceedings of the 17th International Conference on Machine Learning, pp. 1103-1110, 2001.
[21] Y. Yang and X. Liu. A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, 1999.
[22] S. X. Yu and J. Shi. Grouping with bias. Neural Information Processing Systems, 2001.
[23] S. X. Yu and J. Shi. Segmentation given partial grouping constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), February 2004.
[24] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma. Learning to cluster web search results. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 210-216, 2004.
[25] H. Zha, C. Ding, M. Gu, and H. Simon. Spectral relaxation for k-means clustering. Advances in Neural Information Processing Systems, vol. 14, 2002.
[26] http://svmlight.joachims.org/
[27] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. The Twentieth International Conference on Machine Learning, 2003.