Text Clustering with Extended User Feedback

Yifen Huang, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, Pennsylvania USA (hyifen@cs.cmu.edu)
Tom M. Mitchell, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, Pennsylvania USA (tom.mitchell@cs.cmu.edu)

ABSTRACT
Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering
General Terms: Algorithms, Theory
Keywords: text clustering, user feedback, mixed-initiative learning

1. INTRODUCTION
We as human beings are quite familiar with clustering objects into categories based on features of these objects. For example, a computer user may sort her emails into folders that are personally meaningful because each one represents a particular activity she is involved in, or because they are emails from a particular group of people, etc. For each folder, or cluster, the user may have in mind a rich category description, but assigns objects to these categories based on their surface features (e.g., the words in the email, or recipients in the header). There are many other examples: we may informally cluster news stories into categories such as sports, politics, etc., or we may easily recognize in a supermarket what type of products an aisle contains.

Computer algorithms for clustering are typically cast as fully automated, unsupervised learning algorithms; that is, the algorithm is given only the collection of instances and the surface features that describe each, without any information about the nature of the clusters. Recently, however, a variety of researchers have studied ways of allowing a user to provide limited information to improve clustering quality. One approach is to allow the user to provide cluster labels for some of the instances, indicating which cluster that instance belongs to.
For example, [11][2][8] use labels of this type to form initial cluster descriptions, which are then refined using both the unlabeled and labeled instances. A second type of input information consists of pair-wise constraints among instances [13]. These constraints may assert that two documents must belong to the same cluster without indicating which one it is, or may assert that two documents must belong to different clusters. Various constraint-based methods and distance-based methods have been proposed to use this type of information. See [1] for a short survey of different approaches and also for an approach to integrating distance-based and constraint-based approaches into a probabilistic framework. A third type of additional input involves background knowledge to enrich the set of features that describe each instance. For example, [6] enriches their document representation by using an ontology (WordNet) as background knowledge. A fourth type of extra information, which we are primarily interested in, is information about the key surface features for a particular class, or cluster. For example, [9] uses a few user-supplied keywords per class and a class hierarchy to generate preliminary labels to build an initial text classifier for the class. [10] proposes an interesting technique in which they ask a user to identify interesting words among automatically selected representative words for each class of documents, and then use these user-identified words to re-train the classifier as in [9]. Researchers working on active learning have also studied using feedback about key features. For example, [5] converts a user-recommended feature into a mini-document which is used to help train an SVM classifier. An alternative approach to using this information is proposed by [12], who adjust the SVM weights associated with these key features to a predefined value in binary classification tasks.

We are interested in how to best incorporate user input into automated clustering algorithms, and more generally into mixed-initiative clustering approaches that allow the user and computer to jointly arrive at coherent clusters that capture the categories of interest to the user. Note this goal of discovering clusters of interest to the user is somewhat different from the objective optimized in totally unsupervised clustering algorithms that attempt to maximize some statistical property of the clusters (such as data likelihood, or inter-cluster distance). We are specifically interested in how to incorporate into clustering algorithms the user's emerging understanding about a category1, stimulated by seeing the instances that are clustered together, and by seeing (and editing) summaries of these emerging clusters. A user's understanding about a category may be expressed in a variety of forms, such as by keywords, important person names, other types of entities, and relationships among entities. It may encapsulate a variety of types of information, and it may be difficult for a user to articulate fully their notion of the cluster.

The chief contribution of this paper is to introduce a new probabilistic model for clustering that outperforms standard unsupervised clustering in our experiments, and that also can accommodate a variety of types of user feedback to iteratively refine the clusters. We present experiments in both an email clustering domain, and in a second document clustering domain (20 Newsgroups), showing the performance of this clustering approach.

1 We use the word "cluster" to indicate a set of similar instances grouped together by a clustering algorithm, and the word "category" to indicate a concept in a user's mind which may or may not be reflected by some cluster of instances.
The research we report here is part of our larger research effort to build computer algorithms to automatically infer the key activities, or projects, a user is involved in, given the contents of their workstation (e.g., their emails, files, directories, calendar entries, personal contacts lists, etc.). For example, a user may be involved in activities such as teaching a particular course, participating in a particular committee, hanging out with a particular group of friends, etc. In our previous work [7], we have shown that unsupervised clustering of emails can result in useful descriptions of user activities, such as the one shown in Figure 1. The work we report in this paper is motivated in part by our interest in developing a more mixed-initiative approach to inferring such activity clusters, using both computer analysis of workstation data and user feedback based on examining proposed clusters.

Figure 1: An example output of the activity extractor, which is extracted statistically from unsupervised clustering results.

We will describe our probabilistic model and the associated clustering algorithm in the next section. Section 3 then discusses how several types of user feedback can be incorporated into the clustering algorithm. The experimental setup and evaluation are described in Section 4 and conclusions are presented in Section 5.

2. SPECLUSTERING MODEL

2.1 Separating specific from general topics
We present here a clustering algorithm based on a novel probabilistic model. One commonly used probabilistic model for text clustering is the multinomial naive Bayes model described in [11], which models a document as a vector of words with each word generated independently by a multinomial probability distribution conditioned on the document's class (i.e., conditioned on which cluster it belongs to). Our SpeClustering model also assumes words are generated probabilistically, but differs in an important way from this standard model. In particular, the SpeClustering model assumes that only some of the words in the document are conditioned on the document's cluster, and that other words follow a more general word distribution that is independent of which cluster the document belongs to. To see the intuition behind this model, consider a cluster of emails about skiing. There will be some words (e.g., "snow") that appear in this cluster of emails because the topic is skiing, and there will be other words (e.g., "contact") that appear for reasons independent of the cluster topic. The key difference between the standard model and our SpeClustering model is that our model assumes each document is generated by a mixture of two multinomials: one associated with the document's cluster, and the other shared across all clusters. As we show below, our more elaborate SpeClustering model can lead to improved accuracy when used for automatic clustering, and it also provides a formalism that can easily accommodate several important types of user feedback to support mixed-initiative clustering.

To construct the SpeClustering model, we extend the standard multinomial model in two ways. The first modification is to add a topic variable G that is intended to capture general topics not related to the cluster. The second modification is to introduce a hidden boolean variable, X, associated with each word O in each document. If X = 1, the observation O is generated by the cluster-specific topic S, and if X = 0, the observation O is generated by a general topic G. Throughout this paper we simplify the model by assuming there is only one general topic instead of multiple topics, so the value of G is fixed at G = g. Figure 2 shows the graphical model representation of the model. Here the outer rectangle (or plate) is duplicated for each of the D documents, and the inner plate is duplicated for each of the N observations O and associated variables X. Note the general topic G is constant across all documents and words, whereas the cluster topic S is different for each document.

Figure 2: Graphical representation of the SpeClustering model. S is a variable representing the cluster associated with a document, O represents an observed word in a document, and X is a boolean variable that indicates whether word O is generated conditioned on the cluster S or whether it is generated according to a cluster-independent general distribution of words G.
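To make this generative story concrete, here is a minimal Python sampler for a single document under made-up toy parameters. It is an illustration only, not the authors' code; the names theta, lam, phi, and psi correspond to the four parameter sets defined immediately below.

    import random

    clusters = ["skiing", "courses"]          # values of S
    vocab = ["snow", "lift", "exam", "contact", "meeting"]

    theta = {"skiing": 0.5, "courses": 0.5}   # P(S = c)
    lam = {"skiing": 0.7, "courses": 0.6}     # P(X = 1 | S = c)
    phi = {                                   # P(O = v | S = c), cluster-specific
        "skiing":  {"snow": 0.5, "lift": 0.4, "exam": 0.02, "contact": 0.04, "meeting": 0.04},
        "courses": {"snow": 0.02, "lift": 0.02, "exam": 0.6, "contact": 0.06, "meeting": 0.3},
    }
    psi = {"snow": 0.1, "lift": 0.1, "exam": 0.1, "contact": 0.4, "meeting": 0.3}  # P(O = v | G = g)

    def sample_document(n_words):
        """Sample one document: pick a cluster, then for each word flip X to
        decide whether it comes from the cluster topic or the general topic."""
        c = random.choices(clusters, weights=[theta[k] for k in clusters])[0]
        words = []
        for _ in range(n_words):
            dist = phi[c] if random.random() < lam[c] else psi   # X = 1 vs. X = 0
            words.append(random.choices(vocab, weights=[dist[v] for v in vocab])[0])
        return c, words

    print(sample_document(10))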
The SpeClustering model has four sets of parameters:

$\theta_c = P(S = c)$
$\lambda_c = P(X = 1 \mid S = c)$
$\phi_{cv} = P(O = v \mid S = c)$
$\psi_{gv} = P(O = v \mid G = g)$

where $c \in \{1, 2, ..., |S|\}$, $g \in \{1\}$ for the simplified case, and $v \in \{1, 2, ..., |O|\}$.

Given a corpus C that contains D instances, $C = \{d_1, d_2, ..., d_D\}$, where $d_i$ is represented as a vector of observations $\{o_{ij}; j \in \{1, 2, ..., n_i\}\}$, we use the notation $s_i$ to indicate the value of the hidden S variable for instance $d_i$, and $x_{ij}$ to indicate the value of the hidden X variable associated with observation $o_{ij}$. The corpus likelihood of C given $\Theta$ is defined as follows:

$P(C \mid \Theta) = \prod_{i=1}^{D} \sum_{s_i=1}^{|S|} P(s_i) \prod_{j=1}^{n_i} \left[ P(x_{ij}=1 \mid s_i) P(o_{ij} \mid s_i) + P(x_{ij}=0 \mid s_i) P(o_{ij} \mid g) \right]$

which can be written in terms of the model parameters as follows:

$P(C \mid \Theta) = \prod_{i=1}^{D} \sum_{s_i=1}^{|S|} \theta_{s_i} \prod_{j=1}^{n_i} \left[ \lambda_{s_i} \phi_{s_i o_{ij}} + (1 - \lambda_{s_i}) \psi_{g o_{ij}} \right]$

Note the probability $P(X = 1 \mid S = c, O = f; \Theta)$, which can be derived from the model parameters, describes the probability that any particular feature f is generated by a particular cluster c, as opposed to the general topic g.

2.2 Learning Clusters with the SpeClustering Model
In the most general case we are interested in unsupervised clustering of documents given just the observed features O of a set of documents, where the values for the S and X variables are unobserved. Because of the existence of unobserved variables, we use an EM process [3] for parameter estimation. The EM algorithm is commonly applied to find a (local) maximum-likelihood estimate of the parameters in situations when the observable data is incomplete and the model depends on unobserved latent variables. Given $\mathcal{X}$ and $\mathcal{Y}$ as the incomplete and complete data, the algorithm iterates through two steps: in the E step, we evaluate $Q(\Theta \mid \Theta^t) = E[\log P(\mathcal{Y} \mid \Theta) \mid \mathcal{X}, \Theta^t]$, and in the M step, we obtain a new estimate of the parameters, $\Theta^{t+1} = \arg\max_{\Theta} Q(\Theta \mid \Theta^t)$. In our SpeClustering model, the incomplete data is $\mathcal{X} = \{o_{ij} : i \in \{1,...,D\}, j \in \{1,...,n_i\}\}$ and the complete data is $\mathcal{Y} = \{s_i, x_{ij}, o_{ij} : i \in \{1,...,D\}, j \in \{1,...,n_i\}\}$.

The following quantities are computed in the E step:

$\alpha_i^t(c) = P(s_i = c \mid d_i; \Theta^t) = \frac{\theta_c^t \prod_{j=1}^{n_i} [\lambda_c^t \phi_{c o_{ij}}^t + (1-\lambda_c^t) \psi_{g o_{ij}}^t]}{\sum_{k=1}^{|S|} \theta_k^t \prod_{j=1}^{n_i} [\lambda_k^t \phi_{k o_{ij}}^t + (1-\lambda_k^t) \psi_{g o_{ij}}^t]}$   (1)

$\beta_{ij}^t(c) = P(x_{ij} = 1 \mid s_i = c, o_{ij}; \Theta^t) = \frac{\lambda_c^t \phi_{c o_{ij}}^t}{\lambda_c^t \phi_{c o_{ij}}^t + (1-\lambda_c^t) \psi_{g o_{ij}}^t}$   (2)

The exact estimate for each parameter in the M step is listed below:

$\theta_c^{t+1} = \frac{\sum_{i=1}^{D} \alpha_i^t(c)}{D}$

$\lambda_c^{t+1} = \frac{\sum_{i=1}^{D} \alpha_i^t(c) \sum_{j=1}^{n_i} \beta_{ij}^t(c)}{\sum_{i=1}^{D} \alpha_i^t(c) \cdot n_i}$

$\phi_{cv}^{t+1} = \frac{\sum_{i=1}^{D} \alpha_i^t(c) \sum_{j=1}^{n_i} \beta_{ij}^t(c)\, \mathbb{1}(o_{ij} = v)}{\sum_{i=1}^{D} \alpha_i^t(c) \sum_{j=1}^{n_i} \beta_{ij}^t(c)}$

$\psi_{gv}^{t+1} = \frac{\sum_{i=1}^{D} \sum_{k=1}^{|S|} \alpha_i^t(k) \sum_{j=1}^{n_i} (1 - \beta_{ij}^t(k))\, \mathbb{1}(o_{ij} = v)}{\sum_{i=1}^{D} \sum_{k=1}^{|S|} \alpha_i^t(k) \sum_{j=1}^{n_i} (1 - \beta_{ij}^t(k))}$

By iterating through the E step and M step, the likelihood converges to a (local) maximum and the values of the parameters stabilize.
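To make the E and M steps concrete, here is a compact NumPy sketch of one EM iteration for the single-feature-type model, assuming documents are represented as a dense document-by-word count matrix. This is a sketch of the updates above, not the authors' implementation.

    import numpy as np

    def em_step(counts, theta, lam, phi, psi):
        """One EM iteration of the SpeClustering model.
        counts: (D, V) matrix, counts[i, v] = occurrences of word v in doc i.
        theta: (C,) cluster priors; lam: (C,) P(X=1|S=c);
        phi: (C, V) cluster-specific word dists; psi: (V,) general word dist."""
        D, V = counts.shape

        # Per-word mixture probability under each cluster: lam*phi + (1-lam)*psi
        mix = lam[:, None] * phi + (1 - lam)[:, None] * psi[None, :]   # (C, V)

        # E step, Eq 1: posterior over clusters (log domain for stability).
        log_lik = counts @ np.log(mix).T + np.log(theta)[None, :]      # (D, C)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        alpha = np.exp(log_lik)
        alpha /= alpha.sum(axis=1, keepdims=True)                      # (D, C)

        # E step, Eq 2: P(x=1 | s=c, o=v) for every cluster/word pair.
        beta = lam[:, None] * phi / mix                                # (C, V)

        # M step: expected counts of cluster-specific vs. general word occurrences.
        exp_word = alpha.T @ counts                                    # (C, V)
        exp_spec = exp_word * beta
        exp_gen = exp_word * (1 - beta)

        theta_new = alpha.sum(axis=0) / D
        lam_new = exp_spec.sum(axis=1) / (alpha.T @ counts.sum(axis=1))
        phi_new = exp_spec / exp_spec.sum(axis=1, keepdims=True)
        psi_new = exp_gen.sum(axis=0) / exp_gen.sum()
        return theta_new, lam_new, phi_new, psi_new

The log-domain normalization in the E step is a standard numerical-stability device, not part of the model itself.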
2.3 Extension to multiple types of features
In some cases instances may be described by multiple types of features. For example, when clustering emails we might describe each email by the set of words in its body, plus the set of email addresses the email is sent to. If there are multiple types of features in an instance, we can extend the SpeClustering model. Figure 3 shows the extended model with two feature types. The model adds one new block {Y, Q} for the introduction of a new feature type; {Y, Q} is identical and parallel to {X, O}. In the activity-discovery-via-emails task, we can apply this model to represent an activity in terms of both its key words and the primary participants of the activity.

Figure 3: Graphical representation of the SpeClustering model with two feature types, where O and Q are observations with different feature types, and X and Y are boolean variables deciding whether their respective observation is generated from the specific topic S or the general topic G.

Parameter estimation in the extended SpeClustering model is nearly identical to that described in Section 2.2. The only exception is a change to the posterior probability estimate in Eq 1. The new posterior probability estimate in the extended model combines generative probabilities from multiple feature types. Eq 3 shows the estimate for two feature types, where $q_{ih}$ denotes the h-th observation of the second feature type in instance $d_i$, $m_i$ is the number of such observations, and primed parameters are the second feature type's counterparts of $\lambda$, $\phi$, $\psi$:

$\alpha_i^t(c) = P(s_i = c \mid d_i; \Theta^t) = \frac{\theta_c^t \prod_{j=1}^{n_i} [\lambda_c^t \phi_{c o_{ij}}^t + (1-\lambda_c^t) \psi_{g o_{ij}}^t] \prod_{h=1}^{m_i} [\lambda_c'^t \phi_{c q_{ih}}'^t + (1-\lambda_c'^t) \psi_{g q_{ih}}'^t]}{\sum_{k=1}^{|S|} \theta_k^t \prod_{j=1}^{n_i} [\lambda_k^t \phi_{k o_{ij}}^t + (1-\lambda_k^t) \psi_{g o_{ij}}^t] \prod_{h=1}^{m_i} [\lambda_k'^t \phi_{k q_{ih}}'^t + (1-\lambda_k'^t) \psi_{g q_{ih}}'^t]}$   (3)

3. MODEL ADAPTATION ACCORDING TO USER'S FEEDBACK
As discussed earlier, we are particularly interested in allowing extended forms of user feedback to help direct the clustering process. In this section we discuss how several types of user feedback are incorporated to guide the clustering algorithm. We describe each feedback type in terms of the task of clustering emails to discover descriptions of a user's activities. The types of user feedback allowed are:

1. Remove an activity cluster
2. An email belongs (or does not) to its assigned cluster
3. A keyword belongs (or does not) to its assigned cluster
4. A person belongs (or does not) to its assigned cluster
5. A short text description T for an activity cluster

The posterior probabilities in the SpeClustering model turn out to be closely related to the above types of feedback. To be more specific, type 1 and 2 feedback are related to Eq 3, and type 3, 4, and 5 feedback are related to Eq 2.

There are two methods to initialize the SpeClustering model with user feedback. The simple method inherits the previous clustering results for which the user gave her feedback. When feedback includes removing a cluster S = c, we reset the initial value of $P(s_i \mid d_i; \Theta^t)$ for each $d_i$ with $s_i = c$ by distributing the probability mass uniformly among all clusters but halving the probability assigned to cluster c. Alternatively, the joint method uses multiple feedback types to initialize the model. We first select several documents that have the highest cosine similarity with confirmed documents and keywords (where we treat keywords as a mini-document) and associate them with the current clusters. We then search for a small set of similar documents that maximize inter-cluster distances and replace any cluster that is removed in the feedback.

During each EM iteration while training the SpeClustering model, we perform type 2 to 4 adjustments. For type 2 feedback, we adjust the value of $P(s_i = c \mid d_i; \Theta^t)$ to be one if the email($d_i$)-to-cluster(c) bound is confirmed by the user, or set it to zero if the bound is disapproved by the user. A proper adjustment to normalize the posterior probabilities $\{P(s_i = c' \mid d_i; \Theta^t) : c' \neq c\}$ is also required in this case. For type 3 and 4 feedback, we adjust the value of $P(x_{ij} = 1 \mid s_i = c, o_{ij} = v; \Theta^t)$ to be one if the key-feature(v)-to-cluster(c) bound is confirmed by the user, or set it to zero if the bound is disapproved by the user. For type 5 feedback, we tokenize the description T and make each token of T a confirmed keyword as in type 3 feedback. Figure 4 summarizes this Mixed-Initiative-Clustering process, which integrates user feedback into the clustering process; a code sketch of the inner feedback loop follows the figure.

Algorithm: Mixed-Initiative-Clustering
Input: Corpus C with D instances.
Output: A list of activity clusters A, where each activity cluster is described by its top K features for each feature type.
Method:
1. Generate an initial model $\Theta_{ini}$ and summarization of clusters $A_{ini}$ with the top K features of each feature type. Set $\Theta^t = \Theta_{ini}$, $A^t = A_{ini}$, $F = \{\}$.
2. Add the user's feedback regarding $A^t$ into F.
3. $(\Theta^{t+1}, A^{t+1})$ = SpeClustering-with-Feedback(C, $\Theta^t$, F).
4. $\Theta^t = \Theta^{t+1}$, $A^t = A^{t+1}$.
5. Repeat steps 2 to 4 until the user is satisfied.

Algorithm: SpeClustering-with-Feedback
Input: Corpus C with D instances; $\Theta^t$, the current model; F, the collection of the user's feedback.
Output: $\Theta^{t+1}$, the model after adaptation according to the user's feedback; $A^{t+1}$, the new summarization of clusters according to $\Theta^{t+1}$.
Method:
1. Estimate the posterior probabilities $P^t$ of Eq 3 and Eq 2 given C and $\Theta^t$.
2. Adjust $P^t$ according to F to obtain $P^t_{adj}$.
3. Re-estimate the model parameters using $P^t_{adj}$ to obtain $\Theta^t_{adj}$.
4. Set $\Theta^t = \Theta^t_{adj}$; repeat steps 1 to 3 until the model converges.
5. $\Theta^{t+1} = \Theta^t_{adj}$. Generate $A^{t+1}$ according to $\Theta^{t+1}$ and F.

Figure 4: The algorithm for mixed-initiative clustering.
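As an illustration of step 2 of SpeClustering-with-Feedback, the following sketch clamps the E-step posteriors according to user feedback before the parameters are re-estimated. The dictionary-based feedback encoding is our own assumption for illustration, not a published interface.

    import numpy as np

    def adjust_posteriors(alpha, beta, doc_feedback, feature_feedback):
        """Clamp E-step posteriors to honor user feedback.
        alpha: (D, C) document-cluster posteriors (Eq 1 or Eq 3).
        beta:  (C, V) feature-specificity posteriors (Eq 2).
        doc_feedback: dict (doc i, cluster c) -> True (confirmed) or
                      False (disapproved)    [type 2 feedback]
        feature_feedback: dict (cluster c, feature v) -> True/False
                      [type 3/4 feedback; type 5 reduces to type 3]."""
        alpha = alpha.copy()
        beta = beta.copy()
        for (i, c), confirmed in doc_feedback.items():
            if confirmed:
                alpha[i, :] = 0.0
                alpha[i, c] = 1.0            # P(s_i = c | d_i) := 1
            else:
                alpha[i, c] = 0.0            # P(s_i = c | d_i) := 0
                total = alpha[i, :].sum()
                if total > 0:                # renormalize remaining clusters
                    alpha[i, :] /= total
        for (c, v), confirmed in feature_feedback.items():
            beta[c, v] = 1.0 if confirmed else 0.0
        return alpha, beta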
3.1 Connection to Supervised Classification
We have described the details of the SpeClustering model. However, the model is not restricted to clustering; it can also be applied to supervised classification tasks. The difference in classification is that the topic variable S is no longer a hidden variable. We can treat a classification task as one in which all the type 2 user feedback is known, and replace the estimate of the posterior probabilities $P(s_i = c \mid d_i; \Theta^t)$ with the true value specified by the instance label.

4. EXPERIMENTS

4.1 Datasets
To test the SpeClustering algorithm we used two data sets. The first is an email dataset (EmailYH) from one of the authors that contains 623 emails. This dataset had previously been sorted into 11 folders according to the user's activities. It contains 6684 unique words and 135 individual people after pre-processing2. The second data set is the publicly available 20-Newsgroups collection. This data set contains text messages from 20 different Usenet newsgroups, with 1000 messages harvested from each newsgroup. We derived three datasets according to [1]. The first, News-Similar-3, consists of messages from 3 similar newsgroups (comp.graphics, comp.os.ms-windows.misc, comp.windows.x) where cross-posting occurs often between these three newsgroups. News-Related-3 consists of messages from 3 related newsgroups (talk.politics.misc, talk.politics.guns and talk.politics.mideast). News-Different-3 contains 3 newsgroups of quite different topics (alt.atheism, rec.sport.baseball, and sci.space). We only use the text part of messages in the three newsgroup datasets because a reviewer won't have the knowledge needed to decide which author is the key-person with regard to which newsgroup. For the text part, we applied the same pre-processing used for EmailYH; a sketch of this pipeline appears below. There are 3000 messages in each of these datasets. News-Different-3 contains 8465 unique words, News-Related-3 contains 9998 unique words, and News-Similar-3 has 10037 unique words.

2 The pre-processing for words includes stemming, stop-word removal, and removal of words that appear only once in the dataset. The pre-processing for people consists of reference reconciliation over email senders and recipients, and removal of people that are involved in only one email.
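A minimal sketch of the word pre-processing that footnote 2 describes, assuming NLTK's Porter stemmer and an illustrative stop-word list (the authors' actual tool chain is not specified):

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer   # assumes NLTK is available

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # illustrative subset
    stemmer = PorterStemmer()

    def preprocess(raw_docs):
        """Tokenize, lowercase, stem, drop stop words, then drop words
        that appear only once in the whole dataset (cf. footnote 2)."""
        tokenized = []
        for text in raw_docs:
            tokens = [stemmer.stem(w) for w in re.findall(r"[a-z]+", text.lower())
                      if w not in STOPWORDS]
            tokenized.append(tokens)
        counts = Counter(w for doc in tokenized for w in doc)
        return [[w for w in doc if counts[w] > 1] for doc in tokenized]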
4.2 Measurement for Cluster Evaluation
We use two measurements to estimate cluster quality: folder-reconstruction accuracy, and normalized mutual information (NMI) [4]. In order to calculate the folder-reconstruction accuracy, we search through all possible one-to-one alignments A of cluster indices $I_c$ to folder indices $I_f$ in order to find the alignment resulting in optimal accuracy, then report the accuracy under this optimal alignment:

$Acc = \max_{A} \frac{\sum_{i=1}^{D} \mathbb{1}(A(s_i) = f_i)}{D}$, where A ranges over 1-to-1 mappings from $I_c$ to $I_f$.   (4)

The normalized mutual information measurement is defined as Eq 5, where I(S; F) is the mutual information between the cluster assignment S and the folder labels F, H(S) is the entropy of S, and H(F) is the entropy of F. It measures the shared information between S and F:

$NMI = \frac{I(S; F)}{(H(S) + H(F))/2}$   (5)

These two measurements are correlated but show different aspects of clustering performance. Accuracy calculates the ratio of the major chunks of the clusters matched to their reference folders; NMI measures the similarity between the cluster partition and the reference partition.
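Both measures can be computed as follows, assuming scipy and scikit-learn as stand-ins for whatever the authors used; the Hungarian method in linear_sum_assignment finds the optimal one-to-one alignment of Eq 4 without enumerating all permutations.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import normalized_mutual_info_score

    def folder_reconstruction_accuracy(clusters, folders):
        """Eq 4: accuracy under the best 1-to-1 cluster-to-folder alignment."""
        clusters = np.asarray(clusters)
        folders = np.asarray(folders)
        # contingency[c, f] = number of docs in cluster c with folder label f
        contingency = np.zeros((clusters.max() + 1, folders.max() + 1), dtype=int)
        for c, f in zip(clusters, folders):
            contingency[c, f] += 1
        row, col = linear_sum_assignment(-contingency)   # maximize matched docs
        return contingency[row, col].sum() / len(clusters)

    def nmi(clusters, folders):
        """Eq 5: mutual information normalized by the average entropy."""
        return normalized_mutual_info_score(folders, clusters)

    # Example: two docs correctly grouped together, one cluster split.
    print(folder_reconstruction_accuracy([0, 0, 1, 1], [1, 1, 0, 1]))  # 0.75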
4.3 Results and Discussion
To experimentally study the SpeClustering model and algorithms, we consider three distinct algorithms. First, we consider the standard multinomial naive Bayes text clustering algorithm [11] as a baseline representing a typical probabilistic approach to text clustering. We modified this baseline approach by allowing it to search for a good cluster initialization and to avoid situations in which one cluster gets eliminated during the EM iterations [7]. Two versions of the SpeClustering algorithm are tested. The first version is the original SpeClustering algorithm as described in Section 2. The second version, SpeClustering-bound, adds range constraints on the parameter values $\lambda$: for word features, the range is [0.1, 0.4] and for person features, the range is [0.6, 0.9]. The reason for introducing these range constraints is to avoid situations where some values of $\lambda$ converge to 1 or 0. This is undesirable because the value of $\lambda$ reflects the percentage of specific features (X = 1) occurring over all observations. Both SpeClustering algorithms are initialized using the output from the baseline naive Bayes clustering.

4.3.1 Autonomous Clustering
First we compared our SpeClustering approach to the naive Bayes baseline in fully autonomous clustering without user feedback. We made 50 individual runs on the EmailYH dataset and 20 runs each on News-Similar-3, News-Related-3, and News-Different-3. Table 1 shows the average accuracy and NMI results for the different datasets and the three clustering algorithms. Notice that on all datasets, the SpeClustering algorithm performs better than the naive Bayes algorithm, and the SpeClustering-bound model performs better than SpeClustering. The naive Bayes clustering results are used to initialize the associated SpeClustering and SpeClustering-bound runs, so the performance gains are directly due to the difference between the SpeClustering probabilistic model and the naive Bayes model. When we examined the details of individual runs, we found that every one of the runs resulted in SpeClustering-bound outperforming naive Bayes in terms of the NMI measure, and that in the vast majority of these runs it also outperformed naive Bayes in terms of the accuracy measure.

Table 1: Clustering results for different datasets and different clustering algorithms. SpeCluster and SpeC-bound are shorthands for the SpeClustering model with unbounded and bounded parameter values. Both versions of the SpeClustering model outperform the multinomial naive Bayes model, and the bounded SpeClustering model achieves the best performance.

dataset    | method      | Accuracy (%)  | NMI (%)
EmailYH    | naive Bayes | 48.44 ± 7.01  | 48.02 ± 3.93
EmailYH    | SpeCluster  | 52.28 ± 8.61  | 53.25 ± 5.65
EmailYH    | SpeC-bound  | 53.98 ± 8.04  | 56.25 ± 4.90
NewsSim-3  | naive Bayes | 46.31 ± 7.21  |  9.86 ± 7.34
NewsSim-3  | SpeCluster  | 51.38 ± 6.33  | 15.80 ± 6.82
NewsSim-3  | SpeC-bound  | 51.98 ± 5.91  | 16.46 ± 6.27
NewsRel-3  | naive Bayes | 60.18 ± 10.64 | 34.36 ± 10.58
NewsRel-3  | SpeCluster  | 60.61 ± 11.08 | 36.06 ± 10.71
NewsRel-3  | SpeC-bound  | 61.14 ± 11.41 | 36.92 ± 11.04
NewsDiff-3 | naive Bayes | 91.24 ± 13.45 | 79.76 ± 14.56
NewsDiff-3 | SpeCluster  | 93.80 ± 11.49 | 83.57 ± 14.27
NewsDiff-3 | SpeC-bound  | 96.52 ± 6.47  | 87.79 ± 11.56

4.3.2 Clustering with User Feedback
We next studied the impact of user feedback on the bounded SpeClustering model. In particular, we chose the 5 clustering results using the multinomial naive Bayes model with the best log-likelihood among the 50 runs on EmailYH and presented each of these to the user. We also chose the single best run from the 20 runs on each of News-Different-3, News-Related-3, and News-Similar-3. The user gave feedback using the interface shown in Figure 5. The top left panel shows a list of documents that are clustered into the selected cluster label, the top right panel shows 5 key-persons of the cluster, and the bottom right panel shows 20 keywords of the cluster. The keywords and key-persons of the cluster are selected using a Chi-squared measurement [14]; a sketch of this scoring appears below. When a user clicks on a document in the document list, the content of the document shows in the bottom left panel. The user can give the various types of feedback described in Section 3, and the interface displays the feedback the user has entered so far. The user can also go back and forth to correct conflicting assumptions she has made, to achieve consistent cluster interpretations.

Figure 5: The user interface for feedback gathering. It displays a list of documents, keywords and key-persons for a selected cluster. A reviewer can decide (1) to keep the cluster or not, (2) confirm or remove keywords or key-persons, (3) confirm or remove documents, (4) give a short description of the cluster. The reviewer can also go back and forth between clusters to make her feedback consistent.
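The following sketch shows chi-squared feature scoring in the spirit of [14], under our assumption of binary per-document feature presence (the exact variant used by the interface is not specified):

    import numpy as np

    def chi_squared_scores(presence, assignments, cluster):
        """Score each feature's association with one cluster using the
        2x2 chi-squared statistic over feature presence/absence.
        presence: (D, V) boolean matrix, doc contains feature or not.
        assignments: (D,) hard cluster assignment per document."""
        in_c = (assignments == cluster)
        n = float(presence.shape[0])
        a = presence[in_c].sum(axis=0).astype(float)    # present, in cluster
        b = presence[~in_c].sum(axis=0).astype(float)   # present, elsewhere
        c = in_c.sum() - a                              # absent, in cluster
        d = (~in_c).sum() - b                           # absent, elsewhere
        num = n * (a * d - c * b) ** 2
        den = (a + c) * (b + d) * (a + b) * (c + d)
        return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)

Ranking features by this score (e.g., np.argsort(-scores)[:20]) would populate the keyword and key-person panels.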
An interesting observation we found is that displaying keywords and key-persons tremendously helps the user make judgements about a cluster. In fact, deciding the meaning of a large cluster based only on examining the documents is extremely difficult. A reviewer tends to decide based on the first several documents she goes through, even when the cluster contains hundreds of documents, and the biased decision often causes conflicts with later clusters. The reviewer usually chooses to remove a cluster if the keywords and key-persons don't show any consistency and are not meaningful to the user, or if the documents in the cluster are a hodgepodge from several categories. If the keywords or key-persons make sense to the user, the user gives feedback about document-cluster associations according to these meanings. We don't put constraints on how the reviewer gives feedback, so the reviewer can make decisions freely based on how she perceives the clustering results, and gives feedback using her own interpretation of the results. This way of collecting feedback may result in a situation where the meaning in the reviewer's mind doesn't match the majority of documents associated with the cluster, because the reviewer rationalizes clusters mostly according to key features. Keyword selection favors words that occur in the cluster and don't appear in other clusters, so if a category contains many documents and gets spread out over several clusters, then even when the majority of documents in a cluster belong to that category, the keyword selection may give low scores to words belonging to that category, because those words also appear in other clusters.

We use the following notation to indicate the various feedback types:

· CR: cluster removal
· PP: document-to-cluster association
· WX: keyword-to-cluster association
· HX: keyperson-to-cluster association

Table 2 shows how many entries of the different feedback types the reviewer enters for each selected run. The reviewer spends about 15 minutes to finish one run from the EmailYH dataset and 5-10 minutes to finish one run from the newsgroup datasets.

Table 2: Entry numbers of different feedback types for the 5 selected naive Bayes runs on EmailYH and the selected newsgroup runs.

run    | doc # | CR | PP | WX | HX
Email1 | 623   | 3  | 99 | 37 | 30
Email2 | 623   | 3  | 73 | 35 | 31
Email3 | 623   | 4  | 92 | 48 | 26
Email4 | 623   | 7  | 32 | 28 | 15
Email5 | 623   | 4  | 91 | 43 | 28
Sim1   | 3000  | 2  | 39 | 9  | -
Rel1   | 3000  | 1  | 29 | 20 | -
Diff1  | 3000  | 0  | 16 | 39 | -

We ran the SpeClustering-bound algorithm with user feedback and compared the results to the naive Bayes baseline and to the SpeClustering-bound algorithm without feedback. The difference between SpeClustering with and without feedback is the parameter adjustment described in Section 3. We used the simple initialization method on the EmailYH dataset in order to break the feedback down into single types. Figure 6 shows the results using just one type of feedback on the 5 selected runs from the EmailYH dataset. The CR feedback is independent of the other types of feedback, and all other types involve feedback only from clusters that are not removed. All 5 runs with CR or PP feedback, 4 runs with WX feedback, and 3 runs with HX feedback outperform both the naive Bayes baseline and SpeClustering-bound without feedback.

Figure 6: Performance (accuracy and NMI) of using single feedback types (CR, PP, WX and HX) on the EmailYH dataset. SpeC-bound is the SpeClustering-bound model without feedback. The SpeClustering-bound model with one type of feedback outperforms naive Bayes and SpeClustering-bound without feedback in 17 out of 20 runs.

Figure 7 shows the results using combinations of feedback types. The user's feedback gives huge improvements in all runs (a 19.55% average accuracy improvement from the naive Bayes results to SpeClustering-bound with full feedback).

Figure 7: Performance (accuracy and NMI) of using combinations of feedback types on the EmailYH dataset. SpeC-bound is the SpeClustering-bound model without feedback. User feedback gives huge improvements in all runs.

SpeClustering-bound with full feedback performs best in 4 out of 5 runs.
Sp eClustering-b ound with full feedback p erforms 418 Figure 5: The user interface for feedback gathering. It displays a list of do cuments, keywords and key-persons for a selected cluster. A reviewer can decide (1) to keep the cluster or not, (2) confirm or remove keywords or key-persons (3) confirm or remove do cuments, (4) give a short description about the cluster. The reviewer can also go back and forth between clusters to make her feedback consistent. b est in 4 out of 5 runs. In the remaining one run, CR+PP feedback p erforms b est. The quantity of PP feedback is ab out 1/7th to 1/9th to the whole dataset and even higher if we exclude documents in removed clusters. The numb er of WX+HX feedback are fewer than PP feedback in these runs. However, CR+WX+HX p erforms b etter than CR+PP in 2 runs, which shows that meanings of clusters gives comparable information like document-cluster association. More comp ellingly, it is also much easier to get CR+WX+HX feedback than CR+PP in terms of time efficiency. In [12], they measure the time sp end on lab eling a document or a feature, and they find a p erson only need 1/5th of time to lab el a feature compared to the time to lab el a document. For the 3 newsgroup datasets, the ratio of the amount of feedback to the corpus size is very small. In this case, the inheritance of old results which is noisy in the simple initialization overwhelms the training process so we used the joint initialization method to remedy the problem. The user feedback is quite different across these three runs. For the selected run of News-Similar-3, the naive Bayes results are extremely noisy and the cluster summarization is hardly recognizable by the reviewer. It turns out the feedback contains the removal of two out of three clusters and the reason that one is kept is b ecause some keywords weakly indicates the meaning of one newsgroup, but the documents in the remaining cluster contain huge chunks from each newsgroup. For the selected run of NewsRelated-3, talk.p olitics.guns and talk.p olitics.mideast are referred to two remaining clusters while talk.p olitics.misc has no reference due to the removal of the last cluster, which the reviewer cannot figure out its meaning. The cluster summarization is noisy but comprehensible, so the reviewer can make p ositive and negative feedback easily. For NewsDifferent-3, the baseline accuracy is very high so most feedback is p ositive ab out the automatically generated summarization. Figure 8 shows exp erimental results from user feedback on one selected run from each newsgroup dataset. It is difficult to improve on the already accurate News-Different-3 run. Incorp orating feedback gives no significant improvement on the selected News-Similar-3 runs whose feedback is based on extremely noisy clusters and a user is barely able to associate meaningful criterion to any cluster. However, one sees huge improvement from using feedback on the noisy but still meaningful cluster results. The accuracy of the selected News-Related-3 run jumps from 63.23% to 81.07%. 5. CONCLUSIONS AND FUTURE WORK In this pap er, we focus on the problem of how to cluster text documents based on the meanings of categories a user understands or wants. 
Often the meanings of clusters become clear to a user only after examining their descriptions and providing feedback to explore the space of possible clusterings. Our solution to this problem involves three components. First, we propose a new SpeClustering model that separates the features of a document that are specific to a cluster from other general features that are unrelated to the cluster's semantics. The second component is a method to collect user feedback about the meanings of the clusters. We present an interface that enables a user to browse through cluster results and provide several types of feedback. The process requires the user's understanding of the desired categories, and her judgement about which cluster is associated with the meaning of which category. The third component is an algorithm for integrating user feedback with the SpeClustering model. The structure of the SpeClustering model provides a natural way to adjust parameters according to a variety of types of user feedback.

Our experimental results show that our unsupervised SpeClustering algorithm outperforms the commonly used multinomial naive Bayes clustering algorithm for both of the text datasets we considered. Furthermore, when provided with user feedback, the SpeClustering model gains significant improvement on a personal email dataset, and on the newsgroup datasets when the clustering results are noisy but meaningful. Our approach combines the advantage of the machine's computational power to analyze huge amounts of data with the advantages of a human's understanding of the categories of interest. The results indicate that cooperation between computers and human beings is a promising direction for future work. There are many future challenges, such as using active learning principles to optimize the summarization of a cluster, and building more sophisticated models to allow more natural types of user feedback.

Acknowledgments
We thank Sophie Wang for useful discussions and contributions to earlier versions of the email clustering algorithm. This work was supported in part by Darpa contract NBCD030010, as part of the Personal Assistants that Learn project.

6. REFERENCES
[1] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD-04, 2004.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
[3] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm.
In Journal of the Royal Statistical Society, volume 39 of B, pages 1-38, 1977.
[4] B. Dom. An information-theoretic external cluster-validity measure. Technical Report RJ 10219, IBM, 2001.
[5] S. Godbole, A. Harpale, S. Sarawagi, and S. Chakrabarti. Document classification through interactive supervision of document and term labels. In PKDD-04, 2004.
[6] A. Hotho, S. Staab, and G. Stumme. Text clustering based on background knowledge. Technical Report 425, University of Karlsruhe, Institute AIFB, 2003.
[7] Y. Huang, D. Govindaraju, T. Mitchell, V. R. Carvalho, and W. Cohen. Inferring ongoing activities of workstation users by clustering email. In First Conference on Email and Anti-Spam, 2004.
[8] T. Joachims. Transductive inference for text classification using support vector machines. In ICML-99, 1999.
[9] R. Jones, A. McCallum, K. Nigam, and E. Riloff. Bootstrapping for text learning tasks. In IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, 1999.
[10] B. Liu, X. Li, W. S. Lee, and P. S. Yu. Text classification by labeling words. In AAAI-04, 2004.
[11] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. In AAAI-98, 1998.
[12] H. Raghavan, O. Madani, and R. Jones. Interactive feature selection. In IJCAI-05, 2005.
[13] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In ICML-01, 2001.
[14] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML-97, 1997.