Refining Hierarchical Taxonomy Structure Via Semi-supervised Learning

Ruizhang Huang, Zhigang Zhang, and Wai Lam
Department of Systems Engineering & Engineering Management
The Chinese University of Hong Kong, Shatin, Hong Kong
rzhuang@se.cuhk.edu.hk, zgzhang@se.cuhk.edu.hk, wlam@se.cuhk.edu.hk

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Design, Human Factors
Keywords: Semi-supervised Learning, Metric Learning

The work described in this paper is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK 4179/03E and CUHK 4193/04E) and CUHK Strategic Grant (No: 4410001). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

1. INTRODUCTION

Maintaining a hierarchical taxonomy of text documents is of great interest for many applications. A hierarchical taxonomy provides a well-structured organization of documents at different levels of abstraction, and the connections between nodes typically convey generalization and specialization relationships. Very often it is infeasible for a user to construct the full taxonomy; it is much easier to specify a partial taxonomy together with some sample documents. For example, as shown in Figure 1, the user provides a partial taxonomy organized as a two-level hierarchy in the finance news domain. The first level consists of general clusters, namely "Corporate/Industrial", "Economics", and "Markets". More specific sample clusters are specified under each first-level node.¹ For each leaf node, the user also provides some sample documents.

[Figure 1: An example of user provided partial taxonomy. First level: Corporate/Industrial, Economics, Markets. Second level: Strategy/Plans, Legal/Judicial, and Regulation/Policy under Corporate/Industrial; Economic Performance and Monetary/Economic under Economics; Equity Markets under Markets.]

¹ The sample partial taxonomy is similar to a portion of the Reuters RCV1 corpus taxonomy.

This paper investigates a new method for expanding a hierarchical taxonomy based on labeled and unlabeled documents. Given a user provided partial hierarchical taxonomy, some sample documents in the partial taxonomy, and a set of unlabeled documents, the objective is to expand the taxonomy and cluster all labeled and unlabeled documents. Figure 2 depicts an example of a refined taxonomy discovered from the partial taxonomy given in Figure 1. The unlabeled news stories can be assigned to the existing sample clusters, and new clusters of similar granularity are generated appropriately: in the first level, a new node is discovered from the labeled and unlabeled documents, and in the second level, new nodes are likewise formed with granularity similar to that of their sibling nodes.

[Figure 2: The refined taxonomy discovered from the user provided partial taxonomy in Figure 1 using labeled documents and unlabeled documents.]

There is some existing work on hierarchical clustering. Zhao and Karypis [4] evaluated nine agglomerative algorithms and six partitional algorithms. Vaithyanathan and Dom [3] investigated model-based hierarchical clustering by formulating an objective function based on a Bayesian analysis.
None of these approaches considers using user provided sample documents to guide the clustering process. There is also research on semi-supervised clustering. Basu et al. [1] investigated a theoretically motivated framework for semi-supervised clustering that employs Hidden Markov Random Fields. Bilenko et al. [2] described an approach that unifies constraint-based and metric-based semi-supervised clustering methods. A shortcoming of these approaches is that they cannot handle hierarchical taxonomy structures.

2. OUTLINE OF OUR APPROACH

The problem can be formulated as semi-supervised clustering with taxonomy expansion. The whole document set D consists of a set of labeled documents L and a set of unlabeled documents U. The labeled documents are organized in a hierarchy. The goal is to cluster the whole document set D, where the taxonomy provided by the user is allowed to expand if needed and the user provided labels of the documents in L are not violated in the result.

The problem can be decomposed into a series of sub-problems by considering the local structure of the taxonomy at each node. At a given node, the problem can be regarded as semi-supervised flat clustering, where the labeled documents serve as constraints that guide the clustering algorithm towards a more appropriate data partitioning.

In high-dimensional settings, clusters are better characterized by a subset of the full feature set, and each cluster emphasizes different features. Therefore, the performance of discovering the clusters can be improved if an appropriate set of features is chosen for each cluster. We investigate the idea of learning a separate weight metric vector locally for each cluster, denoted by a_l for cluster l. We assume that the number of final clusters K is predefined for the particular node. Our algorithm learns the clusters by minimizing an objective function J using an EM-based iterative algorithm. The objective function is formulated as follows:

    J = \sum_{l=1}^{K} \sum_{x \in D_l} \left( 1 - \frac{\sum_{i=1}^{m_l} x_i \, a_{li} \, \mu_{li}}{\|x\|_{a_l} \, \|\mu_l\|_{a_l}} \right)    (1)

where D_l is the set of documents that belong to cluster l; m_l is the number of features in a_l; x and \mu_l are the vector representations of a document and of the centroid of cluster l respectively; x_i, a_{li}, and \mu_{li} are the weights of feature i in the document vector, the local weight metric, and the centroid of cluster l respectively; and \|\cdot\| denotes the L2 norm.

The outline of the algorithm is as follows:
-------------------------------------------------
1 Initialization: the centroid of each cluster is initialized.
2 Repeat until convergence:
3   E-step: given the centroid and the local weight metric of each
    cluster, update the document-to-cluster assignments.
4   M-step (I): given the cluster assignments, re-calculate the
    centroid of each cluster.
5   M-step (II): estimate the local weight metric a_l of each cluster.
-------------------------------------------------

The user provided samples are used as seeds to initialize the cluster centroids. Let s be the number of user provided sample clusters, where we assume that s ≤ K. In addition to the centroids of the sample clusters, we select the remaining K - s centroids using the farthest-first algorithm.

In the E-step, the assignments of documents to clusters are updated. Since the user provided sample document labels must not be violated in the result, each document in L keeps its assignment throughout the algorithm. Each document in U is assigned to the cluster that minimizes its contribution to the objective function.

The M-step consists of two parts. First, every cluster centroid \mu_l is re-estimated using the current document-to-cluster assignment as follows:

    \mu_l = \frac{\sum_{x \in D_l} x}{\left\| \sum_{x \in D_l} x \right\|_{a_l}}

Second, weight metric learning is performed on each cluster: the metric is re-estimated so as to decrease the objective function J, and the Fletcher-Reeves conjugate gradient algorithm is employed to learn the vector a_l until a local optimum of the objective function is reached.
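To make the procedure concrete, the following is a minimal sketch of this EM loop, not the authors' implementation. It assumes that \|x\|_{a_l} denotes the a_l-weighted L2 norm, clamps the learned weights to stay positive, replaces the farthest-first seeding with random initialization for brevity, and substitutes SciPy's general nonlinear conjugate gradient routine for the Fletcher-Reeves variant named above. All function and parameter names are illustrative.

-------------------------------------------------
import numpy as np
from scipy.optimize import minimize

def distance(x, mu, a):
    # Contribution of one document to Eq. (1):
    # 1 - (sum_i x_i a_i mu_i) / (||x||_a ||mu||_a)
    num = np.dot(x * a, mu)
    den = np.sqrt(np.dot(a, x * x)) * np.sqrt(np.dot(a, mu * mu))
    return 1.0 - num / max(den, 1e-12)

def cluster_objective(a, members, mu):
    # Per-cluster portion of Eq. (1) as a function of the metric a.
    a = np.maximum(a, 1e-6)          # keep weights positive (assumption)
    return sum(distance(x, mu, a) for x in members)

def semi_supervised_cluster(X, K, seeds, n_iter=30):
    # X: (n_docs, n_feats) document vectors (e.g., tf-idf).
    # seeds: dict mapping a labeled document's index to its fixed cluster id.
    n, m = X.shape
    A = np.ones((K, m))              # one local weight metric a_l per cluster
    rng = np.random.default_rng(0)
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    for l in set(seeds.values()):    # seeded clusters start from their samples
        mu[l] = X[[i for i, c in seeds.items() if c == l]].mean(axis=0)
    assign = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # E-step: labeled documents keep their user-provided cluster;
        # each unlabeled document joins the cluster minimizing Eq. (1).
        for i in range(n):
            assign[i] = seeds.get(
                i, min(range(K), key=lambda l: distance(X[i], mu[l], A[l])))
        for l in range(K):
            members = X[assign == l]
            if len(members) == 0:
                continue
            # M-step (I): mu_l = (sum of members) / ||sum of members||_{a_l}
            s = members.sum(axis=0)
            mu[l] = s / max(np.sqrt(np.dot(A[l], s * s)), 1e-12)
            # M-step (II): decrease Eq. (1) w.r.t. a_l by conjugate gradient.
            res = minimize(cluster_objective, A[l], args=(members, mu[l]),
                           method='CG', options={'maxiter': 5})
            A[l] = np.maximum(res.x, 1e-6)
    return assign, mu, A
-------------------------------------------------

Applied at each node of the partial taxonomy in turn, this flat procedure yields the per-node clusterings from which a refined taxonomy such as the one in Figure 2 is assembled.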
3. EXPERIMENTAL RESULTS

In our experiments, we used two datasets. The first dataset was derived from the TDT3 corpus.² We selected the native English newswire documents with human assigned clusters; this dataset contains 2450 news stories organized as a two-level hierarchical taxonomy, with 12 topic categories in the first level and 94 event clusters in the second level. The second dataset was derived from the Reuters RCV1 corpus. We randomly selected stories belonging to at most one of the first two-level clusters; 2400 news stories were chosen. In the first level, there are 4 general clusters, and there are 24 specific clusters in the second level, with 100 news stories selected for each cluster. We used the F-measure as our evaluation measure (a common formulation is sketched at the end of this section).

² A description of the corpus can be found at http://projects.ldc.upenn.edu/TDT3/TDT3_Overview.html

Two sets of experiments were conducted. In each experiment, we varied the percentage of clusters that have labeled samples. In the first experiment, we compared the performance of the algorithm with and without local metric learning. The quality of the clustering solution was evaluated separately at each level of the taxonomy. The results, illustrated in Figures 3 and 4, demonstrate that local metric learning clearly improves clustering performance.

[Figure 3: The performance with and without local weight metric learning on the TDT3 corpus.]

[Figure 4: The performance with and without local weight metric learning on the Reuters RCV1 corpus.]

In the second experiment, the number of sample labeled documents for each cluster was also varied. As shown in Figures 5 and 6, the percentage of clusters with labeled samples affects clustering performance to a larger extent than the number of sample labeled documents for each cluster.

[Figure 5: The performance with different numbers of sample labeled documents for each cluster (30, 90, and 150 documents) on the TDT3 corpus.]

[Figure 6: The performance with different numbers of sample labeled documents for each cluster (50, 150, and 250 documents) on the Reuters RCV1 corpus.]
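The paper does not state which clustering F-measure variant it uses; a common choice in document clustering evaluations (e.g., in Zhao and Karypis [4]) scores each true class by its best-matching cluster and weights classes by size. A minimal sketch under that assumption, with illustrative names:

-------------------------------------------------
def clustering_f_measure(true_labels, cluster_ids):
    # F = sum over classes c of (n_c / n) * max over clusters k of F(c, k),
    # where F(c, k) is the harmonic mean of precision n_ck / n_k and
    # recall n_ck / n_c.
    n = len(true_labels)
    total = 0.0
    for c in set(true_labels):
        n_c = sum(1 for t in true_labels if t == c)
        best = 0.0
        for k in set(cluster_ids):
            n_k = sum(1 for g in cluster_ids if g == k)
            n_ck = sum(1 for t, g in zip(true_labels, cluster_ids)
                       if t == c and g == k)
            if n_ck:
                p, r = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * p * r / (p + r))
        total += (n_c / n) * best
    return total
-------------------------------------------------

For instance, clustering_f_measure(['a', 'a', 'b', 'b'], [0, 0, 1, 1]) returns 1.0, while a clustering that merges both classes into a single cluster scores lower.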
4. REFERENCES

[1] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59-68, 2004.
[2] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the International Conference on Machine Learning, 2004.
[3] S. Vaithyanathan and B. Dom. Model-based hierarchical clustering. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 599-608, 2000.
[4] Y. Zhao and G. Karypis. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10:141-168, 2005.