SIGIR 2007 Proceedings Poster

Combining Error-Correcting Output Codes and Model-Refinement for Text Categorization

Songbo Tan1, Yuefen Wang2
1 Information Security Center, ICT, P.O. Box 2704, Beijing, 100080, China
2 Information Center, Chinese Academy of Geological Sciences, Beijing, 100037, China
tansongbo@software.ict.ac.cn, tansongbo@gmail.com

ABSTRACT
In this work, we explore the use of error-correcting output codes (ECOC) to enhance the performance of the centroid text classifier. The framework is to decompose one multi-class problem into multiple binary problems and then learn the individual binary classification problems with the centroid classifier. However, this kind of decomposition incurs considerable bias for the centroid classifier, which results in noticeable degradation of performance. To address this issue, we use Model-Refinement to adjust this so-called bias.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - search process

General Terms
Algorithms; Performance; Experimentation

Keywords
Error-Correcting Output Codes (ECOC); Text Categorization; Information Retrieval

ECOC converts the multi-class problem into a large number (L) of two-class supervised learning problems [3]. Any learning algorithm can be used to learn each of these L problems. L can then be thought of as the length of the codewords, with one bit in each codeword for each classifier. The ECOC algorithm is outlined in Figure 1.

TRAINING
1 Load parameters, i.e., code length L and class number K.
2 Create an L-bit code for the K classes.
3 For each bit, train a binary classifier over the total training data.
TESTING
1. Apply each of the L classifiers to the test example.
2. Assign the test example the class label with the largest votes.
Figure 1: The Outline of ECOC
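The training/testing outline of Figure 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: `make_codes` draws random bits for brevity, whereas the paper's experiments use 63-bit BCH codes, and the function names are our own.

```python
import random

def make_codes(num_classes, code_len, seed=0):
    """Create a code_len-bit codeword for each class (random bits here;
    BCH codes additionally guarantee well-separated codewords)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(code_len)]
            for _ in range(num_classes)]

def decode(codes, bits):
    """TESTING step 2: each class gets one vote for every bit on which
    its codeword agrees with the binary classifiers' outputs; return
    the class with the largest number of votes."""
    votes = [sum(b == c for b, c in zip(bits, code)) for code in codes]
    return max(range(len(codes)), key=lambda k: votes[k])
```

Counting agreements is equivalent to picking the codeword with the smallest Hamming distance to the L classifier outputs, which is the standard ECOC decoding step.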
1. INTRODUCTION
In recent years, Error-Correcting Output Codes (ECOC) have been applied to boost Naïve Bayes, decision-tree, and SVM classifiers for text data [2-4]. Following this research direction, in this work we explore the use of ECOC to enhance the performance of the centroid classifier [1]. To the best of our knowledge, no previous work has been conducted on exactly this problem. The framework we adopt is to decompose one multi-class problem into multiple binary problems and then use the centroid classifier to learn the individual binary classification problems.

However, this kind of decomposition incurs considerable bias [6] for the centroid classifier. In substance, the decision rule of the centroid classifier rests on a straightforward assumption: the documents in one category should share some similarities with each other. This hypothesis is often violated by ECOC, because ECOC ignores the similarities of the original classes when disassembling one multi-class problem into multiple binary problems.

To attack this problem, we use Model-Refinement [5] to reduce this so-called bias. The basic idea is to take advantage of misclassified examples in the training data to iteratively refine and adjust the centroids. Empirical evaluation [5] shows that Model-Refinement can dramatically reduce the bias.

3. METHODOLOGY
3.1 The bias incurred by ECOC for the centroid classifier
The basic idea of the centroid classifier is to construct a centroid C_i for each class c_i using formula (1), where d denotes one document vector and |z| indicates the cardinality of set z. In substance, the centroid classifier uses a simple decision rule (formula (2)): a given document is assigned to the class whose centroid is most similar (or least distant) to it. This rule rests on a straightforward assumption: the documents in one category should share some similarities with each other.
$$C_i = \frac{1}{|c_i|} \sum_{d \in c_i} d \qquad (1)$$

$$c = \arg\max_i \frac{d \cdot C_i}{\|d\|_2 \, \|C_i\|_2} \qquad (2)$$

For example, single-topic documents about "sport" or "education" meet this presumption, whereas hybrid documents about both "sport" and "education" break it. The ECOC-based centroid classifier also breaks this hypothesis, because ECOC ignores the similarities of the original classes when producing the binary problems. In this scenario, many different classes are often merged into one category; for example, the classes "sport" and "education" may be assembled into one class. As a result, the assumption will inevitably be broken.

Consider a simple multi-class classification task with 12 classes. After coding the original classes, we obtain the dataset of Figure 2: class 0 consists of 6 original categories, and class 1 contains the other 6. We then calculate the centroids of merged class 0 and merged class 1 using formula (1), and draw a Middle Line that is the perpendicular bisector of the line segment between the two centroids.

[Figure 2 here: merged class 0 and merged class 1 with their centroids, an example document d, and the Middle Line between them.]

2. ERROR-CORRECTING OUTPUT CODING
ECOC works by converting a multi-class supervised learning problem into multiple two-class problems, as outlined in Figure 1.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

ECOC research in ensemble learning has shown that it is well suited for classification tasks with a large number of categories. On the other hand, Model-Refinement has proved to be an effective approach to reducing the bias of a base classifier; that is to say, it can dramatically boost the performance of the base classifier.

In our experiments, we use two corpora: 20NewsGroup 1 and Industry Sector 2. For 20NewsGroup, we use a subset consisting of all categories and 19,446 documents; for Industry Sector, we use a subset called Sector-48, consisting of 48 categories and 4,581 documents in all. For the experiments involving SVM we employed SVMTorch (www.idiap.ch/~bengio/projects/SVMTorch.html).
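The centroid construction and decision rule of formulas (1) and (2) in Section 3.1 can be sketched as follows. This is a minimal sketch under our own assumptions (dense document vectors, cosine similarity, our function names), not the authors' implementation:

```python
from math import sqrt

def centroid(docs):
    """Formula (1): the centroid of a class is the average of its
    document vectors."""
    n = len(docs)
    return [sum(xs) / n for xs in zip(*docs)]

def cosine(u, v):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def classify(d, centroids):
    """Formula (2): assign d to the class whose centroid is most similar."""
    return max(range(len(centroids)), key=lambda i: cosine(d, centroids[i]))
```

With cosine similarity the "most similar centroid" rule is exactly the arg-max of formula (2); with Euclidean distance the same sketch would use an arg-min instead.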
4. EXPERIMENT RESULTS
We use Information Gain [7] to select 10,000 features. For simplicity, we use MR to stand for Model-Refinement. For ECOC, we use 63-bit BCH coding [3]. Model-Refinement (MR) iterates over the whole training set 8 times.

From Table 1, we can observe that ECOC indeed brings significant bias to the centroid classifier, which results in a considerable decrease in accuracy. Especially on Sector-48, the bias reduces the accuracy of the centroid classifier from 0.7985 to 0.6422. On the other hand, the combination of ECOC and Model-Refinement yields a significant performance improvement over the centroid classifier: on NewsGroup it beats the centroid classifier by about 4 percentage points; on Sector-48, by about 11 points. More encouragingly, it yields better performance than SVM, especially on Sector-48. This improvement also indicates that Model-Refinement can effectively reduce the bias incurred by ECOC.

Table 1: The accuracy of different methods

Dataset    | Centroid | MR+Centroid | ECOC+Centroid | ECOC+MR+Centroid | SVM
Sector-48  | 0.7985   | 0.8671      | 0.6422        | 0.9122           | 0.8948
NewsGroup  | 0.8371   | 0.8697      | 0.8085        | 0.8788           | 0.8777

Figure 2: Original centroids of merged class 0 and merged class 1 (C0, C1)

According to the decision rule (formula (2)) of the centroid classifier, the examples of class 0 that lie to the right of the Middle Line will be misclassified into class 1. This is the mechanism by which ECOC brings bias to the centroid classifier. In other words, the ECOC method conflicts with the assumption of the centroid classifier to some degree.

3.2 Why can Model-Refinement reduce this bias?
To decrease this kind of bias, we employ the Model-Refinement approach to adjust the class representatives, i.e., the centroids. The basic idea of Model-Refinement is to make use of training errors to adjust the class centroids, so that the biases are reduced gradually and the training-set error rate falls gradually as well.
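The adjustment just described can be sketched as an error-driven update loop: drag the true class's centroid toward each misclassified document and push the wrongly predicted centroid away from it. A minimal sketch; the default learning rate 0.2, the dot-product similarity, and the names are our assumptions (the paper only requires a learning rate between 0 and 1, with 8 passes over the training set):

```python
def dot_classify(d, centroids):
    """Pick the class with the largest dot-product similarity to d."""
    return max(range(len(centroids)),
               key=lambda i: sum(a * b for a, b in zip(d, centroids[i])))

def refine(centroids, docs, labels, eta=0.2, epochs=8):
    """Model-Refinement sketch: for every training error, move the true
    centroid toward the document (+eta*d) and the predicted centroid
    away from it (-eta*d), as in the paper's update formulas."""
    for _ in range(epochs):
        for d, y in zip(docs, labels):
            p = dot_classify(d, centroids)
            if p != y:
                centroids[y] = [c + eta * x for c, x in zip(centroids[y], d)]
                centroids[p] = [c - eta * x for c, x in zip(centroids[p], d)]
    return centroids
```

Each pass only touches centroids involved in an error, so once the training set is separated the centroids stop moving.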
For example, if document d of class 0 is misclassified into class 1, both centroids C_0 and C_1 should be moved right by formulas (3) and (4) respectively:

$$C_0^* = C_0 + \eta d \qquad (3)$$

$$C_1^* = C_1 - \eta d \qquad (4)$$

where \eta (0 < \eta < 1) is the learning rate, which controls the step size of the updating operation. With this so-called move operation, C_0 and C_1 both move right gradually. At the end of this kind of move operation (see Figure 3), no example of class 0 lies to the right of the Middle Line, so no example will be misclassified. More details can be found in [5].

[Figure 3 here: merged class 0 and merged class 1 with the refined centroids C*0 and C*1 and the Middle Line.]
Figure 3: Refined centroids of merged class 0 and merged class 1

3.3 The combination of ECOC and Model-Refinement for the centroid classifier
In this subsection, we present the outline (Figure 4) of combining ECOC and Model-Refinement for the centroid classifier. In substance, the improved ECOC combines the strengths of ECOC and Model-Refinement.

TRAINING
1 Load parameters, i.e., code length L and class number K.
2 Create an L-bit code for the K classes.
3 For each bit, train a binary centroid classifier over the total training data.
4 Use Model-Refinement to adjust the centroids.
TESTING
1. Apply each of the L classifiers to the test example.
2. Assign the test example the class label with the largest votes.
Figure 4: Combination of ECOC and Model-Refinement

5. CONCLUSION
In this work, we examine the use of ECOC for improving the centroid text classifier. The experimental results indicate that the combination of ECOC with Model-Refinement yields a wide-margin performance improvement over the traditional centroid classifier, and even performs comparably with the SVM classifier.

6. REFERENCES
[1] E. Han and G. Karypis. Centroid-based document classification: analysis and experimental results. PKDD, 2000.
[2] A. Berger. Error-correcting output coding for text classification. IJCAI, 1999.
[3] R. Ghani.
Using error-correcting codes for text classification. ICML, 2000.
[4] R. Ghani. Combining labeled and unlabeled data for multiclass text categorization. ICML, 2002.
[5] S. Tan, X. Cheng, M. Ghanem, B. Wang, and H. Xu. A novel refinement approach for text categorization. CIKM, 2005.
[6] Y. Liu, Y. Yang, and J. Carbonell. Boosting to correct inductive bias in text classification. CIKM, 2002.
[7] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. ICML, 1997.

1 http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/wwkb.
2 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.