WWW 2007 / Poster Paper, Topic: Search

Efficient Training on Biased Minimax Probability Machine for Imbalanced Text Classification

Xiang Peng
Department of Computer Science & Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
xpeng@cse.cuhk.edu.hk

Irwin King
Department of Computer Science & Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
king@cse.cuhk.edu.hk

ABSTRACT

The Biased Minimax Probability Machine (BMPM) constructs a classifier for imbalanced learning tasks. In this paper, we propose a Second Order Cone Programming (SOCP) based algorithm to train the model. We outline the theoretical derivation of the biased classification model and, using the proposed strategy, address text classification tasks in which negative training documents significantly outnumber positive ones. We evaluate the learning scheme against traditional solutions on three different datasets. Empirical results show that our method is more effective and more robust for imbalanced text classification problems.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology--Classifier Design and Evaluation; H.2.8 [Database Management]: Database Applications--Data Mining

General Terms

Algorithms, Management, Experimentation

Keywords

Biased Classification, Second Order Cone Programming, Biased Minimax Probability Machine, Text Classification

1. INTRODUCTION

With the rapid growth of text information on the World Wide Web (WWW), text classification has become one of the most important topics in both the research and the engineering communities [1]. However, there are two major problems with current algorithms for the text classification task.

One key challenge is that almost all existing algorithms treat the problem as a balanced classification task and ignore the imbalance of the dataset, i.e., the fact that the number of negative documents is much larger than the number of positive ones. Take, for example, the task of learning which news articles are of interest to a particular person reading Google News. The articles the person is interested in may be only a small portion of the whole text database, and methods that filter and present only the documents the user finds interesting are highly desirable. Currently, most researchers treat this problem as a strict binary classification problem and ignore the fact that the uninteresting documents vastly outnumber the interesting ones. Making the returned document set as accurate as possible is therefore a crucial problem.

The second challenge is that, in order to build a reliable classifier for text classification, we need to train the model with a huge number of predefined documents, which is usually a very time-consuming process. Thus, reducing the time required to train a reliable text classifier is a crucial obstacle for large-scale text classification. This is particularly challenging for the classification of WWW documents, given their large volume.

In this paper, we apply the Biased Minimax Probability Machine (BMPM) model to the problem of imbalanced text classification and propose a new training algorithm that tackles the complexity and accuracy issues of the BMPM learning task. The model is transformed into a Second Order Cone Programming (SOCP) problem. Under this new framework, the imbalanced text classification problem can be modeled and solved efficiently.

The rest of this paper is organized as follows. Section 2 introduces the concept of BMPM. Section 3 presents an effective SOCP-based algorithm for training the BMPM. Section 4 presents the results of our empirical study.

2. BIASED MINIMAX PROBABILITY MACHINE

In a two-category classification task, we assume that two random vectors x and y represent the two classes of data, with means and covariance matrices {x̄, Σx} and {ȳ, Σy}, respectively, where x, y, x̄, ȳ ∈ R^n and Σx, Σy ∈ R^{n×n}. The Biased Minimax Probability Machine (BMPM) attempts to determine a hyperplane a^T z = b, with a^T z > b classified as class x and a^T z < b classified as class y, that separates the important class of data x with maximal probability while keeping the accuracy on the less important class of data y acceptable.^1 We formulate this objective as follows:

$$
\max_{\alpha,\,\beta,\,b,\,a \neq 0} \alpha
\quad \text{s.t.} \quad
\inf_{x \sim (\bar{x},\,\Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha, \qquad
\inf_{y \sim (\bar{y},\,\Sigma_y)} \Pr\{a^T y \le b\} \ge \beta, \qquad
\beta \ge \beta_0, \tag{1}
$$

where α and β represent the lower bounds on the accuracy of future data classification, namely, the worst-case accuracy, and β₀ is a pre-specified positive constant which represents an acceptable accuracy level for the less important class.

^1 The reader may refer to [2] for a more detailed and complete description.
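To make the worst-case probabilities in (1) concrete: for a fixed hyperplane (a, b), the multivariate Chebyshev bound used throughout the MPM literature [2] gives a closed form, inf_{x~(x̄,Σx)} Pr{a^T x ≥ b} = κ²/(1 + κ²) with κ = (a^T x̄ − b)/√(a^T Σx a), provided a^T x̄ ≥ b. The following is a minimal numpy sketch of this bound; the function name and the toy data are illustrative and are not part of our experimental code.

```python
import numpy as np

def worst_case_accuracy(a, b, mean, cov):
    """Worst-case probability that a^T z >= b over all distributions
    sharing the given mean and covariance (multivariate Chebyshev bound)."""
    margin = a @ mean - b                 # signed margin of the class mean
    if margin <= 0:
        return 0.0                        # bound is vacuous on the wrong side
    kappa = margin / np.sqrt(a @ cov @ a)
    return kappa**2 / (1.0 + kappa**2)

# Toy example: alpha bounds Pr{a^T x >= b} for class x; flipping the signs
# of (a, b) yields the beta bound on Pr{a^T y <= b} for class y.
a, b = np.array([1.0, -1.0]), 0.0
alpha = worst_case_accuracy(a, b, np.array([2.0, 0.0]), np.eye(2))
beta = worst_case_accuracy(-a, -b, np.array([0.0, 2.0]), np.eye(2))
print(f"alpha >= {alpha:.3f}, beta >= {beta:.3f}")
```

Problem (1) then searches over (a, b) to maximize the α bound on the important class while the β bound on the less important class is held above β₀.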
3. EFFICIENT BMPM TRAINING

3.1 Motivation

Most recent studies on the BMPM are based on a Fractional Programming formulation (we denote it BMPM_FP), which can be solved by the Rosen gradient method. However, this reformulation relies on a crucial assumption during the transformation, and when the assumption fails, the model solution fails as well. Another issue is that, when the Fractional Programming based BMPM_FP is applied to large real-world classification problems, it is very sensitive to the data dimension and very time consuming.

3.2 Proposed Strategy

Our main result is stated below.

Theorem 1. If x̄ = ȳ, then the minimax probability decision problem (1) does not have a meaningful solution: the optimal worst-case misclassification probability obtained is 1 − α = 1. Otherwise, an optimal hyperplane H(a*, b*) exists and can be determined by solving the convex optimization problem

$$
\min_{t,\,a \neq 0} \; t - a^T(\bar{x} - \bar{y})
\quad \text{s.t.} \quad
\|\Sigma_x^{1/2} a\| \le 1, \qquad
\sqrt{\tfrac{\beta_0}{1-\beta_0}}\, \|\Sigma_y^{1/2} a\| \le t, \tag{2}
$$

and setting b to the value

$$
b^* = a^{*T}\bar{x} - \sqrt{\tfrac{\alpha^*}{1-\alpha^*}}\, \|\Sigma_x^{1/2} a^*\|
    = a^{*T}\bar{y} + \sqrt{\tfrac{\beta_0}{1-\beta_0}}\, \|\Sigma_y^{1/2} a^*\|,
$$

where a* is an optimal solution of (2), α* is the corresponding optimal worst-case accuracy, and t ∈ R is a new optimization variable. Furthermore, if either Σx or Σy is positive definite, the optimal hyperplane is unique.

Lemma 1. A Second Order Cone Programming problem with a linear objective function and norm constraints is a convex optimization problem and can thus be solved efficiently.

We omit the details of the proofs due to space limitations.
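Since (2) has a linear objective and two second order cone constraints, it can be handed directly to an off-the-shelf conic solver. Below is a minimal sketch using cvxpy; this is an illustration of the approach rather than the implementation used in our experiments. It assumes the class statistics x̄, ȳ, Σx, Σy are given (in practice, plug-in sample estimates) and that both covariance matrices are positive definite so that Cholesky factors exist.

```python
import numpy as np
import cvxpy as cp

def train_bmpm_socp(x_mean, x_cov, y_mean, y_cov, beta0=0.7):
    """Solve the SOCP (2) and recover the hyperplane (a*, b*) and alpha*."""
    kappa0 = np.sqrt(beta0 / (1.0 - beta0))   # kappa(beta0)
    Lx = np.linalg.cholesky(x_cov)            # Sigma_x = Lx Lx^T, so
    Ly = np.linalg.cholesky(y_cov)            # ||Lx^T a|| = ||Sigma_x^{1/2} a||

    a = cp.Variable(x_mean.shape[0])
    t = cp.Variable()
    problem = cp.Problem(
        cp.Minimize(t - (x_mean - y_mean) @ a),
        [cp.norm(Lx.T @ a) <= 1,              # ||Sigma_x^{1/2} a|| <= 1
         kappa0 * cp.norm(Ly.T @ a) <= t])    # kappa0 ||Sigma_y^{1/2} a|| <= t
    problem.solve()

    a_star = a.value
    # Threshold from Theorem 1, evaluated on the y-side of the equality.
    b_star = a_star @ y_mean + kappa0 * np.linalg.norm(Ly.T @ a_star)
    # At the optimum ||Sigma_x^{1/2} a*|| = 1, so kappa(alpha*) equals the
    # negated optimal objective value; invert kappa to obtain alpha*.
    kappa_star = max(-problem.value, 0.0)
    alpha_star = kappa_star**2 / (1.0 + kappa_star**2)
    return a_star, b_star, alpha_star
```

A new point z is then assigned to the important class x when a*^T z > b*, and to class y otherwise.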
4. EXPERIMENTAL RESULTS

We evaluated our proposed biased learning algorithm against state-of-the-art approaches through empirical comparisons on three standard datasets for text document classification: the Reuters-21578 dataset, the 20-Newsgroups data collection, and the Enron Corpus. The same data pre-processing and feature selection procedures are applied to all three datasets. Due to space limitations, we present only our results on the Reuters-21578 dataset.

Applying the BMPM-based technique to text classification is straightforward: we take the interesting documents to be the more important class (x) in the biased classification framework and the uninteresting ones to be the less important class (y). For the experimental setup, we employ Receiver Operating Characteristic (ROC) analysis and Test-Set Accuracy (TSA) as the performance measurements. The traditional algorithms included in the comparison are the Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Minimax Probability Machine (MPM).

Table 1: Lower bound and Test-Set Accuracy on the Reuters-21578 dataset (%)

Method       α              β              TSA_x          TSA_y          TSA
BMPM_SOCP    81.42 ± 0.22   70.00 ± 0.00   83.10 ± 0.60   72.61 ± 0.84   77.85 ± 0.04
BMPM_FP      80.35 ± 0.13   70.00 ± 0.00   81.07 ± 0.63   74.48 ± 0.69   77.70 ± 0.21
MPM          76.30 ± 0.28   76.30 ± 0.34   74.91 ± 0.61   75.20 ± 0.62   75.05 ± 0.37
SVM          -              -              73.23 ± 1.59   74.60 ± 0.47   73.90 ± 0.44
kNN          -              -              71.60 ± 0.38   69.40 ± 0.60   70.50 ± 0.55

Table 1 shows the TSA results, where we can see that our two BMPM models achieve better performance than the other algorithms in most cases, while BMPM_SOCP generally outperforms the BMPM_FP method.

[Figure 1: ROC curves on the Reuters-21578 dataset: full range (left) and the crucial part (right). Both panels plot the True Positive Rate against the False Positive Rate for BMPM_SOCP, BMPM_FP, and kNN.]

Furthermore, the ROC curves in Figure 1 show that most parts of the BMPM curves lie above the corresponding kNN curve, and the BMPM_SOCP curve lies above the BMPM_FP curve, which demonstrates the superiority of the BMPM models and of our proposed BMPM_SOCP algorithm.

5. ACKNOWLEDGMENTS

The work described in this paper is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4235/04E) and is affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

6. REFERENCES

[1] S. C. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by batch mode active learning. In Proc. of the World Wide Web Conference, pages 633-642, 2006.

[2] K. Huang, H. Yang, I. King, M. Lyu, and L. Chan. The minimum error minimax probability machine. Journal of Machine Learning Research, 5:1253-1286, 2004.