Improving Active Learning Recall via Disjunctive Boolean Constraints

Emre Velipasaoglu, Yahoo! Inc., Sunnyvale, CA 94089 USA (emrev@yahoo-inc.com)
Hinrich Schütze, Institute for NLP, University of Stuttgart, Germany (hs999@ifnlp.org)
Jan O. Pedersen, Yahoo! Inc., Sunnyvale, CA 94089 USA (jpederse@yahoo-inc.com)

ABSTRACT
Active learning efficiently homes in on the decision boundary between relevant and irrelevant documents, but in the process it can miss entire clusters of relevant documents, yielding classifiers with low recall. In this paper, we propose a method to increase active learning recall by constraining sampling to a document subset rich in relevant examples.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing.
General Terms: Algorithms, Performance.
Keywords: Practical text classification, active learning, missed cluster effect.

1. INTRODUCTION
Machine-learned text classifiers are commonly created by estimating the parameters of a statistical classification model on a labeled training set [7]. Active learning (AL) is an efficient method for creating such a training set. AL starts with a seed set; it then iteratively samples unlabeled examples from the subspace that contains "informative" or "uncertain" documents, labels them and adds them to the training set, and retrains the classifier, thereby redefining the region of uncertainty at each step [1]. AL has been shown to be effective in creating high-performance classifiers with a relatively small number of labeling decisions [2].

Unfortunately, AL can produce classifiers with low recall because the sampling scheme homes in very rapidly on the decision boundary of a classifier produced in an early iteration. If there is only one cluster of relevant documents, this procedure is likely to rapidly learn to discriminate between relevant and non-relevant documents. However, if there are several discrete clusters of relevant documents, it is possible that whole clusters will be missed, ruining recall if not precision. We refer to this phenomenon as the missed cluster effect. We have found that about a third of the relevant documents in the Reuters corpus are in missed clusters after 100 iterations of AL [6]. This paper addresses missed clusters using a bootstrap set of documents that can be obtained by AL.

We propose a method that constrains the document space AL samples from in order to increase the population of relevant examples. This modification improves the likelihood that AL will sample from otherwise missed clusters, resulting in a significant improvement to recall.
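To make the AL loop concrete, here is a minimal sketch of pool-based uncertainty sampling with a linear SVM (the setting used in Section 3, following [4]). The use of scikit-learn, the one-document batch size, and all identifiers are our own illustrative choices, not details fixed by the poster.

```python
import numpy as np
from sklearn.svm import LinearSVC

def uncertainty_sampling(X_pool, y_oracle, seed_idx, n_iterations=200):
    """Each iteration queries the unlabeled document closest to the
    current decision boundary, labels it, and retrains."""
    labeled = list(seed_idx)
    unlabeled = set(range(X_pool.shape[0])) - set(labeled)
    clf = None
    for _ in range(n_iterations):
        clf = LinearSVC().fit(X_pool[labeled], y_oracle[labeled])
        candidates = np.fromiter(unlabeled, dtype=int)
        # |decision_function| grows with distance from the hyperplane;
        # the smallest margin marks the most uncertain example.
        margins = np.abs(clf.decision_function(X_pool[candidates]))
        query = int(candidates[np.argmin(margins)])
        labeled.append(query)  # in practice, a human oracle labels this doc
        unlabeled.remove(query)
    return clf, labeled
```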
2. METHOD
The basic idea of our method is to automatically find a set of characteristic terms for recall enhancement by analyzing the labeled set after a number of iterations of AL. Terms that occur more frequently in the labeled set than in the pool are particularly interesting: they represent concepts that may co-occur with other terms to yield concepts relevant to missed clusters. We propose to use a disjunctive query of such terms to obtain a sample of unlabeled documents from the pool. This subset is empirically found to be richer in missed-cluster documents than the unlabeled pool as a whole. We modify AL simply by constraining the candidate documents to this subset. A new set of characteristic terms is estimated at each subsequent iteration, and a disjunctive query is issued to define a new subset of unlabeled documents.

We use a two-stage method to estimate the set of characteristic terms, based on the counts for each term given in Table 1 (a sketch of the procedure follows at the end of this section):

1. Order terms $t_i$ by descending $\chi^2_{t_i} = \chi^2(A_{t_i}, B_{t_i}, C_{t_i}, D_{t_i})$.
2. Select the first $m$ terms: $T_{\text{stage-1}} = \{ t_i \mid i \le m \}$.
3. Order the terms $t_i \in T_{\text{stage-1}}$ by ascending $r_{t_i} = \lvert 1 - A_{t_i}/C_{t_i} \rvert$.
4. Select the first $n$ terms: $T_{\text{stage-2}} = \{ t_i \mid i \le n \}$.

In words, we select the top $m$ terms by descending chi-square statistic in stage 1, and then, within that set, the top $n$ terms whose ratio of counts $A_{t_i}$ to $C_{t_i}$ is as close to 1 as possible in stage 2.

Table 1. Counts used to estimate characteristic terms.
  $A_{t_i}$: number of docs in the labeled portion of the pool containing term $t_i$
  $B_{t_i}$: number of docs in the unlabeled portion of the pool containing term $t_i$
  $C_{t_i}$: number of docs in the labeled portion of the pool not containing term $t_i$
  $D_{t_i}$: number of docs in the unlabeled portion of the pool not containing term $t_i$

Analysis of a case illustrates how the method works. The GVIO category trained on all labeled examples has a recall of 77%, whereas ordinary AL (i.e., uncertainty sampling as in [4]) reached only 54% in 200 iterations. This recall deficiency signals the possibility of missed clusters. At the 50th iteration, the rate of relevant examples in the subset of unlabeled documents selected by the disjunctive query of the characteristic terms is about 15 times that of the unlabeled pool. Further analysis shows that most of these relevant documents are false negatives. As a result, AL has a higher likelihood of sampling the missed clusters when constrained to this subset of documents.
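The poster does not write out the $\chi^2$ statistic explicitly; for the 2×2 contingency table formed by the Table 1 counts, the standard form would be

```latex
\chi^2_{t_i} =
\frac{N \,\bigl(A_{t_i} D_{t_i} - B_{t_i} C_{t_i}\bigr)^2}
     {(A_{t_i}+B_{t_i})\,(C_{t_i}+D_{t_i})\,(A_{t_i}+C_{t_i})\,(B_{t_i}+D_{t_i})},
\qquad N = A_{t_i} + B_{t_i} + C_{t_i} + D_{t_i}.
```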
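A minimal sketch of the two-stage estimation, assuming documents are represented as sets of terms. The explicit chi-square form (the standard 2×2 statistic above) and all identifiers are our additions, not the authors' implementation.

```python
def characteristic_terms(labeled_docs, unlabeled_docs, vocabulary, m=100, n=10):
    """Two-stage selection: chi-square ranking, then A/C ratio closest to 1."""
    n_lab, n_unl = len(labeled_docs), len(unlabeled_docs)
    counts = {}
    for t in vocabulary:
        A = sum(t in d for d in labeled_docs)     # labeled docs containing t
        B = sum(t in d for d in unlabeled_docs)   # unlabeled docs containing t
        counts[t] = (A, B, n_lab - A, n_unl - B)  # (A, B, C, D) of Table 1

    def chi2(t):
        A, B, C, D = counts[t]
        N = A + B + C + D
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        return N * (A * D - B * C) ** 2 / denom if denom else 0.0

    def ratio(t):
        A, _, C, _ = counts[t]
        return abs(1.0 - A / C) if C else float("inf")

    stage1 = sorted(vocabulary, key=chi2, reverse=True)[:m]  # stage 1: top m
    return sorted(stage1, key=ratio)[:n]                     # stage 2: top n
```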
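The constraint itself is then a disjunctive (OR) query over these terms. A sketch, with a fallback to the full pool that we add for robustness (the poster does not discuss what happens when the query matches nothing):

```python
def constrained_candidates(unlabeled_docs, terms):
    """Unlabeled documents matching the OR-query of the characteristic terms."""
    subset = [d for d in unlabeled_docs if any(t in d for t in terms)]
    return subset or list(unlabeled_docs)
```

Uncertainty sampling then picks its next query from this subset rather than from the full unlabeled pool, and the characteristic terms are re-estimated before every iteration.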
3. EXPERIMENTS
The first half of the RCV1 corpus (400,001 documents) was randomly split into POOL and EVAL. Seed and query documents were drawn from POOL, and EVAL was used for evaluation. 43 categories with frequencies ranging from 0.004% to 47% were selected, representing a mix of "small" and "large" categories. A standard tf-idf vector space representation was used; details of this experimental setup can be found in [6]. Linear SVMs were used, as they are viewed as having close to optimal performance in text categorization [8], and we used uncertainty sampling as in [4]. For each category, we ran 200 iterations of AL starting with a seed set of 5 positive and 5 negative documents selected randomly from POOL. The first 50 iterations were ordinary AL (uncertainty sampling); after that, we constrained AL as explained above. In estimating the characteristic terms, we selected the top m = 100 terms in stage 1 and n = 10 terms in stage 2.

4. RESULTS
The number of relevant examples (+) at the end of 200 iterations, and precision (P), recall (R) and F measured on EVAL, are compared for ordinary and constrained AL. In Table 2, absolute performance is listed for ordinary AL, and relative gain, calculated as (Y-X)/X, is listed for constrained AL.

Table 2. Performances of ordinary and constrained AL.

               Ordinary AL (absolute, %)    Constrained AL (% relative gain)
Category          +     F     P     R          +     F     P     R
SEASIA            5     0     0     0         20     0     0     0
I22470            6     0     0     0          0     0     0     0
I64600            7     0     0     0          0     0     0     0
ISLAM             7     0     0     0          0     0     0     0
SPSAH            10    93   100    88          0     0     0     0
I6540030         12    43    86    29          0     0     0     0
I25520           13    38   100    23          0     0     0     0
SURM             13    70   100    54          0     0     0     0
I22300           15    26   100    15          0     0     0     0
I32550           15    20   100    11          0     0     0     0
I36102           15    12   100     6          0     0     0     0
I45000           17    31    63    21          0     0     0     0
GABON            20    73   100    57          0     0     0     0
I97412           22    24    94    13          0     0     0     0
ERTRA            28    68    93    53          0     0     0     0
I45300           30     9    87     5          7    -4    -1    -4
NEPAL            30    94    99    90          0     0     0     0
SENEG            34    67    89    53          3     2     0     3
I8500031         37    26    69    16         14    35    14    41
PARA             41    77    92    66          0    -1     0    -2
BOL              43    70    84    60         -7    -3     3    -7
I83940           48    20    84    11          4     1     0     1
I82002           49    22    93    12          6    11    -4    14
E411             56    63    72    55          5     2     2     2
SLVNIA           56    89    98    81         -5    -4     1    -7
I81402           57     4    67     2         21     9    11     9
I42700           63    55    85    41          5     3    -3     7
I13000           64    37    86    24         30    27    -5    40
I42900           64    60    82    47          5    -5     4   -10
I65600           64    32    85    20         11    11   -10    18
M131             64    76    93    65          9     4    -6    11
M132             65    77    87    70         11     1     2     0
EEC              66    76    92    65         24     6    -2    13
GVIO             66    65    83    54         38    11    -8    28
DEN              67    88    97    81          9     1     0     1
E51              67    36    84    23         25    26   -12    43
ROM              68    93    99    87         21     0     0     1
E212             69    77    92    66          9     1    -2     3
CANA             81    78    95    67          4     5    -3    11
AUSTR            83    88    97    81         13     2    -2     6
GCAT            101    88    90    85         -1     0     2    -3
CCAT            105    84    83    86         -1    -1    -1    -1
USA             106    84    93    77         -2    -2    -7     3

For instance, the E51 category had 67 relevant examples at the end of 200 iterations of ordinary AL (including the 5 relevant examples in the seed set). Its performance numbers were 36%, 84% and 23% for F, P and R, respectively. Constrained AL produced 25% more relevant examples, increased F and R by 26% and 43%, respectively, and caused a 12% drop in P.

Figure 1 shows the relative gain in F vs. the final number of relevant examples in ordinary AL. The average relative gain in recall is 5.1% (p-value = 0.008 in a one-sided paired t-test). F is up 3.2% (p-value = 0.008), and precision is down 0.6%, a change that is not significant (p-value = 0.1). There was no difference in performance between ordinary and constrained AL for small categories. Constrained AL caused a slight loss of accuracy for the three largest categories, GCAT, CCAT and USA, which have population rates of 30%, 47% and 33%, respectively; it appears that constraining by a disjunctive query of only 10 terms is too aggressive and unnecessary for such categories. A strong gain is observed for "medium" size categories, whose population rates ranged from 0.04% to 4.3%. In general, the improvement in recall came at the expense of a small loss in precision. This is acceptable, and even preferred, in many practical text classification scenarios, since ordinary AL is heavily biased toward high precision [6].

[Figure 1. Number of relevant examples in ordinary AL vs. percent relative gain in F by constrained AL.]
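The significance figures above come from one-sided paired t-tests over the 43 per-category scores. Here is a minimal sketch of that computation; the use of SciPy (the `alternative=` keyword requires SciPy >= 1.6) and the handling of zero-score categories are our assumptions, since the poster does not spell them out.

```python
import numpy as np
from scipy import stats

def relative_gain(x_ordinary, y_constrained):
    """Per-category relative gain (Y - X) / X, as reported in Table 2.
    Categories scoring 0 under ordinary AL are mapped to 0 gain here,
    since 0/0 is undefined (an assumption on our part)."""
    x = np.asarray(x_ordinary, dtype=float)
    y = np.asarray(y_constrained, dtype=float)
    safe_x = np.where(x > 0, x, 1.0)  # avoid division by zero
    return np.where(x > 0, (y - x) / safe_x, 0.0)

def one_sided_paired_pvalue(x_ordinary, y_constrained):
    """p-value for H1: constrained AL scores higher than ordinary AL."""
    return stats.ttest_rel(y_constrained, x_ordinary,
                           alternative="greater").pvalue
```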
5. RELATED WORK
In [3], a prior distribution over terms is utilized to increase recall. In [5], AL is extended by requesting explicit feedback about terms in interleaved steps. Unlike ours, these solutions require explicit information over the term space. Our method is useful where knowledge of the term space is unavailable or unclear, and it can be complementary to methods that leverage domain knowledge explicitly.

6. REFERENCES
[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[2] S. Dasgupta. Analysis of a greedy active learning strategy. NIPS, 2004.
[3] A. Dayanik, D. D. Lewis, D. Madigan, V. Menkov, and A. Genkin. Constructing informative prior distributions from domain knowledge in text classification. SIGIR, 2006.
[4] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. SIGIR, 1994.
[5] H. Raghavan, O. Madani, and R. Jones. InterActive Feature Selection. IJCAI, 2005.
[6] H. Schütze, E. Velipasaoglu, and J. O. Pedersen. Performance thresholding in practical text classification. CIKM, 2006.
[7] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[8] Y. Yang and X. Liu. A re-examination of text categorization methods. SIGIR, 1999.