Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification

Doug Downey
Electrical Engineering and Computer Science Department, Northwestern University, Evanston, IL 60208
ddowney@eecs.northwestern.edu

Oren Etzioni
Turing Center, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195
etzioni@cs.washington.edu

Abstract

Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.

1 Introduction

Is accurate classification possible in the complete absence of hand-labeled data? A priori, the answer would seem to be no, unless the learner has knowledge of some additional problem structure. This paper identifies a problem structure, called Monotonic Features (MFs), that enables the learner to automatically assign probabilistic labels to data. A feature is said to be monotonic when the probability of class membership increases monotonically with that feature's value, all else being equal.

MFs occur naturally in a broad range of textual classification tasks. For example, it can be shown that Naive Bayes text classifiers return probability estimates that are monotonic in the frequency of a word--for the class in which the word is most common. Thus, if we are trying to discriminate between documents about New York and Boston, then we expect to find that the Naive Bayes feature measuring the frequency of "Giants" in the corpus is an MF for the class New York, and likewise for "Patriots" and Boston. In document classification, the name of the class is a natural MF--the more times it is repeated in a document, all other things being equal, the more likely it is that the document belongs to the class. We demonstrate this to be the case empirically in Section 4, extending the experiments of [7]. Similarly, information retrieval systems classify documents into relevant and irrelevant documents based, in part, on the Term-Frequency-Inverse-Document-Frequency (TF-IDF) metric, and then proceed to rank the relevant documents. The term frequency component of this metric is a monotonic feature.

The power of MFs is not restricted to textual classification. Consider Information Extraction (IE), where strings are extracted from sentences and classified into categories (e.g., City or Film) based on their proximity to "extraction patterns". For example, the phrase "cities such as" is an extraction pattern. Any proper noun immediately following this pattern is likely to denote a city, as in the phrase "cities such as Boston, Seattle, and New York" [8]. When classifying a proper noun, the number of times that it follows an extraction pattern in a corpus turns out to be a powerful MF.
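To make the monotonicity claim for Naive Bayes concrete, the following is a short derivation for the binary, multinomial parameterization; the notation (word counts n_w and class-conditional word probabilities theta_{w,c}) is our own illustration, not notation taken from the paper.

```latex
% Multinomial Naive Bayes over word counts n_w in a document x,
% with theta_{w,c} = P(word w | class c). The log-odds of the positive class are
\[
\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)}
  \;=\; \log \frac{P(y=1)}{P(y=0)}
  \;+\; \sum_{w} n_w \, \log \frac{\theta_{w,1}}{\theta_{w,0}} .
\]
% Holding the other counts fixed, the log-odds are linear in n_w with slope
% log(theta_{w,1}/theta_{w,0}), which is positive exactly when w is more common in
% class 1 than in class 0. Hence P(y=1 | x) increases monotonically with n_w for the
% class in which w is most common, i.e., the count of w behaves as an MF for that class.
```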
This observation is implicit in the combinatorial model of unsupervised IE put forth in [5]. Finally, MF-based techniques have been demonstrated to be effective for word-sense disambiguation using a set of manually-specified MFs [12]; this work was later extended to automatically derive MFs from resources like Wordnet [9]. Thus, MFs have been used implicitly in a broad range of textual classification tasks. This paper makes the MF abstraction explicit, provides a formal theory of MFs and an automatic method for explicitly detecting and utilizing them, and quantifies the method's benefits empirically.

1.1 Contribution

Typically, MFs cannot serve directly as classifiers. Instead, this paper presents theoretical and empirical results showing that even relatively weak MFs can be used to induce a noisy labeling over examples, and these examples can then be used to train effective classifiers utilizing existing supervised or semi-supervised techniques. Our contributions are as follows:

1. We prove that the Monotonic Feature (MF) structure guarantees PAC learnability using only unlabeled data, and that MFs are distinct from and complementary to standard biases used in semi-supervised learning, including the manifold and cluster assumptions.

2. We present a general technique, called MA, for employing MFs in combination with an arbitrary concept learning algorithm. We demonstrate experimentally that MA can outperform state-of-the-art techniques for semi-supervised document classification, including Naive Bayes with Expectation Maximization (NB-EM) and Label Propagation, on the 20 Newsgroups data set [10].

The remainder of the paper is organized as follows. Section 2 formally defines our problem structure and the properties of monotonic features. Section 3 presents our theoretical results and formalizes the relationship between the MF approach and previous work. We present experimental results in Section 4, and conclude with directions for future work.

2 Formal Framework

We consider a semi-supervised classification task, in which the goal is to produce a mapping from an instance space X consisting of d-tuples x = (x_1, ..., x_d) to a binary output space Y = {0, 1}. [Footnote 1: For convenience, we restrict our formal framework to the binary case, but the techniques and analysis can be extended trivially to the multi-class case.] We denote the concept class of mappings f : X → Y as C. We assume the following inputs:

· A set of zero or more labeled examples D_L = {(x_i, y_i) | i = 1 ... n}, drawn i.i.d. from a distribution P(x, y) for x ∈ X and y ∈ Y.

· A set of zero or more unlabeled examples D_U = {x_i | i = 1 ... u}, drawn from the marginal distribution P(x) = Σ_y P(x, y).

· A set M ⊆ {1, ..., d} of zero or more monotonic features for the positive class y = 1. The monotonic features have properties specified below.

The goal of the classification task is to produce a mapping c ∈ C that maximizes classification accuracy evaluated over a set of test examples drawn i.i.d. from P(x, y).

We further define C_M ⊆ C as the concept class of binary classifiers that use only the monotonic features. Similarly, let C_¬M ⊆ C indicate the concept class of binary classifiers using only the non-monotonic features.

Monotonic features exhibit a monotonically increasing relationship with the probability that an example is a member of the positive class. More formally, we define monotonic features as follows:
Definition 1 A monotonic feature for class y is a feature i ∈ {1, ..., d} for which the following three properties hold:

· The domain of x_i is fully ordered and discrete, and has finite support. [Footnote 2: For convenience, we present our analysis in terms of discrete and finite monotonic features, but the results can be extended naturally to the continuous case.]

· The conditional probability that an example is an element of class y = 1 increases strictly monotonically with the value of x_i. That is, P(y = 1 | x_i = r) > P(y = 1 | x_i = r') if r > r'.

· The monotonicity is non-trivial in that P(x_i) has positive probability for more than one feature value. That is, there exist r > r' and ε > 0 such that P(x_i = r), P(x_i = r') > ε.

With this definition, we can state precisely the monotonic feature structure:

Definition 2 For a learning problem from the input space X of d-tuples x = (x_1, ..., x_d) to the output space Y, the monotonic feature structure (MFS) holds if and only if at least one of the features i ∈ {1, ..., d} is a monotonic feature for the positive class y = 1.

When tasked with a learning problem for which the MFS holds, three distinct configurations of the input are possible. First, monotonic features may be known in the absence of labeled data (|M| > 0, D_L = ∅). This is the setting considered in previous applications of monotonic features, as discussed in the introduction. Second, monotonic features may be unknown, but labeled data may be provided (M = ∅, |D_L| > 0); this corresponds to standard semi-supervised learning. In this case, the MFS can still be exploited by identifying monotonic features using the labeled data, as described in Section 4.1. Lastly, both monotonic features and labeled data may be provided (|M|, |D_L| > 0). We provide algorithms for each case and evaluate each experimentally in Section 4.
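As an illustration of Definition 1, the sketch below estimates P(y = 1 | x_i = r) from a labeled sample and checks whether the estimates increase with r. This is a minimal Python sketch of our own (the paper prescribes no implementation); the smoothing constant, the non-decreasing check (a practical relaxation of the strict inequality in Definition 1), and the function name are assumptions.

```python
import numpy as np

def empirical_monotonicity(x_col, y, alpha=1.0):
    """Estimate P(y=1 | x_i = r) for each observed value r of one feature column,
    and report whether the estimates are non-decreasing in r.

    x_col : 1-D array of discrete, ordered feature values
    y     : 1-D array of binary labels in {0, 1}
    alpha : add-alpha smoothing to avoid unstable estimates on rare values
    """
    x_col, y = np.asarray(x_col), np.asarray(y)
    values = np.unique(x_col)                 # ordered domain of x_i
    p_pos = []
    for r in values:
        mask = (x_col == r)
        # Smoothed estimate of P(y = 1 | x_i = r)
        p_pos.append((y[mask].sum() + alpha) / (mask.sum() + 2 * alpha))
    p_pos = np.array(p_pos)
    is_monotone = bool(np.all(np.diff(p_pos) >= 0))
    return values, p_pos, is_monotone

# Toy usage: counts of the word "politics" vs. membership in a politics class.
counts = np.array([0, 0, 1, 2, 0, 3, 1, 4, 0, 2])
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
values, probs, mono = empirical_monotonicity(counts, labels)
print(values, probs.round(2), mono)
```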
3 Theory of Monotonic Features

This section shows that under certain assumptions, knowing the identity of a single monotonic feature is sufficient to PAC learn a target concept from only unlabeled data. Further, we prove that monotonic features become more informative relative to labeled examples as the feature set size increases. Lastly, we discuss and formally establish distinctions between the monotonic feature abstraction and other semi-supervised techniques.

We start by introducing the conditional independence assumption, which states that the monotonic features are conditionally independent of the non-monotonic features given the class. Formally, the conditional independence assumption is satisfied iff P({x_i : i ∈ M} | y, {x_j : j ∉ M}) = P({x_i : i ∈ M} | y).

We show that when the concept class C_¬M is learnable in the PAC model with classification noise, and the conditional independence assumption holds, then knowledge of a single monotonic feature makes the full concept class C learnable from only unlabeled data. Our result builds on a previous theorem from [2], and requires the following definition:

Definition 3 A classifier h ∈ C_M is weakly-useful iff there exists ε > 0 such that P(h(x) = 1) ≥ ε and P(y = 1 | h(x) = 1) ≥ P(y = 1) + ε.

Theorem 1 If the conditional independence assumption is satisfied and the concept class C_¬M is learnable in the PAC model with classification noise, then given a single monotonic feature, C is learnable from unlabeled data only.

Proof Sketch. The result follows from Theorem 1 in [2] and an application of Hoeffding bounds to show that the monotonic feature can be used to construct a weakly-useful classifier. [Footnote 3: Proofs are omitted due to lack of space.]

The next theorem suggests that monotonic features become preferable to labeled examples as the feature space increases in size. We compare the value of monotonic features and labeled examples in terms of information gain, defined below. For convenience, these results are presented using a feature space X_B, in which all features are binary-valued.

Definition 4 The information gain with respect to an unlabeled example's label y provided by a variable v is defined as the reduction in entropy of y when v is given, that is, Σ_y P(y | v) log(P(y | v) / P(y)).

Next, we define the two properties of the classification task that our theorem requires. Informally speaking, the first property states that the feature space does not have fully redundant features, whereas the second states that examples which are far apart have less dependent labels than those which are close together. We would expect these properties to hold for most tasks in practice.

Definition 5 A distribution D on (X_B, Y) has bounded feature dependence if there exists ε_F > 0 such that the conditional probability P_D(x_i | {x_j = r_j : j ≠ i}) < 1 - ε_F for all i and all sets {x_j = r_j : j ≠ i} of assignments to one or more of the x_j.

Definition 6 A distribution D on (X_B, Y) has distance-diminishing information gain if the information gain of an example x with respect to the label of any neighboring example x' is less than K_I · ε_I^r for some K_I and ε_I < 1, where r is the Hamming distance between x and x'.

The following theorem shows that whenever the above properties hold to a sufficient degree, the expected information gain from a labeled example falls as the size of the feature space increases.

Theorem 7 For a learning problem governed by a distribution D with bounded feature dependence and distance-diminishing information gain, with ε_F > ε_I / (ε_I + 1), as the number of features d increases, the expected information gain provided by a labeled example about unlabeled examples' labels decreases to zero. However, the information gain from an MF x_f with given relationship P(y | x_f) remains constant as d increases.

The portion of Theorem 7 which concerns information gain of a labeled example is a version of the well-known "curse of dimensionality" [1], which states that the number of examples needed to estimate a function scales exponentially with the number of dimensions under certain assumptions. Theorem 7 differs in detail, however; it states the curse of dimensionality in terms of information gain, making possible a direct comparison with monotonic features.
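The information gain of Definition 4 is the KL divergence between the posterior P(y | v) and the prior P(y). The small Python helper below is our own illustration, with hypothetical names; it computes that quantity and shows numerically why a distant labeled example, which barely shifts the posterior, contributes little information (the intuition behind Definition 6).

```python
import numpy as np

def information_gain(p_y, p_y_given_v):
    """Information gain about a label y provided by observing a variable v,
    per Definition 4: sum_y P(y | v) * log( P(y | v) / P(y) ).

    p_y         : array of prior label probabilities P(y)
    p_y_given_v : array of posterior label probabilities P(y | v), same order
    """
    prior = np.asarray(p_y, dtype=float)
    post = np.asarray(p_y_given_v, dtype=float)
    mask = post > 0          # terms with zero posterior contribute 0
    return float(np.sum(post[mask] * np.log(post[mask] / prior[mask])))

# A nearby labeled example that shifts a (0.5, 0.5) prior to (0.9, 0.1) is
# informative; a distant one that barely moves it to (0.55, 0.45) is not.
print(information_gain([0.5, 0.5], [0.9, 0.1]))    # ~0.368 nats
print(information_gain([0.5, 0.5], [0.55, 0.45]))  # ~0.005 nats
```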
3.1 Relation to Other Approaches

In the introduction, we identified several learning methods that utilized Monotonic Features (MFs) implicitly, which was a key motivation for formalizing MFs. This section explains the ways in which MF-based classification is distinct from previous semi-supervised learning methods.

When MFs are provided as input, they can be viewed as a kind of "labeled feature" studied in [6]. However, instead of "expectation regularization", we use the prior to probabilistically label examples. Thus, MFs can complement any concept learning algorithm, not just discriminative probabilistic models as in [6]. Moreover, in Section 4, we show that MFs can either obviate hand-labeled data, or can be estimated automatically from a small hand-labeled sample.

Co-training [2] is a semi-supervised technique that also considers a partition of the feature space into two distinct "views." One might ask if monotonic feature classification is equivalent to co-training with the monotonic features serving as one view, and the other features forming the other. However, co-training requires labeled data to train classifiers for each view, unlike monotonic feature classification, which can operate without any labeled data. Thus, there are cases where an MF-based algorithm like MA is applicable, but co-training is not.

Even when labeled data is available, co-training takes the partition of the feature set as input, whereas monotonic features can be detected automatically using the labeled data. Also, co-training is an iterative algorithm in which the most likely examples of a class according to one view are used to train a classifier on the other view in a mutually recursive fashion. For a given set of monotonic features, however, iterating this process is ineffective, because the most likely examples of a class according to the monotonic feature view are fixed by the monotonicity property.

3.1.1 Semi-supervised Smoothness Assumptions

In this section, we prove that the MFS is distinct from certain smoothness properties typically assumed in previous semi-supervised learning methods, known as the cluster and manifold assumptions. The cluster assumption states that in the target concept, the boundaries between classes occupy relatively low-density regions of the distribution P(x). The manifold assumption states that the distribution P(x) is embedded on a manifold of strictly lower dimension than the full input space X. It can be shown that classification tasks with the MFS exist for which neither the cluster assumption nor the manifold assumption holds. Similarly, we can construct classification tasks exhibiting the manifold assumption, the cluster assumption, or their conjunction, but without the MFS. Thus, we state the following theorem.

Theorem 8 The monotonic feature structure neither implies nor is implied by the manifold assumption, the cluster assumption, or their conjunction or disjunction.

In conclusion, it is clear that the MFS is a novel problem structure that is both conceptually and formally distinct from previously articulated structures including co-training, the cluster assumption, and the manifold assumption.

4 Experiments

This section reports on our experiments in utilizing MFs for text classification. As discussed in the introduction, MFs have been used implicitly by several classification methods in numerous tasks. Here we quantify their impact on the standard "20 Newsgroups" dataset [10]. We show that MFs can be employed to perform accurate classification even without labeled examples, extending the results from [7] to a semi-supervised setting. Further, we also demonstrate that whether or not the identities of MFs are given, exploiting the MF structure by learning MFs can improve performance.

4.1 General Methods for Monotonic Feature Classification

Here, we define a set of abstract methods for incorporating monotonic features into any existing learning algorithm. The first method, MA, is an abstraction of the MF word sense disambiguation algorithm first introduced in [12]. It is applicable when monotonic features are given but labeled examples are not provided.
The second, MA-SSL, applies in the standard semi-supervised learning case when some labeled examples are provided, but the identities of the MFs are unknown and must be learned. Lastly, MA-BOTH applies when both labeled data and MF identities are given.

When monotonic features are given to the algorithm without labeled data, MA proceeds as shown in Figure 1. MA labels the unlabeled examples D_U as elements of class y = 1 iff some monotonic feature value x_i for i ∈ M exceeds a threshold τ. The threshold is set using unlabeled data so as to maximize the minimum probability mass on either side of the threshold. [Footnote 4: This policy is suggested by the proof of Corollary 1, in which the only requirement of the threshold is that sufficient mass lies on each side.] This set of bootstrapped examples D'_L is then fed as training data into a supervised or semi-supervised algorithm Λ(D'_L, D_U), and MA outputs the resulting classifier. In general, the MA schema can be instantiated with any machine learning algorithm Λ.

MA(M, D_U, Λ)
1. D'_L = Labeled examples (x, y) such that y = 1 iff x_i > τ for some i ∈ M
2. Output Λ(D'_L, D_U)

Figure 1: Pseudocode for MA. The inputs are M, a set of monotonic features; D_U, a set of unlabeled examples; and Λ(L, U), a supervised or semi-supervised machine learning algorithm which outputs a classifier given labeled data L and unlabeled data U. The threshold τ is derived from the unlabeled data and M (see text).

When MFs are unknown, but some labeled data is given, the MA-SSL algorithm (Figure 2) attempts to identify MFs using the labeled training data D_L, and adds the most strongly monotonic features to the set M. Monotonicity strength can be measured in various ways; in our experiments below, we rank each feature x_i by the quantity f(y, x_i) = Σ_r r · P(y, x_i = r) for each class y. MA-SSL then adds monotonic features to M in descending order of this value, up to a limit of k = 5 per class. [Footnote 5: A sensitivity analysis revealed that varying k by up to 40% in either direction did not decrease performance of MA-SSL in the experiments in Section 4.3.] MA-SSL then invokes a given machine learning algorithm Λ_M(M, D_L, D_U) to learn a probabilistic classifier that employs only the monotonic features in M. MA-SSL uses this classifier to probabilistically label the examples in D_U to form D'_L. MA-SSL then returns Λ(D_L ∪ D'_L, D_U) as in MA. A feature of MA-SSL is that if no monotonic features are identified, MA-SSL defaults to the underlying algorithm Λ.

MA-SSL(D_L, D_U, Λ, Λ_M)
1. M = the k strongest monotonic features in D_L
2. D'_L = Examples from D_U probabilistically labeled with Λ_M(M, D_L, D_U)
3. Output Λ(D_L ∪ D'_L, D_U)

Figure 2: Pseudocode for MA-SSL. The inputs D_U and Λ are the same as those of MA (see Figure 1). The additional inputs include labeled data D_L and a machine learning algorithm Λ_M(M, L, U) which, given labeled data L and unlabeled data U, outputs a probabilistic classifier that uses only the monotonic features M. k is a parameter of MA-SSL (see text).

When monotonic features are known and labeled examples are available, we run a derivative of MA-SSL denoted as MA-BOTH. The algorithm is the same as MA-SSL, except that any given monotonic features are added to the learned set in Step 1 of Figure 2, and bootstrapped examples using the given monotonic features (from Step 1 in Figure 1) are added to D_L.
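To make the MA and MA-SSL schemas concrete, here is a minimal Python sketch under several assumptions of ours: scikit-learn's MultinomialNB stands in for the learner Λ, the threshold τ is chosen by exhaustively maximizing the minimum mass on either side, and all function names are hypothetical. The paper does not specify an implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def balanced_threshold(values):
    """Pick a threshold tau over the observed feature values that maximizes the
    minimum probability mass on either side of the split (MA, Figure 1)."""
    values = np.asarray(values)
    best_tau, best_mass = None, -1.0
    for tau in np.unique(values):
        left = float(np.mean(values <= tau))
        if min(left, 1.0 - left) > best_mass:
            best_tau, best_mass = tau, min(left, 1.0 - left)
    return best_tau

def ma(X_unlabeled, mf_columns, learner=None):
    """MA schema (binary case): bootstrap labels from the monotonic features,
    then train an arbitrary learner (here a stand-in for Lambda) on them.

    X_unlabeled : (u, d) array of nonnegative feature counts (the set D_U)
    mf_columns  : list of column indices of the monotonic features (the set M)
    """
    if learner is None:
        learner = MultinomialNB()
    X_unlabeled = np.asarray(X_unlabeled)
    mf_vals = X_unlabeled[:, mf_columns]
    taus = np.array([balanced_threshold(mf_vals[:, j]) for j in range(mf_vals.shape[1])])
    # y = 1 iff some monotonic feature value exceeds its threshold tau.
    y_boot = (mf_vals > taus).any(axis=1).astype(int)
    learner.fit(X_unlabeled, y_boot)   # bootstrapped set D'_L handed to the learner
    return learner

def mf_strength(x_col, y_col, target_class):
    """MA-SSL ranking score f(y, x_i) = sum_r r * P(y, x_i = r), estimated from
    labeled data; higher values indicate more strongly monotonic features."""
    x_col = np.asarray(x_col)
    in_class = np.asarray(y_col) == target_class
    return float(sum(r * np.mean((x_col == r) & in_class) for r in np.unique(x_col)))
```

In MA-SSL, mf_strength would be computed on D_L for each class, and the k highest-scoring features per class added to M before the bootstrapping and training steps above.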
4.2 Experimental Methodology and Baseline Methods

The task we investigate is to determine from the text of a newsgroup post the newsgroup in which it appeared. We used bag-of-words features after converting terms to lowercase, discarding the 100 most frequent terms and all terms appearing only once. Below, we present results averaged over four disjoint training sets of variable size, using a disjoint test set of 5,000 documents and an unlabeled set of 10,000 documents.

We compared the monotonic feature approach with two alternative algorithms, which represent two distinct points in the space of semi-supervised learning algorithms. The first, NB-EM, is a semi-supervised Naive Bayes with Expectation Maximization algorithm [11]. The second, LP, is a semi-supervised graph-based label propagation algorithm recently employed for text classification [3]. We found that on this dataset, the NB-EM algorithm substantially outperformed LP (providing a 41% error reduction in the experiments in Figure 3), so in Section 4.3 we compare our algorithms exclusively with NB-EM.

When the identities of monotonic features are given, we obtained one-word monotonic features simply using the newsgroup name, with minor modifications to expand abbreviations. This methodology closely followed that of [7]. For example, the occurrence count of the term "politics" was a monotonic feature for the talk.politics.misc newsgroup. We also expanded the set of monotonic features to include singular/plural variants.

We employed Naive Bayes classifiers for both Λ and Λ_M. We weighted the set of examples labeled using the monotonic features (D'_L) equally with the original labeled set, increasing the weight by the equivalent of 200 labeled examples when monotonic features are given to the algorithm.

4.3 Experimental Results

The first question we investigate is what level of performance MA can achieve without labeled training data, when monotonic features are given. The results of this experiment are shown in Table 1. MA achieves accuracy on the 20-way classification task of 0.563. Another way to measure this accuracy is in terms of the number of labeled examples that a baseline semi-supervised technique would require in order to achieve comparable performance. We found that MA outperformed NB-EM with up to 160 labeled examples. This first experiment is similar to that of [7], except that instead of evaluating against only supervised techniques, we use a more comparable semi-supervised baseline (NB-EM).

Could the monotonic features, on their own, suffice to directly classify the test data? To address this question, the table also reports the performance of using the given monotonic features exclusively to label the test data (MF Alone), without using the semi-supervised technique Λ. We find that the bootstrapping step provides large benefits to performance; MA has an effective number of labeled examples eight times that of MF Alone.

Method             Accuracy        Labeled Example Equivalent
Random Baseline    5%              0
MF Alone           24%             20
MA                 56% (2.33x)     160 (8x)

Table 1: Performance of MA when monotonic features are given and no labeled examples are provided. MA achieves accuracy of 0.563, ten fold that of a Random Baseline classifier that assigns labels randomly, and more than double that of "MF Alone", which uses only the monotonic features and ignores the other features. MA's accuracy exceeds that of the NB-EM baseline trained with 160 labeled examples, and its labeled example equivalent is eight fold that of "MF Alone".
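For reference, the "MF Alone" baseline in Table 1 can be approximated by scoring each document directly by its monotonic-feature counts. The sketch below is our own construction (the paper does not give code); the data layout and function name are assumptions, and ties are broken toward the first class.

```python
import numpy as np

def mf_alone_predict(X_counts, class_mf_columns):
    """Classify each document as the class whose monotonic-feature terms occur
    most often, using only the MF columns (the 'MF Alone' baseline in Table 1).

    X_counts         : (n_docs, vocab_size) array of term counts
    class_mf_columns : dict mapping class index -> list of MF column indices
                       (e.g., the columns for "politics"/"political" terms for
                       the talk.politics.misc class)
    """
    classes = sorted(class_mf_columns)
    # Score of a document for a class = total count of that class's MF terms.
    scores = np.stack(
        [X_counts[:, class_mf_columns[c]].sum(axis=1) for c in classes], axis=1)
    return np.array(classes)[scores.argmax(axis=1)]
```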
The second question we investigate is whether the monotonic feature approach can improve performance even when the class name is not given. MA-SSL takes the same inputs as the NB-EM technique, without the identities of monotonic features. The performance of MA-SSL as the size of the labeled data set varies is shown in Figure 3. The graph shows that for small labeled data sets of size 100-400, MA-SSL outperforms NB-EM by an average error reduction of 15%. These results are statistically significant (p < 0.001, Fisher Exact Test). Lastly, when both monotonic features and labeled examples are available, MA-BOTH reduces error over the NB-EM baseline by an average of 31% across the training set sizes shown in Figure 3.

4.4 Analysis of Experimental Results

Figure 3 shows that MA-SSL outperforms NB-EM, but one important question is whether this performance difference is in fact due to the presence of monotonic features. An alternative hypothesis is that the performance improvements are due instead to the utilization of a smaller hypothesis space in Step 2 of Figure 2; for smaller training sets, could the regularization inherent in performing feature selection improve generalization accuracy without relying on MFs? We tested this hypothesis by replacing MA-SSL's monotonic feature measure f(y, x_i) with a standard information gain measure, and learning an equal number of features distinct from the features selected by MA-SSL. We found that this information-gain based approach results in performance essentially equivalent to that of NB-EM, suggesting that a smaller hypothesis space does not account for the performance boost due to the monotonic features. For additional analysis of the experimental results, along with experiments in another domain, see [4].

[Figure 3 appears here: a plot of classification accuracy (y-axis, 0 to 0.8) against the number of labeled training examples (x-axis, 0 to 800) for MA-SSL, NB-EM, and MA-BOTH.]

Figure 3: Performance in document classification. MA-SSL reduces error over the NB-EM baseline by 15% for training sets between 100 and 400 examples, and MA-BOTH reduces error by 31% overall.

5 Conclusions

We have presented a general framework for utilizing Monotonic Features (MFs) to perform classification without hand-labeled data, or in a semi-supervised setting where monotonic features can be discovered from small numbers of hand-labeled examples. While our experiments focused on the 20 Newsgroups data set, we have complemented them with both a theoretical analysis and an enumeration of a wide variety of algorithms that have used MFs implicitly.

References

[1] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, pages 92-100. Morgan Kaufmann Publishers, 1998.
[3] J. Chen, D.-H. Ji, C. L. Tan, and Z.-Y. Niu. Semi-supervised relation extraction with label propagation. In HLT-NAACL, 2006.
[4] D. Downey. Redundancy in Web-scale Information Extraction: Probabilistic Model and Experimental Results. PhD thesis, University of Washington, 2008.
[5] D. Downey, O. Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. In Procs. of IJCAI, 2005.
[6] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
[7] A. Gliozzo, C. Strapparava, and I. Dagan. Investigating unsupervised learning for text categorization bootstrapping. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 129-136, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[8] M. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In Procs. of the 14th International Conference on Computational Linguistics, pages 539-545, Nantes, France, 1992.
[9] R. Mihalcea and D. I. Moldovan. An automatic method for generating sense tagged corpora. In AAAI/IAAI, pages 461-466, 1999.
[10] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[11] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103-134, 2000.
[12] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189-196, 1995.