Constructing Informative Prior Distributions from Domain Knowledge in Text Classification

Aynur Dayanik
DIMACS & Computer Science, Rutgers University, Piscataway, NJ
aynur@rutgers.edu

David D. Lewis
David D. Lewis Consulting, Chicago, IL
s06paper@daviddlewis.com

Vladimir Menkov
Aqsaqal Enterprises, Penticton, B.C., Canada
vmenkov@cs.indiana.edu

David Madigan
DIMACS & Statistics, Rutgers University, Piscataway, NJ
dmadigan@rutgers.edu

Alexander Genkin
DIMACS, Rutgers University, Piscataway, NJ
alexgenkin@iname.com

ABSTRACT

Supervised learning approaches to text classification are in practice often required to work with small and unsystematically collected training sets. The alternative to supervised learning is usually viewed to be building classifiers by hand, using a domain expert's understanding of which features of the text are related to the class of interest. This is expensive, requires a degree of sophistication about linguistics and classification, and makes it difficult to use combinations of weak predictors. We propose instead combining domain knowledge with training examples in a Bayesian framework. Domain knowledge is used to specify a prior distribution for the parameters of a logistic regression model, and labeled training data is used to produce a posterior distribution, whose mode we take as the final classifier. We show on three text categorization data sets that this approach can rescue what would otherwise be disastrously bad training situations, producing much more effective classifiers.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Algorithms, Experimentation

Keywords: knowledge-based, maximum entropy, MAP estimation

1. INTRODUCTION

Numerous studies show that effective text classifiers can be produced by supervised learning methods, including support vector machines (SVMs) [12, 15, 34], regularized logistic regression [10, 34], and other approaches [15, 28, 34]. Most of these studies used thousands to tens of thousands of randomly selected training examples. In operational text classification settings, however, small training sets are the rule, due to the expense and inconvenience of labeling, or skepticism that the effort will be adequately repaid.

To learn from a handful of training examples, one must either use a sufficiently limited model class or add a regularization penalty that effectively constrains the classifiers reachable with a small amount of data. Otherwise overfitting (learning accidental properties of the training data) will yield poor effectiveness on future data. On the other hand, strong constraints on the models limit the effectiveness of the classifiers that can be learned. This situation can be improved if one has advance knowledge of which classifiers are likely to be good for the class of interest. In text categorization, for instance, such knowledge might come from category descriptions meant for manual indexers, reference materials on the topics of interest, lists of features chosen by a domain expert, or many other sources.

Bayesian statistics provides a convenient framework for combining domain knowledge with training examples [3]. The approach produces a posterior distribution for the quantities of interest (e.g., regression coefficients). By Bayes' theorem, the posterior distribution is proportional to the product of a prior distribution and the likelihood function. In applications with large numbers of training examples, the likelihood dominates the prior. With small numbers of training examples, however, the prior is influential, and priors that reflect appropriate knowledge can provide improved predictive performance.
In what follows we apply this approach with logistic regression as our model and text classification (in particular text categorization) as our application. We begin by reviewing the use of logistic regression in text classification (Section 2), then discuss previous approaches to integrating domain knowledge in text classification (Section 3). Section 4 presents our Bayesian approach, which is simple and flexible. Section 5 describes our experimental methods, while Section 6 presents our results. We find on three text categorization test collections, using three diverse sources of domain knowledge, that domain-specific priors can yield large effectiveness improvements.

2. BAYESIAN LOGISTIC REGRESSION

A logistic regression model is a linear model for the conditional log odds of a binary outcome, i.e.

  p(y_i = +1 | \beta, x_i) = \frac{\exp(\sum_j \beta_j x_{ij})}{1 + \exp(\sum_j \beta_j x_{ij})} = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)},

where y_i encodes the class of example i (positive = +1, negative = -1) and x_ij is the value of feature j for example i (e.g. a within-document term weight). We assume that j runs from 0 to d, the number of features, and that x_i0 = 1.0 for all i, i.e. the model has an intercept term. A logistic regression training algorithm chooses a vector of model parameters beta that optimizes some appropriate criterion function on a set of training examples for which the y_i values are known.

In the Bayesian MAP (maximum a posteriori) approach to logistic regression [10, 20], the criterion function is the sum of the log-likelihood of the data and the log of the prior distribution of the regression coefficients,

  l(\beta) = -\sum_{i=1}^{n} \ln(1 + \exp(-\beta^T x_i y_i)) + \ln p(\beta),

where p(\beta) is the prior probability for the parameter vector beta, and we output a value \hat{\beta} (which may or may not be unique) that maximizes l(\beta). The prior, p(\beta), can be any probability distribution over real-valued vectors. MAP estimation is neither necessarily superior nor inferior to other Bayesian approaches [29].

Logistic regression [8, 16, 20, 27, 33, 34] and the related probit regression [4] have been widely used in text classification. Regularization to avoid overfitting has been based on feature selection, early stopping of the fitting process, and/or a quadratic penalty on the size of the regression coefficients. The last of these, also called ridge logistic regression, can be interpreted as MAP estimation where p(\beta) is a product of univariate Gaussians with mean 0 and a shared variance [20]. Recently, Genkin et al. [10] showed that MAP estimation with univariate Laplace priors, i.e. a lasso [30] version of logistic regression, was effective for text categorization.
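To make the MAP criterion concrete, the following is a minimal sketch of Gaussian-prior (ridge-style) MAP estimation with per-coefficient prior modes and variances. It is an illustration written for this paper summary, not the BBR package's algorithm; the array names and solver choice are assumptions. A Laplace (lasso) prior would replace the quadratic penalty with an absolute-value penalty, whose non-differentiability requires a different solver (which BBR provides).

```python
# Sketch: MAP estimation for logistic regression with a Gaussian prior having
# per-coefficient modes (means) and variances.  Illustrative only, not BBR.
import numpy as np
from scipy.optimize import minimize

def fit_map_gaussian(X, y, prior_mode, prior_var):
    """X: (n, d+1) array with a leading column of 1s for the intercept.
    y: labels in {-1, +1}.  prior_mode, prior_var: arrays of length d+1."""
    def neg_log_posterior(beta):
        margins = y * (X @ beta)
        nll = np.sum(np.logaddexp(0.0, -margins))            # -log-likelihood
        nlp = np.sum((beta - prior_mode) ** 2 / (2.0 * prior_var))  # -log-prior
        return nll + nlp

    def grad(beta):
        margins = y * (X @ beta)
        p = np.exp(-np.logaddexp(0.0, margins))               # sigma(-margin), stable
        return -(X.T @ (y * p)) + (beta - prior_mode) / prior_var

    beta0 = np.asarray(prior_mode, dtype=float).copy()
    res = minimize(neg_log_posterior, beta0, jac=grad, method="L-BFGS-B")
    return res.x
```

A call such as fit_map_gaussian(X, y, np.zeros(d + 1), np.full(d + 1, sigma2)) corresponds to ordinary ridge logistic regression with a shared prior variance sigma2; the domain knowledge methods of Section 4 simply change individual entries of the two prior arrays.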
3. PRIOR WORK

Feature extraction is one use of domain knowledge (most famously in spam filtering [19]). Creating better features is good, but one would also like to guide the learner to use them. Domain knowledge can also be used to choose which features to use (feature selection). An old example is stopwords [25, 31], often deleted in content-based text classification, but specifically included in authorship attribution. Another is relevance feedback, where words from the user query are usually required to appear in the learned model [2, 25]. The downside of feature selection is that it cannot reduce the impact of a term without discarding it entirely.

Relevance feedback may also treat textual queries as if they were positive examples for supervised learning (e.g. the "Ide Regular" algorithm [2]). Since a query (or other domain-informative text) may differ in characteristics such as length, nondomain vocabulary, and nontextual features from training documents, this approach is risky. Finally, some relevance feedback algorithms (e.g. Rocchio [2]) use a query to set initial values of some or all classifier parameters, which are then updated by training data. This is a more flexible approach, but past algorithms have not dealt directly with negative predictors, strong predictors of uncertain polarity, or predictors for which we have varying degrees of domain knowledge.

Several papers have modified existing learning algorithms -- for naive Bayes [14, 17], logistic regression (fit with a boosting-style algorithm) [26], or SVMs [32] -- to use domain knowledge in text categorization. All require users to convert knowledge about words into weighted training examples. Several heuristics are suggested for this weighting, but they implicitly assume a substantial number of task documents (at least unlabeled ones) are available. A recent study [22] that upweights human-selected terms in SVM learning (by altering document vectors) is similar in spirit to our work, though in an active learning context.

Closely related to using domain knowledge is mixing training data from different sources in supervised learning (domain adaptation or transfer). Cohen and Kudenko studied the learning of propositional classifiers for email filtering when the bulk of the training data was from a different source than the test data [6]. They found that forbidding the use of negative predictors improved effectiveness. Gabrilovich and Markovitch used a combination of feature generation, feature selection, and domain adaptation from a large web directory to improve classification of diverse documents [9]. Chelba and Acero [5] used out-of-task labeled examples in logistic regression training of a text capitalizer, and used the resulting MAP estimate as the mode vector of a Bayesian prior for training with in-task examples. Our work has similarities to Chelba and Acero's, as well as to non-textual uses of Bayesian priors to incorporate knowledge ([18], and citations therein).

4. USING DOMAIN KNOWLEDGE

Given the wide use in text classification of Gaussian priors, and the recent success of Laplace priors, we take these as our starting point. The univariate Gaussian and Laplace distributions each have two parameters, so a product of such distributions for d features and an intercept gives 2d + 2 hyperparameters. The Gaussian parameters are the mean mu_j and the variance sigma_j^2. The Laplace parameters are the mean mu_j and the scale parameter lambda_j, corresponding to a variance of 2/lambda_j^2. For both distributions the mean mu_j is also the mode, and in this paper we will refer to the mode and variance as the hyperparameters for both the Gaussian and Laplace distributions. The mode specifies the most likely value of beta_j, while the variance specifies how confident we are that beta_j is near the mode.

As for domain knowledge, our interest is in a range of possible clues as to which words are good predictors for a class. Focused lists of words generated specifically for classification are of interest, but so are reference materials that provide noisier evidence. We refer to all these sources as "domain knowledge texts," and assume for simplicity there is exactly one domain knowledge text for each class (more can easily be used). We call a set of such texts a "domain knowledge corpus."

For a given class, we distinguish between two sets of words. Knowledge words (KWs) are all those that occur in the domain knowledge text for the class of interest. Other words (OWs) are all words that occur in the training documents for a particular run, but are not KWs. Table 1 summarizes the methods discussed in this section.
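Since both priors are specified here by a mode and a variance, the only extra step for the Laplace case is converting the requested variance into the scale parameter. A small sketch, with illustrative values (not the values used in the experiments):

```python
# Sketch: per-feature prior hyperparameters under the mode/variance convention.
import math

def gaussian_hyperparams(mode, variance):
    # Gaussian prior specified directly by its mean (= mode) and variance.
    return {"mode": mode, "variance": variance}

def laplace_hyperparams(mode, variance):
    # Laplace prior with density (lambda/2) * exp(-lambda * |beta - mode|),
    # whose variance is 2 / lambda**2; invert that to recover the scale.
    return {"mode": mode, "scale": math.sqrt(2.0 / variance)}

# Illustrative example: other words keep a cross-validated variance sigma^2,
# knowledge words get a variance C_DKRW times larger (Method Var, Section 4.2.1).
sigma2, c_dkrw = 1.0, 10.0
ow_prior = laplace_hyperparams(0.0, sigma2)
kw_prior = laplace_hyperparams(0.0, c_dkrw * sigma2)
```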
4.1 Baselines

Text classification research using regularized logistic regression has usually set all prior modes to 0, and all prior variances to a common value (or has used the non-Bayesian equivalent). Some papers explore several values for the prior variance [16, 34], others use a single value but do not say how it was chosen [20, 33], and others choose the variance by cross-validation on the training set [10]. We used cross-validation (Section 5.1.1) to choose a common prior variance for OWs. In our "No DK" baseline (Table 1), all words are OWs.

Another simple baseline is to create X copies of the prior knowledge text for a class and add these copies to the training data as additional positive examples ("DK examples" in Table 1). We applied the same tokenization and term weighting (Section 5.3) to these artificial documents as to the usual training documents. We tested a range of values for X, but include results only for the best value, X = 5.

4.2 Priors From Domain Knowledge

Our four methods for using domain knowledge to specify class-specific hyperparameters begin by giving OWs a prior with mode mu_j = 0 and a common variance sigma^2 chosen by cross-validation. KWs are then given more ability to affect classification by assigning them a larger prior mode or variance than OWs. All four methods use a heuristic constant C_DKRW, the "domain knowledge relative weight", to control how much more influence KWs have. This constant can be set manually or, as in our experiments, chosen by cross-validation on the training set (Section 5.1.1).

Two of our methods use the entire set of domain knowledge texts in determining how significant each word in the text for a particular class is. As a heuristic measure of significance, we use TFIDF weighting (Section 5.3) within the domain knowledge corpus:

  significance(t, Q) = logtf(t, d) \times idf(t),    (1)

where
- d is the domain knowledge text for class Q,
- logtf(t, d) = 0 if term t does not occur in text d, or 1 + log_e(tf(t, d)) if it does, where tf(t, d) is the number of occurrences of t in d,
- idf(t) = log_e((N_K + 1)/(df(t) + 1)), where N_K is the total number of domain knowledge texts used to compute IDF weights, and df(t) is the number of those texts that contain term t.

We now describe the methods.

4.2.1 Variance-Setting Methods

One view is that KWs will tend to have more weight (i.e. parameter values farther from 0) than OWs in good logistic regression models for the class, but could be either positive or negative predictors. That suggests the prior on a KW should have a larger variance than the prior on an OW. Methods Var and Var/TFIDF (Table 1) make the prior variances for KWs a multiple of the variance, sigma^2, for OWs. This multiple is the same for all KWs in Method Var:

  \sigma_j^2 = C_{DKRW} \times \sigma^2,

but is proportional to our heuristic measure of term significance (Equation 1) in Method Var/TFIDF:

  \sigma_j^2 = C_{DKRW} \times significance(t_j, Q) \times \sigma^2.

These methods use a prior mode of 0 for both OWs and KWs.

4.2.2 Mode-Setting Methods

Another view of a domain knowledge text is that it contains words which are likely to be positive predictors of class membership, i.e. that KWs will tend to have parameter values greater than 0 in good logistic regression models for the class. Along these lines, Methods Mode and Mode/TFIDF make the prior mode for a KW greater than 0, in contrast to the mode of 0 used for OWs. Method Mode gives the prior for every KW the same mode:

  \mu_j = C_{DKRW},

while Method Mode/TFIDF makes the prior modes proportional to term significance:

  \mu_j = C_{DKRW} \times significance(t_j, Q).

Both methods use cross-validation to choose a common variance for both OWs and KWs.

While mode-setting may seem more natural than variance-setting, it carries more risks. If a term does not occur in the training data, then the MAP estimate for the corresponding parameter is identically the prior mode. With nonzero prior modes and a tiny training set, we risk hardwiring many untested parameter choices into the final classifier.

Table 1: Summary of methods tested for incorporating domain knowledge into learning. C_DKRW specifies the relative weight given to domain knowledge. In all cases OWs (nondomain words) and the intercept use a prior with mode 0 and variance sigma^2.

Method       Description of the method
No DK        (baseline) KWs: none
DK examples  Like No DK, but treat the domain knowledge text for the class as X positive examples
Var          KWs: mode 0; variance sigma_j^2 = C_DKRW x sigma^2; (C_DKRW, sigma^2) pair chosen by cross-validation
Var/TFIDF    KWs: mode 0; variance sigma_j^2 = C_DKRW x significance(t_j, Q) x sigma^2 for term t_j and class Q; (C_DKRW, sigma^2) pair chosen by cross-validation
Mode         KWs: mode mu_j = C_DKRW; variance sigma^2; (C_DKRW, sigma^2) pair chosen by cross-validation
Mode/TFIDF   KWs: mode mu_j = C_DKRW x significance(t_j, Q); variance sigma^2; (C_DKRW, sigma^2) pair chosen by cross-validation
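The four prior-construction methods of Table 1 amount to a simple mapping from (KW set, significance scores, C_DKRW, sigma^2) to per-feature modes and variances. The sketch below restates Equation 1 and that mapping; the function names and data structures are illustrative, not the input format actually used with BBR in the experiments.

```python
# Sketch of Equation 1 and the four prior-construction methods of Table 1.
import math

def significance(term, tf_in_dk_text, doc_freq, n_dk_texts):
    """Equation 1: logtf(t, d) * idf(t), both computed over the domain
    knowledge corpus.  tf_in_dk_text: term counts in the class's DK text;
    doc_freq: number of DK texts containing each term."""
    tf = tf_in_dk_text.get(term, 0)
    if tf == 0:
        return 0.0
    logtf = 1.0 + math.log(tf)
    idf = math.log((n_dk_texts + 1) / (doc_freq.get(term, 0) + 1))
    return logtf * idf

def build_prior(features, kws, sig, method, c_dkrw, sigma2):
    """Return per-feature (mode, variance) dictionaries for methods
    "Var", "Var/TFIDF", "Mode", "Mode/TFIDF".  OWs always get mode 0
    and variance sigma2; `sig` maps each KW to its significance score."""
    modes, variances = {}, {}
    for f in features:
        mode, var = 0.0, sigma2
        if f in kws:
            if method == "Var":
                var = c_dkrw * sigma2
            elif method == "Var/TFIDF":
                var = c_dkrw * sig[f] * sigma2
            elif method == "Mode":
                mode = c_dkrw
            elif method == "Mode/TFIDF":
                mode = c_dkrw * sig[f]
        modes[f], variances[f] = mode, var
    return modes, variances
```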
5. EXPERIMENTAL METHODS

In this section, we describe our experimental approach to studying the use of domain knowledge in logistic regression.

5.1 Software and Algorithms

As discussed in Section 3, our interest was in domain knowledge techniques that can be used with existing supervised learning algorithms. Here we discuss the particular implementations used in our experiments.

5.1.1 Logistic Regression

We trained and applied all logistic regression models using Version 2.04 of the BBR (Bayesian Binary Regression) package [10] (http://www.stat.rutgers.edu/madigan/BBR/). BBR supports Gaussian and Laplace priors with user-specified modes and variances.

With methods No DK and DK Examples we used prior modes of 0 and chose a common prior variance, sigma^2, from this set of possibilities: 0.25, 1, 2.25, 4, 6.25, 9, 12.25, 16, 20.25, 25, 30.25, 36, 42.25, 49, 56.25, 64, 100, 10000, 1000000, 100000000. The BBR fitting algorithm chose the prior variance that maximized the cross-validated posterior predictive log-likelihood for each training set.

For methods where priors are class-specific, we used cross-validation external to BBR to choose a pair (C_DKRW, sigma^2) from the cross product of a set of values for C_DKRW and the above set of values for sigma^2. For Methods Var and Var/TFIDF, the C_DKRW values tried were 2, 5, 10, 20, 50, 100, and 10000. For Methods Mode and Mode/TFIDF, the C_DKRW values were 0.5, 1, 2, 3, 4, 5, 10, 20, 50, 100, and 10000. As with BBR's internal cross-validation, our external cross-validation chose the pair that maximized cross-validated posterior predictive log-likelihood on the training set.
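The external grid search just described can be sketched as follows. The `fit` and `predict_proba` callables stand in for whatever Bayesian logistic regression trainer is used (BBR in the paper, invoked as an external program); their signatures here are assumptions made for the sketch, not BBR's actual interface, and the grids are abbreviated for brevity.

```python
# Sketch: choose (C_DKRW, sigma^2) by cross-validated predictive log-likelihood.
import itertools
import numpy as np
from sklearn.model_selection import KFold

SIGMA2_GRID = [0.25, 1, 2.25, 4, 6.25, 9, 16, 25, 100, 10000]  # abbreviated grid
CDKRW_GRID = [2, 5, 10, 20, 50, 100, 10000]

def cv_log_likelihood(X, y, c_dkrw, sigma2, fit, predict_proba, n_folds=5):
    total = 0.0
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, held_idx in folds.split(X):
        model = fit(X[train_idx], y[train_idx], c_dkrw, sigma2)
        p = np.clip(predict_proba(model, X[held_idx]), 1e-12, 1 - 1e-12)  # P(y=+1)
        yy = (y[held_idx] + 1) / 2                                        # {-1,+1} -> {0,1}
        total += np.sum(yy * np.log(p) + (1 - yy) * np.log(1 - p))
    return total

def choose_hyperparameters(X, y, fit, predict_proba):
    return max(itertools.product(CDKRW_GRID, SIGMA2_GRID),
               key=lambda pair: cv_log_likelihood(X, y, *pair, fit, predict_proba))
```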
5.1.2 Support Vector Machines

As a baseline to ensure that logistic regression was producing reasonable classifiers without domain knowledge, we trained support vector machine (SVM) classifiers on all training sets. SVMs are one of the most robust and effective approaches to text categorization [12, 13, 15, 28]. In our experiments, we used Version 5.0 of the SVM Light software [12, 13] (http://svmlight.joachims.org/). All options were kept at their default values. Keeping the -c option at its default meant that SVM Light used the default choice (C = 1.0 for our cosine-normalized examples) of the regularization parameter C. We also generated results with the regularization parameter chosen by cross-validation, but these were inferior and are not included here.

5.2 Datasets

Our text classification experiments used three public text categorization datasets for which publicly available domain knowledge texts were available. We chose, as our binary classification tasks, categories with a moderate to large number of positive examples. This enabled experimentation with different training set sizes.

5.2.1 Bio Articles

This collection of full-text biomedical articles was used in the TREC 2004 genomics track categorization experiments [11] (http://trec.nist.gov/data/t13_genomics.html). The genomics track itself featured only a few (and somewhat atypical) categorization tasks. However, because all the articles are indexed in the National Library of Medicine's MEDLINE system, they have corresponding MEDLINE records with manually assigned MeSH (Medical Subject Headings) terms. We posed as our text classification tasks predicting the presence or absence of selected MeSH headings.

Documents. We split the Bio Articles documents into three 8-month segments. We used the first segment for training and the last segment for testing. The middle segment was reserved for future purposes and was not used in the experiments reported here. Training sets of various sizes were drawn from the training population of 3,742 articles (period: 2002-01-01 to 2002-08-31), and classifiers were evaluated on the test set of 4,175 articles (period: 2003-05-01 to 2003-12-31).

Categories. We wanted a set of categories that were closely related to each other (to test the ability of domain knowledge to support fine distinctions) and somewhat frequent in the particular biomedical journal articles we had available. MeSH organizes its headings into multiple tree structures, and we chose the A11 subtree (MeSH descriptor: "Cells") to work with. This subtree contains 310 distinct headings, and we chose to work with the 32 that were assigned to 100 or more of our documents. Note that when deciding whether a MeSH heading was assigned to a document, we stripped all subheadings from the category label.

Domain Knowledge. Each MeSH heading has a detailed entry provided as an aid to both NLM manual indexers and users of MEDLINE. Figure 1 shows a portion of one such entry. We used as our domain knowledge text for a category all words from the MeSH Heading, Scope Note, Entry Term, See Also, and Previous Indexing fields. Entries were taken from the 2005 MeSH keyword hierarchy [1], downloaded in November 2004.

Figure 1: A portion of the MeSH entry for the MeSH heading "Neurons".

  MeSH Heading: Neurons
  Tree Number: A08.663
  Tree Number: A11.671
  Annotation: do not use as a substitute or synonym for BRAIN / cytol
  Scope Note: The basic cellular units of nervous tissue. Each neuron consists of a body, an axon, and dendrites. Their purpose is to receive, conduct, and transmit impulses in the NERVOUS SYSTEM.
  Entry Term: Nerve Cells
  See Also: Neural Conduction
  ...
  Unique ID: D009474
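Assembling the domain knowledge text for a MeSH category is then just a matter of concatenating the selected fields. A minimal sketch, assuming the MeSH record has already been parsed into a field-to-text mapping (the parsing itself is not shown, and the record structure here is hypothetical):

```python
# Sketch: build a category's domain knowledge text from a parsed MeSH record.
DK_FIELDS = ["MeSH Heading", "Scope Note", "Entry Term", "See Also",
             "Previous Indexing"]

def mesh_domain_knowledge_text(record):
    """record: mapping from MeSH field name to its text (possibly empty)."""
    return " ".join(record.get(field, "") for field in DK_FIELDS)
```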
5.2.2 ModApte Top 10

Our second dataset was the ModApte subset of the Reuters-21578 test collection of newswire articles [15] (http://www.daviddlewis.com/resources/testcollections/reuters21578/).

Documents. The ModApte subset contains 9603 Reuters news articles in the training set and 3299 in the test set.

Categories. Following Wu and Srihari [32] (see below), we used the 10 "Topic" categories with the largest number of positive training examples.

Domain Knowledge. In their experiments on incorporating prior knowledge into SVMs, Wu and Srihari [32] manually specified short lists of high-value terms for the top 10 Topic categories. We used those lists (Figure 2) as our domain knowledge texts. Note that due to the small number of these texts and their highly focused nature, IDF weights within the domain knowledge corpus had almost no impact, so methods Var/TFIDF and Mode/TFIDF behaved almost identically to methods Var and Mode, respectively.

Figure 2: Keywords used as prior knowledge for the ModApte Top 10 collection [32].

  Class      Prior Knowledge
  earn       cents cts net profit quarter qtr revenue rev share shr
  acq        acquire acquisition company merger stake
  money-fx   bank currency dollar money
  grain      agriculture corn crop grain wheat usda
  crude      barrel crude oil opec petroleum
  trade      deficit import surplus tariff trade
  interest   bank money lend rate
  wheat      wheat
  ship       port ship tanker vessel warship
  corn       corn

5.2.3 RCV1 A-B Regions

The third dataset was based on RCV1-v2, a test collection of 804,414 newswire articles [15] (http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm or http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm).

Documents. For efficiency reasons, we did not use the full set of 804,414 documents. Our test set was the 120,076 documents dated 20-December-1996 to 19-February-1997. For a large training set, we used the LYRL2004 [15] training set of 23,149 documents from 20-August-1996 to 31-August-1996. Small training sets were drawn from a training population of 264,569 documents (20-August-1996 to 19-December-1996). The remaining documents were set aside for future use.

Categories. We selected a subset of the Reuters Region categories whose names exactly matched the names of geographical regions with entries in the CIA World Factbook (see below) and which had one or more positive examples in our large (23,149 document) training set. There were 189 such matches, from which we chose the 27 with names beginning with the letter A or B to work with, reserving the rest for future use.

Domain Knowledge. The domain knowledge text for each Region category was the corresponding entry in the CIA World Factbook 1996 (http://www.umsl.edu/services/govdocs/wofact96/). Figure 3 shows a portion of the entry for "Afghanistan". The HTML source code of the CIA WFB was downloaded in June 2004. The formatting of the entries did not make it easy to omit field names and boilerplate text. We instead simply deleted (in addition to HTML tags) all terms that occurred in 10% or more of the entries.

Figure 3: A portion of the CIA WFB (1996 edition) entry for the category "Afghanistan".

  Geography
    Location: Southern Asia, north of Pakistan ...
    International disputes: periodic disputes with Iran over Helmand water rights; Iran supports clients in country, private Pakistani and Saudi sources also are active; power struggles among various groups for control of Kabul, ...
  Government
    Name of country: conventional long form: Islamic State of Afghanistan; conventional short form: Afghanistan ...
    Capital: Kabul
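The boilerplate-stripping step for the CIA World Factbook entries can be sketched as a simple document-frequency filter over the domain knowledge corpus. The tokenization is a stand-in for the Lemur processing actually used (Section 5.3):

```python
# Sketch: drop any term appearing in at least 10% of the domain knowledge texts.
def strip_frequent_terms(dk_texts, max_df=0.10):
    """dk_texts: list of token lists, one per domain knowledge entry."""
    n = len(dk_texts)
    df = {}
    for tokens in dk_texts:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return [[t for t in tokens if df[t] / n < max_df] for tokens in dk_texts]
```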
5.3 Text Representation

Text from each training and test document was converted to a sparse numeric vector in SVM Light format (also used by BBR). The Bio Articles documents were in XML format; we concatenated the contents of the title, subject, and abstract elements and deleted all internal XML tags. For ModApte, we used the concatenation of text from the title and body (<BODY>) SGML elements of each article. For the RCV1 A-B Regions collection, we concatenated the contents of the headline (<HEADLINE>) and text (<TEXT>) XML elements of each article.

For all datasets, text processing used the Lemur (http://www-2.cs.cmu.edu/lemur) utility ParseToFile. This performed case-folding, replaced punctuation with whitespace, and tokenized text at whitespace boundaries. The Lemur index files were then converted to document vectors in SVM Light format. In processing text for the Bio Articles and ModApte datasets, the Porter stemmer [21] supplied by Lemur and the SMART [23] stoplist (ftp://ftp.cs.cornell.edu/pub/smart/english.stop) were used in conjunction with ParseToFile. For the RCV1-v2 dataset we used a convenient pre-existing set of document vectors we had prepared using Lemur without stemming or stopping. Domain knowledge text corpora were processed in the same fashion as the corresponding task documents.

Within-document weights were computed using cosine-normalized TFIDF weighting [24]. The initial weight w_{ij} of term t_j in document d_i was

  w_{ij} = (1 + \log_e f_{ij}) \times \log_e\left(\frac{N + 1}{n_j + 1}\right)  if t_j is present in d_i, and w_{ij} = 0 otherwise.

Here N is the number of documents in the training population, f_{ij} is the frequency of term t_j in document d_i, and n_j is the number of training population documents containing term t_j. Note, as in Section 4.2, that we use the Lookahead IDF variant of IDF weighting [7]. Cosine normalization was then applied to the TFIDF values.
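The within-document weighting just defined is straightforward to restate in code. This is a small illustrative sketch (sparse vectors as term-to-weight dictionaries are an assumption, not the SVM Light format itself):

```python
# Sketch: cosine-normalized TFIDF with the "lookahead" (+1) IDF smoothing.
import math

def tfidf_vector(doc_term_freqs, df, n_train_docs):
    """doc_term_freqs: {term: count in this document};
    df: {term: number of training-population documents containing the term};
    n_train_docs: N, the size of the training population."""
    w = {}
    for t, f in doc_term_freqs.items():
        if f > 0:
            idf = math.log((n_train_docs + 1) / (df.get(t, 0) + 1))
            w[t] = (1.0 + math.log(f)) * idf
    norm = math.sqrt(sum(v * v for v in w.values()))
    if norm > 0:
        w = {t: v / norm for t, v in w.items()}
    return w
```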
5.4 Evaluation and Thresholding

We evaluated classification effectiveness using the F1 measure (the harmonic mean of recall and precision) [15, 31], with macroaveraging (the average of per-category F1 values) across categories. Both BBR and SVM Light produce linear classifiers with thresholds intended to minimize error rate, so, after training, we tuned the thresholds to maximize observed F1 on the training data, while leaving the other classifier parameters unchanged.
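The threshold-tuning step keeps the learned weight vector fixed and simply sweeps candidate thresholds over the training-set scores. A minimal sketch of that sweep (the candidate set and data layout are illustrative choices):

```python
# Sketch: choose the decision threshold that maximizes F1 on the training data,
# leaving the learned weight vector unchanged.
import numpy as np

def tune_threshold(scores, y):
    """scores: real-valued classifier outputs on training documents;
    y: labels in {-1, +1}.  Returns the F1-maximizing threshold."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        pred = np.where(scores >= t, 1, -1)
        tp = np.sum((pred == 1) & (y == 1))
        fp = np.sum((pred == 1) & (y == -1))
        fn = np.sum((pred == -1) & (y == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```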
6. RESULTS

Our primary hypothesis was that using domain knowledge texts would greatly improve classifier effectiveness when few training examples are available, and would not hurt effectiveness with large training sets. We also believed, given the diverse and non-document-like forms of the domain knowledge texts, that using them to specify prior distributions in a Bayesian framework was not only more natural, but more effective, than pretending they were additional training examples.

Table 2 summarizes the types of domain knowledge used, and the number of domain knowledge texts used to compute significance values for the Var/TFIDF and Mode/TFIDF methods. The number of categories used in the experiments was 32, 10 and 27 for the Bio Articles, ModApte and RCV1 collections, respectively.

Table 2: Type of domain knowledge texts and size of the domain knowledge corpus for each of the categorization datasets.

  Dataset            Type of DK                DK Texts
  Bio Articles       MeSH scope notes          22,995
  ModApte Top 10     manually selected words   10
  RCV1 A-B Regions   CIA WFB entries           189

6.1 Large Training Sets

This experiment trained classifiers on each collection's large training set. Table 3 presents macroaveraged F1 results for the three test collections. As found elsewhere [10], SVMs and lasso logistic regression show similar effectiveness, and both dominate ridge logistic regression. We note that our macroaveraged F1 value for SVMs on ModApte Top 10 (86.55) is similar to that found by Wu & Srihari (approximately 83.5 on a non-random sample of 1,024 training examples, from the graph in Figure 3 of [32]) and Joachims (82.5 with all 9,603 training examples, computed from his Figure 2 [12]).

Method DK Examples (using domain knowledge texts as artificial positive examples) had little impact on any learning algorithm with these large training sets. The four methods using prior probability distributions had little impact on lasso logistic regression, but gave a substantial benefit to ridge logistic regression on the two datasets with the lowest frequency categories.

Table 3: Macroaveraged F1 results for SVMs, lasso, and ridge logistic regression on three text categorization test collections using large training sets.

               Bio Articles          ModApte Top 10        RCV1 A-B Regions
  Method       SVM    lasso  ridge   SVM    lasso  ridge   SVM    lasso  ridge
  No DK        49.15  54.2   26.3    86.55  84.1   82.9    71.08  62.9   42.2
  DK examples  50.55  54.4   26.8    86.55  84.3   82.1    71.09  64.2   42.3
  Var          -      54.8   47.2    -      84.8   82.8    -      66.4   58.6
  Var/TFIDF    -      55.2   52.2    -      84.6   83.8    -      70.8   68.9
  Mode         -      53.2   35.3    -      84.2   82.7    -      59.2   47.1
  Mode/TFIDF   -      53.3   41.9    -      83.6   83.1    -      64.5   62.9

6.2 Small Training Sets

The small training sets available in practical text classification situations are produced in a variety of unsystematic ways, making it hard to define a "realistic" small training set. We present results for three definitions that exhibit the range of properties we have seen with other definitions.

6.2.1 500 Random Examples

In this experiment we selected random training sets of 500 examples from the training population. The resulting training sets had 2 to 139 positive examples for categories in the Bio Articles collection, 9 to 184 positive examples for categories in the ModApte Top 10 collection, and 0 to 22 positive examples for categories in the RCV1 A-B Regions collection. Table 4 provides the results. Effectiveness is lower than with large training sets, and the effect of the differing class frequencies is obvious. Lasso logistic regression is notably more effective on the small training sets than SVMs and ridge logistic regression. Method DK Examples gave improvements on two of the three collections, but hurt the third. The Bayesian prior based methods, in contrast, always improved logistic regression results. For ridge logistic regression, the improvement was up to 1500%.

Table 4: Macroaveraged F1 results for SVMs, lasso, and ridge logistic regression on three text categorization test collections using 500 random examples in training sets.

               Bio Articles          ModApte Top 10        RCV1 A-B Regions
  Method       SVM    lasso  ridge   SVM    lasso  ridge   SVM    lasso  ridge
  No DK        9.06   35.1   2.6     69.24  72.5   37.6    8.45   23.1   3.3
  DK examples  16.77  38.3   3.3     72.34  72.5   42.7    7.96   21.2   2.7
  Var          -      44.5   34.4    -      74.8   73.1    -      32.9   23.0
  Var/TFIDF    -      49.2   40.9    -      74.8   71.0    -      40.8   33.0
  Mode         -      35.9   12.9    -      76.3   69.6    -      23.8   7.6
  Mode/TFIDF   -      42.5   37.6    -      76.6   73.4    -      31.6   32.2

6.2.2 5 Positive and 5 Random Examples

Operational text classification tasks often originate with a handful of known positive examples. We simulated this by randomly selecting 5 positive examples of each class from the training population, and adding 5 additional examples randomly selected from the remainder without knowledge of class labels (Table 5). Since 5 positive examples is more than occurs in random samples of 500 examples for some classes, effectiveness is sometimes better and sometimes worse than in Table 4. Method DK Examples has a large impact with these tiny training sets, but the impact is sometimes good and sometimes bad. The prior based methods uniformly improve ridge regression (up to 130%) and usually improve lasso regression, though the risky Mode method hurts lasso substantially in two of the conditions.

Table 5: Macroaveraged F1 results for SVMs, lasso, and ridge logistic regression on three text categorization test collections using 5 positive and 5 random examples in training sets.

               Bio Articles          ModApte Top 10        RCV1 A-B Regions
  Method       SVM    lasso  ridge   SVM    lasso  ridge   SVM    lasso  ridge
  No DK        21.51  29.6   18.8    36.53  42.7   27.1    28.90  52.1   23.0
  DK examples  17.78  41.0   11.9    34.52  61.2   22.3    39.29  47.2   38.7
  Var          -      36.3   34.2    -      61.7   62.2    -      47.4   37.1
  Var/TFIDF    -      34.3   35.7    -      61.3   61.5    -      50.7   53.0
  Mode         -      23.7   24.0    -      57.1   62.2    -      34.7   27.2
  Mode/TFIDF   -      36.4   33.9    -      58.5   62.1    -      51.5   48.8

6.2.3 5 Positive and 5 Closest Negative Examples

In a variation on the previous approach, we instead combined each of 5 random positive examples for each class with its nearest (highest dot product) negative neighbor. The theory was that someone attempting to quickly build a small training set might end up with positive and "near miss" examples. It is hard to know if this is true but, surprisingly, effectiveness (Table 6) was lower than when positives were supplemented with random examples (Table 5). In any case, we again see DK Examples having a large but unstable effect. The prior-based methods uniformly, sometimes greatly, improve ridge (up to 127%) and give everything from small decrements (maximum 3.6%) to large improvements (maximum 79.7%) for lasso.
Table 6: Macroaveraged F1 results for SVMs, lasso, and ridge logistic regression on three text categorization test collections using 5 positive and their 5 closest negative examples in training sets.

               Bio Articles          ModApte Top 10        RCV1 A-B Regions
  Method       SVM    lasso  ridge   SVM    lasso  ridge   SVM    lasso  ridge
  No DK        19.87  21.4   18.8    33.41  34.4   33.0    21.84  30.6   23.0
  DK examples  22.34  37.0   10.6    32.99  55.9   23.2    24.45  25.8   35.5
  Var          -      30.5   31.9    -      34.0   60.4    -      37.4   37.3
  Var/TFIDF    -      32.9   34.6    -      47.3   58.9    -      34.1   47.7
  Mode         -      26.7   24.5    -      61.8   58.7    -      29.5   27.8
  Mode/TFIDF   -      36.4   34.2    -      61.4   58.5    -      53.0   52.2

6.3 Analysis

Domain knowledge, in any form, generally had little effect with large training sets. The exception was ridge logistic regression, which was substantially improved on the two collections where some categories had few positives. Ridge regression performed surprisingly poorly, given its popularity. A caveat is that many ModApte and RCV1 Regions categories have a dominant single predictor, a situation that favors lasso.

Treating domain texts as artificial training examples had an erratic impact, sometimes improving and sometimes substantially harming effectiveness. Converting domain texts to priors, on the other hand, almost always improved effectiveness (37 of 48 experimental conditions for lasso, and 48 of 48 for ridge from its poor baseline). As expected, mode-setting was risky, with method Mode proving either the best or, more commonly, the worst of the four prior-setting methods 21 of 24 times.

Where we had nontrivial domain corpus TFIDF weights (Bio Articles and RCV1 A-B Regions), they proved surprisingly useful. Var/TFIDF beat Var in 14 of 16 such conditions, and Mode/TFIDF beat Mode in 16 of 16. Other sources of term quality information, such as stoplists or task-document IDFs, would likely prove useful as well.

Under a view that domain knowledge should do no harm, we recommend either Var/TFIDF, which reduced effectiveness vs. No DK in only 1 of 24 conditions (by 2.7%), or Mode/TFIDF, which reduced effectiveness in only 3 of 24 conditions (by a maximum of 1.7%). Both usually provided large improvements.

7. SUMMARY AND FUTURE WORK

We have presented an initial, but highly effective, strategy for combining domain knowledge with supervised learning for text classification using Bayesian logistic regression. On three data sets, with three diverse sources of domain knowledge, we found large improvements in effectiveness, particularly when only small training sets are available.

We are continuing this work in many directions, including exploring the impact of variability in the choice of both small training sets and domain knowledge texts. Beyond that, our research program is to recast many IR heuristics (stopword lists, stemming, term weighting, etc.) as appropriate priors, with the goal of using simple binary text representations, along with priors for which a somewhat sophisticated user could have meaningful numeric intuitions. Published statements such as "eating food X increases your chance of heart disease by Y%" are often based on parameters of fitted logistic regression models. It does not seem impossible to have intuitions of this form about words in text classification.

Acknowledgements

This work was supported by funds provided by the KD-D group for a project at DIMACS on Monitoring Message Streams, funded through National Science Foundation grant EIA-0087022 to Rutgers University.
The views expressed in this article are those of the authors, and do not necessarily represent the views of the sponsoring agency.

8. REFERENCES

[1] Medical Subject Headings - Home Page, 2005. http://www.nlm.nih.gov/mesh.
[2] C. Buckley, G. Salton, and J. Allan. The effect of adding relevance information in a relevance feedback environment. In SIGIR '94.
[3] B. Carlin and T. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London, 1996.
[4] K. Chai, H. Chieu, and H. Ng. Bayesian online classifiers for text classification and filtering. In SIGIR '02, pages 97-104, 2002.
[5] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. In EMNLP '04, 2004.
[6] W. Cohen and D. Kudenko. Transferring and retraining learned information filters. In AAAI/IAAI '97, pages 583-590, 1997.
[7] A. Dayanik, D. Fradkin, A. Genkin, P. Kantor, D. Lewis, D. Madigan, and V. Menkov. DIMACS at the TREC 2004 genomics track. In TREC '04, 2005.
[8] N. Fuhr and U. Pfeifer. Combining model-oriented and description-oriented approaches for probabilistic indexing. In SIGIR '91, pages 45-56, 1991.
[9] E. Gabrilovich and S. Markovitch. Feature generation for text categorization using world knowledge. In IJCAI '05, pages 1048-1053, 2005.
[10] A. Genkin, D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 2006. To appear.
[11] W. Hersh, R. Bhupatiraju, L. Ross, A. Cohen, D. Kraemer, and P. Johnson. TREC 2004 genomics track overview. In TREC '04, 2004.
[12] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML '98, pages 137-142, 1998.
[13] T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer, 2002.
[14] R. Jones, A. McCallum, K. Nigam, and E. Riloff. Bootstrapping for text learning tasks. In IJCAI '99 Workshop on Text Mining, 1999.
[15] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, April 2004.
[16] F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. In ICML '03, pages 472-479, 2003.
[17] B. Liu, X. Li, W. Lee, and P. Yu. Text classification by labeling words. In AAAI '04, 2004.
[18] D. Madigan, J. Gavrin, and A. Raftery. Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Communications in Statistics - Theory and Methods, pages 2271-2292, 1995.
[19] T. Meyer and B. Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In CEAS '04, 2004.
[20] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI '99 Workshop on Information Filtering, 1999.
[21] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.
[22] H. Raghavan, O. Madani, and R. Jones. Interactive feature selection. In IJCAI '05, pages 841-846, 2005.
[23] G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, 1971.
[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. IPM, 24(5):513-523, 1988.
[25] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[26] R. Schapire, M. Rochery, M. Rahim, and N. Gupta. Incorporating prior knowledge into boosting. In ICML '02, 2002.
[27] H. Schutze, D. Hull, and J. Pedersen. A comparison of classifiers and document representations for the routing problem. In SIGIR '95, pages 229-237, 1995.
[28] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[29] R. Smith. Bayesian and frequentist approaches to parametric predictive inference (with discussion). In Bayesian Statistics 6. Oxford Univ. Press, 1999.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statistical Soc. B, 58:267-288, 1996.
[31] C. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 2nd edition, 1979.
[32] X. Wu and R. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In KDD '04, pages 326-333, 2004.
[33] J. Zhang and Y. Yang. Robustness of regularized linear classification methods in text categorization. In SIGIR '03, pages 190-197, 2003.
[34] T. Zhang and F. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4:5-31, 2001.