SIGIR 2007 Proceedings Poster

Semantic Text Classification of Disease Reporting

Yi Zhang and Bing Liu
Department of Computer Science, University of Illinois at Chicago
851 South Morgan Street, Chicago, IL 60607-0753
{yzhang3, liub}@cs.uic.edu

ABSTRACT
Traditional text classification studied in the IR literature is mainly based on topics. That is, each class or category represents a particular topic, e.g., sports, politics, or science. However, many real-world text classification problems require more refined classification based on some semantic aspect. For example, in a set of documents about a particular disease, some documents may report an outbreak of the disease, some may describe how to cure it, some may discuss how to prevent it, and some others may include all of the above. To classify text at this semantic level, the traditional "bag of words" model is no longer sufficient. In this paper, we report a text classification study at the semantic level and show that sentence semantic and structure features are very useful for this kind of classification. Our experimental results on a disease outbreak dataset demonstrate the effectiveness of the proposed approach.

procedure had been very successful in treating 120 cholera patients around the country". Both sentences are on the topic of cholera; semantically, however, they are quite different. The problem is how to separate sentences based on the required semantic aspects or categories, i.e., in this case, reporting a possible outbreak or not. We note that sentences rather than documents are used here because a document contains many sentences, and each sentence can carry a quite different semantic meaning. Classifying at the document level is therefore less meaningful; classifying at the sentence or passage level is more appropriate. In this paper, we focus on the sentence level. We should also note that a sentence can belong to multiple semantic categories.
For example, the second sentence above can belong to categories such as "hospital research", "cholera treatment", and "success stories of the district hospital". To our knowledge, limited research has been done on semantic text classification, yet it is very important for practical applications. This paper shows that both the words used in sentences and sentence semantic characteristics are important. Our experimental results confirm that their combination produces more accurate classifiers than either of them alone.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - search process; H.4.m [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Experimentation.

Keywords
Semantics, Text Classification.

2. THE PROPOSED TECHNIQUE
As indicated above, we use both words and semantic features for model building. Since words are used in the same way as in traditional classification, we do not discuss them further. Below, we focus only on the semantic features used in our task. Five categories of semantic features are extracted from the dependency tree of a sentence: the center noun, the center verb, the adjective, modifiers of the center noun, and modifiers of the center verb. The dependency tree of a sentence is generated using MINIPAR [2]. In order to recognize infectious disease names, we complemented the standard MINIPAR data with infectious disease names. Figure 1 shows an example dependency tree (dependency trees have also been used for paraphrasing in the information extraction area [5]). This dependency tree is generated from the sentence "Belgium has reported three cases of mad cow disease".

1. INTRODUCTION
In traditional topic-based text classification, the "bag of words" representation of text documents is often sufficient because a topic can usually be characterized by a set of topic-specific keywords [6].
However, for semantic text classification, single-word or even n-gram representations are no longer sufficient. The system needs to capture semantic characteristics of the text from the different classes in order to perform more accurate classification. In this paper, we propose to combine the "bag of words" scheme with semantic features for semantic text classification. As a case study, we investigate the disease domain. That is, we want to separate sentences that report disease outbreaks from sentences that do not. For example, the following sentence reports a possible disease outbreak: "10 people were diagnosed with cholera in the district hospital early today". The following sentence does not report an outbreak: "the district hospital reported today that their new cholera treatment

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

Figure 1. An example dependency tree.

In a dependency tree, an arrow points from a parent node to the node that the parent governs. Note that the disease name "mad cow disease" is a named entity, so it is represented by a single node even though it consists of three words. The center noun is the head noun of the noun phrase containing an infectious disease name; the center verb is the verb governing the center noun. Together they form the basic skeleton of a sentence or phrase and play the most important role in the sentence's semantic meaning. For example, in the sentence "Belgium has reported three cases of mad cow disease", the noun phrase containing the infectious disease (mad cow disease) is "three cases of mad cow disease", so the center noun is "cases" and the center verb is "reported". The center noun and center verb words are not used directly as features; instead, we have manually compiled noun clusters and verb clusters. Each cluster contains a group of similar words.
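The center-noun/center-verb extraction described above can be sketched in code. Since MINIPAR is not used here, the example below assumes a minimal hand-built dictionary-based dependency tree; the node layout, the `DISEASES` set, and the function name are illustrative, not the authors' implementation.

```python
# Sketch: find the center noun and center verb for a disease mention in a
# dependency tree. Nodes are dicts: {"word", "pos", "children"}.
# "mad cow disease" is a named entity, hence a single node.

DISEASES = {"mad cow disease", "cholera"}  # illustrative disease-name list

def find_centers(node, last_verb=None, last_noun=None):
    """Depth-first search for the first disease mention.

    Returns (center_noun, center_verb): the nearest governing noun (head of
    the noun phrase containing the disease name) and the verb governing it.
    """
    if node["word"] in DISEASES:
        # If the disease name is itself the center noun, the paper uses a
        # special feature value; "DISEASE_AS_NOUN" stands in for it here.
        return (last_noun or "DISEASE_AS_NOUN", last_verb)
    if node["pos"] == "V":
        last_verb = node["word"]
    if node["pos"] == "N":
        last_noun = node["word"]
    for child in node.get("children", []):
        hit = find_centers(child, last_verb, last_noun)
        if hit:
            return hit
    return None

# Tree for "Belgium has reported three cases of mad cow disease"
tree = {"word": "reported", "pos": "V", "children": [
    {"word": "Belgium", "pos": "N", "children": []},
    {"word": "cases", "pos": "N", "children": [
        {"word": "three", "pos": "A", "children": []},
        {"word": "mad cow disease", "pos": "N", "children": []},
    ]},
]}

centers = find_centers(tree)  # ("cases", "reported")
```

For the example sentence, the search reaches the "mad cow disease" node with "cases" as the nearest governing noun and "reported" as the governing verb, matching the description in the text.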
For example, one noun cluster contains "case", "report", and "instance". The clusters are mutually non-overlapping. The cluster that a center noun or a center verb belongs to is what is actually used as its feature value. The adjective word below the center verb and above the center noun in the dependency tree is used as the third feature. For example, in "Thirteen employees are seriously ill with diarrhea", the center verb is "are" and the adjective word is "ill". Similarly, adjective words are consolidated using precompiled adjective clusters. The adjective is used together with the center verb to help refine the verb's meaning, especially when the center verb, such as a form of the verb "be", is too general to carry a specific meaning. Finally, the following modifiers of the center noun and the center verb are used: negative modifiers of the center noun such as "no" and "zero", negative modifiers of the center verb such as "not" and "never", and modifiers of the center verb indicating a subjunctive mood such as "could" and "would". The subjunctive mood is commonly used to describe something that has not happened but is expected or imagined to happen, and negative modifiers are widely used to negate statements. Hence, these modifiers are often critical for determining whether a sentence is actually describing a fact, e.g., a disease outbreak, or is merely stating a hypothesis or negating a fact. If any feature described above is present in the dependency tree, it is included as a semantic feature. Note that a center noun is always present; if an infectious disease name is itself the center noun, a special value is used for the feature. A sentence can generate multiple instances of semantic features because it may mention multiple infectious disease names.

runs of each algorithm, with 90% of the data used for training and 10% for testing in each run.
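The cluster lookup and modifier features above can be sketched as follows. The cluster contents and cluster IDs are illustrative stand-ins, not the authors' manually compiled lists, and the function name is assumed for this sketch.

```python
# Sketch: map center words to cluster IDs and flag modifier features.
# Cluster membership here is illustrative, not the paper's actual lists.

NOUN_CLUSTERS = {"case": "NC_case", "report": "NC_case", "instance": "NC_case"}
VERB_CLUSTERS = {"report": "VC_report", "confirm": "VC_report", "be": "VC_be"}
ADJ_CLUSTERS = {"ill": "AC_sick", "sick": "AC_sick"}
NEG_NOUN_MODS = {"no", "zero"}          # negate the center noun
NEG_VERB_MODS = {"not", "never"}        # negate the center verb
SUBJUNCTIVE_MODS = {"could", "would"}   # signal a subjunctive mood

def semantic_features(center_noun, center_verb, adjective, noun_mods, verb_mods):
    """Build one semantic-feature instance for one disease mention.

    Clusters replace the raw words; modifier flags capture negation and
    subjunctive mood, which distinguish facts from hypotheses/negations.
    """
    return {
        "noun_cluster": NOUN_CLUSTERS.get(center_noun, "NC_other"),
        "verb_cluster": VERB_CLUSTERS.get(center_verb, "VC_other"),
        "adj_cluster": ADJ_CLUSTERS.get(adjective, "AC_none"),
        "noun_negated": any(m in NEG_NOUN_MODS for m in noun_mods),
        "verb_negated": any(m in NEG_VERB_MODS for m in verb_mods),
        "subjunctive": any(m in SUBJUNCTIVE_MODS for m in verb_mods),
    }

# A sentence mentioning several diseases would yield several such instances.
feats = semantic_features("case", "be", "ill", [], ["not", "would"])
```

In this example the verb modifiers "not" and "would" set both the negation and subjunctive flags, so the instance would not be read as reporting a factual outbreak.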
Sentence features (denoted by "sentences" in the table) are standard word terms, i.e., they are stemmed and have stopwords removed. We observe that the sentence semantic features (denoted by "S-features" in the table) are very useful. Both SVM (with a linear kernel) and NB (naïve Bayes) produce better results when sentences and semantic features are used together. NB using both sentences and S-features produces the best F1-score, 70.2%. Using sentences or S-features alone produces much lower F1-scores. Using S-features alone, NB outperforms SVM. All tests reported in the table were done using Rainbow [3] and SVM-light [1], based on 1-grams (the traditional "bag of words" model) for sentences. We also experimented with 2-, 3- and 4-grams, with poorer results. Note that the combination of sentence features and S-features is done by simply appending the S-features from each sentence to the sentence (with equal weights).

Table 1. Experimental results

                               Vocabulary size  Precision  Recall  F1-score
SVM (S-features)                           667      0.588   0.437     0.500
SVM (sentences)                           6098      0.567   0.615     0.590
SVM (sentences + S-features)              6765      0.650   0.669     0.659
NB (S-features)                            667      0.644   0.497     0.557
NB (sentences)                            6098      0.656   0.669     0.662
NB (sentences + S-features)               6765      0.704   0.702     0.702

4. CONCLUSION
In this paper, we studied the semantic classification of disease outbreak sentences, which sheds some light on the general issue of semantic text classification. Our preliminary results show that sentence semantic and structure features are useful in improving classification accuracy. In our future work, we plan to improve the accuracy further and to study the general problem.

5. ACKNOWLEDGEMENTS
This project is funded by the Great Lakes Protection Fund. We thank Karl Rockne for useful discussions.

6. REFERENCES
[1] Joachims, T. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, Schölkopf, B., Burges, C., and Smola, A. (Eds.). MIT Press, 1999.
[2] Lin, D. and Pantel, P.
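The combination step described above, appending S-features to a sentence's word terms with equal weights, can be sketched as follows. Stemming is omitted and the stopword list and `SF=` token prefix are illustrative assumptions, not the actual Rainbow/SVM-light preprocessing.

```python
# Sketch: combine bag-of-words tokens with S-feature tokens by simple
# appending. Stemming is omitted; the stopword list is illustrative.

STOPWORDS = {"the", "of", "with", "in", "a", "an", "has"}

def combine(sentence, s_features):
    """Return one token list: sentence words plus S-feature tokens.

    S-features are appended as extra tokens with equal weight, so any
    bag-of-words classifier (NB, linear SVM) can consume them directly.
    """
    words = [w.lower().strip('.,"') for w in sentence.split()]
    tokens = [w for w in words if w and w not in STOPWORDS]
    return tokens + ["SF=" + f for f in s_features]

doc = combine("Belgium has reported three cases of mad cow disease",
              ["NC_case", "VC_report"])
```

The resulting token list feeds into the same 1-gram pipeline as plain sentences, which is why no change to the classifiers themselves is needed.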
Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7(4), 2001.
[3] McCallum, A. K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. 1996.
[4] ProMED-mail. 2007.
[5] Shinyama, Y. and Sekine, S. Paraphrase Acquisition for Information Extraction. In Proceedings of the Second International Workshop on Paraphrasing, 2003.
[6] Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. SIGIR-1999.

3. EXPERIMENTS
We now report the experimental results. Our dataset consists of sentences related to a set of infectious diseases. Some report outbreaks and some do not, but all contain disease names of interest. The sentences were extracted from disease reporting documents obtained from ProMED-mail [4], and we labeled them manually. The dataset consists of 604 emerging disease reporting (EDR) sentences and 1533 non-emerging disease reporting (Non-EDR) sentences. Table 1 gives the average precision, recall, and F1-score of the various techniques. The averages are obtained from 6 random
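The evaluation protocol, averaging precision, recall, and F1-score over six random 90/10 train/test splits, can be sketched as below. The trivial rule-based predictor stands in for the actual NB/SVM classifiers, and all names are assumed for this sketch.

```python
# Sketch: average precision/recall/F1 over random 90/10 splits.
# A toy rule-based predictor stands in for the real NB/SVM classifiers.
import random

def prf1(gold, pred, positive="EDR"):
    """Precision, recall, and F1 for the positive (EDR) class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def evaluate(data, train_and_predict, runs=6, seed=0):
    """Average P/R/F1 over `runs` random 90%/10% train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        items = data[:]
        rng.shuffle(items)
        cut = int(0.9 * len(items))
        train, test = items[:cut], items[cut:]
        pred = train_and_predict(train, [x for x, _ in test])
        scores.append(prf1([y for _, y in test], pred))
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Toy data and a toy rule in place of a trained classifier
data = [("outbreak sentence", "EDR")] * 30 + [("treatment sentence", "Non-EDR")] * 30
rule = lambda train, xs: ["EDR" if "outbreak" in x else "Non-EDR" for x in xs]
avg_p, avg_r, avg_f = evaluate(data, rule, runs=6)
```

Plugging the trained NB or SVM model in as `train_and_predict` reproduces the averaging scheme behind Table 1.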