SIGIR 2007 Proceedings

Demonstration

Text Categorization for Streams
D. L. Thomas
School of Informatics, University of Wales, Bangor dthomas@informatics.bangor.ac.uk

W. J. Teahan
School of Informatics, University of Wales, Bangor wjt@informatics.bangor.ac.uk
identification, topic categorization and spam filtering. PPM uses a variable order Markov language modeling approach based on estimating probabilities of upcoming symbols on a finite prior context. An alternative to the probabilistic approach is to use simple count-based measures such as R-Measure [ 5 ] which is the normalized sum of the length of common substrings between the training and test streams, and C-Measure [ 6 ], a sum of the number of common substrings of a fixed length. Khemelev and Teahan [ 5 ] found that for the authorship categorization problem on the Reuters Collection (RCV1) streambased approaches (R-Measure and the PPM compression scheme) significantly outperformed the well-performed feature-based approach of SVM (89% accuracy versus 85%). However, results for the Reuters-21578 collection indicate that for topic categorization, feature-based approaches are superior. A streambased approach has the advantage that it does not need to perform any pre-processing of the data (such as word segmentation). The primary motivation behind jSCAT is to provide a workbench for rapid prototyping and evaluation of stream-based text categorization models and applications. The workbench implements two versions of the PPM algorithm, PPMC and PPMD, and various parameters can be set when training the models such as the order of the model, and the size of the alphabet. R-Measure and C-Measure have also been implemented; the fixed window size of the latter is another parameter that can be altered. A novelty of the system is that the implementation of all three models ­ PPM, R-Measure and C-Measure ­ which relies on a single common suffix tree data structure that is accessed by all three models. The workbench also allows this data structure to be visualized which provides a novel means to examine interactively the similarities between the training and test streams.

ABSTRACT

We describe a novel system for evaluating and performing streambased text categorization. Stream-based text categorization considers the text being categorized as a stream of symbols, which differs from the traditional feature-based approach which relies on extracting features from the text. The system implements characterbased languages models­specifically models based on the PPM text compression scheme­as well as count-based measures such as RMeasure and C-Measure. Use of the system demonstrates that all of these techniques outperform SVM, a feature-based classifier, at stream-related classification tasks such as authorship ascription.

I.2.7 [Artificial Intelligence]: Natural Language Processing ­ text analysis.

Categories and Subject Descriptors General Terms

Algorithms, Design, Experimenta ion. t

Keywords
Text categorization, PPM language models, R-Measure, C-Measure We are in the process of designing a Text Understanding Interface (TUI) in Java to be used for implementing and evaluating various natural language processing, text mining and text categorization applications. TUI will have several main components: · · · · jTMT ­ the text mining component based on the Text Mining Toolkit (TMT) written in C [ 1 ]; jSCat ­ the stream-based text categorization component described here; jGE ­ a Java implementation [ 2 ] of the Grammatical Evolution evolutionary algorithm (O'Neill 2001). jKA ­ a component for performing information retrieval in a peer to peer network based on knowledgeable agents and the QITEKAT Question Answering system [ 3 ].

1. INTRODUCTION

3. REFERENCES

[1] Cleary, J. G. and Teahan, W. J. An Open Interface for Probabilistic Models of Text, Data Compression Conference, Proceedings, Pages 522 ­ 566, 1999. [2] Georgiou, L. and Teahan, W. J. (2006) jGE - A Java implementation of Grammatical Evolution. 10th WSEAS International Conference on Systems, Athens, Greece, 2006. [3] Clifton, T. and Teahan, W. J. 2005 "Knowing Aboutness: Question Answering using a logic-based framework". In Proceedings ECIR 2005, Spain. [4] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402, 1984. [5] Khemelev, D. and Teahan, W. J. A repetition based measure for verification of text collections and for text categorization, SIGIR '2003, July 28-August 1, 2003, Toronto, Canada. [6] Hunnisett, D. S. and Teahan, W. J. Context-based methods for text categorisation, SIGIR '04,pages 578-579.

Several of the components have already been completed with excellent results and the purpose of this paper, and of the demo, is to describe and showcase the stream-based text categorization component, jSCat.

2. jSCat ­ Text Stream Categorization System

Stream-based text categorization considers the text being categorized as a stream of symbols, which differs from the traditional feature-based approach which relies on extracting features from the text. Stream-based techniques have recently received attention in the literature, especially PPM [ 4 ] for many text categorization tasks such as authorship ascription, language
Copyr ight is held by the author/owner(s). SI GI R' 07, July 23­27, 2007, Ámsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

907