The Effect of OCR Errors on Stylistic Text Classification

Sterling Stuart Stein
Linguistic Cognition Lab, Computer Science Dept.
Illinois Institute of Technology
3300 South Federal Street, Chicago, IL 60616-3793
stein@ir.iit.edu

Shlomo Argamon
Linguistic Cognition Lab, Computer Science Dept.
Illinois Institute of Technology
3300 South Federal Street, Chicago, IL 60616-3793
argamon@iit.edu

Ophir Frieder
Information Retrieval Lab, Computer Science Dept.
Illinois Institute of Technology
3300 South Federal Street, Chicago, IL 60616-3793
ophir@ir.iit.edu

ABSTRACT
Interest has recently been growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study the extent to which OCR errors affect stylistic text classification of scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.3 [Information Search and Retrieval]: Retrieval models; I.7.5 [Document Capture]: Optical Character Recognition

General Terms: Experimentation

Keywords: OCR, OCR errors, text classification

1. INTRODUCTION
Recently, interest has grown in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. Research on these problems, like work on 'classical' topic-based text analysis, has focused mainly on electronically produced digital documents. Real-world applications of automated stylistics in litigation, national security, and humanities scholarship, however, require analysis of real paper documents that have been scanned and digitized via OCR, and even the best OCR is not perfect: it introduces many transcription errors. We present results of the first study we know of to evaluate the performance of style-based text classification on a corpus of OCR-processed texts (OCR), comparing classification accuracy to hand-corrected (Correct) versions of the same texts. We study text classification by genre (research reports, memos, etc.) in tobacco industry documents.

The parallel question has been investigated previously for topic-based information retrieval: Taghva and Coombs [1] found that a search engine could be made to work well over OCR documents by accounting for the types of errors that OCR introduces. They ran misspelled words through an OCR-specific spell-checker and indexed the returned words based on a function of their probabilities.

Our evaluation is part of a larger project to develop a text collection [3] and an integrated prototype for complex document information processing (CDIP), dealing with scanned documents that contain non-textual items as well as printed text. Our initial results show that OCR errors, though numerous, have little to no effect on classification of text type.

[Figure 1: A document as an original image, after OCR, and after being corrected. (a) Section of original document image; (b) OCR-extracted text; (c) hand-corrected text.]

2. CORPUS
The corpus used in this study is composed of documents from the Legacy Tobacco Documents Library (http://legacy.library.ucsf.edu/), which we are using to build our IIT CDIP testbed. Each scanned document was run through OCR; there are 646 documents whose OCRed text has been hand-corrected. Each document carries a variety of metadata, including its document type, such as "Memo" or "Scientific Report"; it is these categories that we attempt to predict in stylistic classification. The form of these documents can be seen in Figure 1.

In the raw data there are many different such text-type labels, with inconsistencies in labeling such as "Other Report" versus "Report, Other". We manually combined these labels into 9 main text types; documents of types that occurred fewer than 10 times in our corpus were removed, leaving 326 documents in total. A summary of the corpus is given in Figure 2. To measure the distance between OCR and Correct, we used Levenshtein (edit) distance [2] normalized by the length of the correct version. The OCR is reasonably accurate for text in paragraphs, but it is easily confused by headers; headers were not removed.

Figure 2: Composition of the corpus. Distance between OCR and corrected versions was measured as the average edit distance between the texts as strings, normalized by document length, treating consecutive whitespace as a single space character.

Text type        # Docs   Avg. Dist.
Advertisement      13       0.60
Table, etc.        15       0.61
Correspondence     16       0.28
Published Doc      19       0.35
Press Release      28       0.14
Science            29       0.51
Other Report       42       0.31
Report             75       0.28
Memo               89       0.32
Total             326       0.33
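As a concrete illustration of this measurement, the sketch below computes the normalized edit distance between an OCRed text and its corrected version. This is a minimal reconstruction, not the authors' code; the function names are ours, and we assume, as stated in Figure 2, that runs of whitespace are collapsed to a single space before comparison.

    import re

    def edit_distance(a, b):
        # Standard Levenshtein distance via dynamic programming over two rows.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def normalized_ocr_distance(ocr_text, correct_text):
        # Edit distance normalized by the length of the corrected version,
        # treating consecutive whitespace as a single space character.
        ocr = re.sub(r"\s+", " ", ocr_text.strip())
        correct = re.sub(r"\s+", " ", correct_text.strip())
        return edit_distance(ocr, correct) / len(correct)

For instance, normalized_ocr_distance("th3 C0mmission", "the Commission") returns roughly 0.14, since two of the fourteen characters differ.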
3. METHODOLOGY
We applied a Support Vector Machine (SVM) learning method to build classification models. As input features, we used several types of numeric vectors, computed as the relative frequencies of textual attributes in each text. Probably the most common type of feature for stylistic text classification is function words, which have been shown to be useful in many studies. Another useful type of feature is character n-grams [4]. We compared results for the two feature types separately.

For function words, we used a predefined list of English function words and computed the per-word frequency of each function word in each text as input features. For n-gram features, all character n-grams (for n ∈ {2, 3, 4}) were counted, and the 1000 most common in the corpus overall were identified. Their counts were normalized by the length of each text and used as input features. For both OCR and Correct, these features were run through WEKA's SMO SVM [5] using the default settings with 10-fold cross-validation.
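To make the feature pipeline concrete, the sketch below shows one way the feature vectors and the cross-validated classification could be computed. It is an illustrative reconstruction rather than the authors' setup: the study used WEKA's SMO SVM, whereas here a scikit-learn linear SVM and a small placeholder function-word list stand in, and all helper names are ours.

    from collections import Counter
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # Placeholder only; the study uses a predefined list of English function words.
    FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "was"]

    def char_ngrams(text, n):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def top_ngram_vocabulary(texts, ns=(2, 3, 4), top_k=1000):
        # The top_k most frequent character n-grams over the whole corpus.
        counts = Counter()
        for text in texts:
            for n in ns:
                counts.update(char_ngrams(text, n))
        return [gram for gram, _ in counts.most_common(top_k)]

    def ngram_features(text, vocabulary, ns=(2, 3, 4)):
        # Counts of the vocabulary n-grams, normalized by the text's length.
        counts = Counter()
        for n in ns:
            counts.update(char_ngrams(text, n))
        length = max(len(text), 1)
        return [counts[gram] / length for gram in vocabulary]

    def function_word_features(text):
        # Per-word frequency of each function word.
        words = text.lower().split()
        counts = Counter(words)
        total = max(len(words), 1)
        return [counts[w] / total for w in FUNCTION_WORDS]

    def cross_validated_accuracy(feature_vectors, labels):
        # LinearSVC stands in for WEKA's SMO; 10-fold cross-validation as in the paper.
        scores = cross_val_score(LinearSVC(), feature_vectors, labels, cv=10)
        return scores.mean(), scores.std()

Either representation (function-word vectors or n-gram vectors) would be passed to cross_validated_accuracy separately, once for the OCR texts and once for the Correct texts.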
4. RESULTS
Overall results for the different feature sets can be seen in Figure 3; accuracies are all around 35-45%. Note that the baseline accuracy of always predicting the majority class would be 23%, so we are doing much better overall. The error bars shown are the standard error across cross-validation folds. More to the point, it is clear that text-type classification accuracy for OCR is not much lower than that for Correct, and is actually slightly higher for function words, though not significantly. In Figure 4, we show the differences in precision and recall between OCR and Correct for the various document types. No clear pattern emerges, but our corpus is too small to make any definitive statements.

[Figure 3: The 10-fold cross-validation accuracy, with error bars, for the feature sets FW (function words), 2-grams, 3-grams, 4-grams, and All, for both OCR and Correct. Note that, considering the error, there is no significant difference between OCR and Correct.]

[Figure 4: The difference in precision and recall between OCR and Correct for 2-gram features. Bars are grouped by text type, in order of increasing average distance. Positive means that Correct did better than OCR.]

5. DISCUSSION
We have found that stylistic classification accuracy is not significantly harmed, if at all, by OCR errors; in some cases it even appeared slightly better. This illustrates how close the two data sets are, despite the errors: even though OCR contained many character-level errors when compared to Correct, the accuracy of stylistic classification was comparable. This result argues that computational stylistics should be applicable to scanned document collections without much modification, although further work will be needed to examine different sorts of style-based text classification problems.

Acknowledgments
This work was supported in part by an ARDA Challenge Grant.

6. REFERENCES
[1] K. Taghva and J. Coombs. Hairetes: A search engine for OCR documents. In Intl. Workshop on Document Analysis Systems, pages 412-422, August 2002.
[2] V. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8-17, 1965.
[3] D. D. Lewis, S. Argamon, G. Agam, O. Frieder, D. Grossman, and J. Heard. Building a test collection for complex document information processing. In SIGIR 2006, 2006.
[4] O. Uzuner and B. Katz. A comparative study of language models for book and author recognition. Springer-Verlag, page 969, 2005.
[5] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham. Weka: Practical machine learning tools and techniques with Java implementations, 1999.