SIGIR 2007 Proceedings Poster

Detecting Word Substitutions: PMI vs. HMM

Dmitri Roussinov, Arizona State University, P.O. Box 873606, Tempe, AZ 85287, +1 480 965 8488, dmitri.roussinov@asu.edu
SzeWang Fong, School of Computing, Queen's University, Kingston, Ontario, Canada, +1 613 533 6065, fong@cs.queensu.ca
David Skillicorn, School of Computing, Queen's University, Kingston, Ontario, Canada, +1 613 533 6065, skill@cs.queensu.ca

ABSTRACT
Those who want to conceal the content of their communications can do so by replacing words that might trigger attention. For example, instead of writing "The bomb is in position", a terrorist may choose to write "The flower is in position." The substituted sentence sounds somewhat "odd" to a human reader, and prior research has shown that such oddity is detectable by text-mining approaches. However, the importance of each component of the proposed oddity-detection approach has not been thoroughly investigated, and the approach has not been compared with an obvious candidate for the task, the Hidden Markov Model (HMM). In this work, we further explore the oddity-detection algorithms reported earlier, specifically those based on pointwise mutual information (PMI) and Hidden Markov Models.

Categories & Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering

General Terms: Algorithms.

Keywords: Security informatics, text mining, text oddity, detecting substitutions.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION AND PRIOR WORK
Terrorists and criminals must be aware of the possibility of interception whenever they communicate by phone or email. In particular, terrorists must be aware of systems such as Echelon that examine a very large number of messages and select some for further analysis based on a watch-list of significant words. Skillicorn [6] showed that when a word is replaced by a word of substantially different natural frequency, a different kind of potentially detectable signature is created. That approach, however, does not work when the frequencies of use are the same. Since data on word-usage frequencies is easily available (e.g., by querying Google), adversaries can choose replacement words with matching frequencies to avoid detection. Detecting this kind of substitution (oddity) by Internet text mining with pointwise mutual information (PMI) and other metrics has been proposed, but not yet rigorously explored [4]. Specifically, it is still not known which heuristic decisions behind the PMI-based algorithm are crucial, and how it would compare against better-known approaches such as the Hidden Markov Model used in [3] for the related task of detecting spam.

The problem of detecting a word that is somehow out of context occurs in a number of settings. For example, speech recognition algorithms model the expected next word and back up to a different interpretation when the next word becomes sufficiently unlikely [1]. Detecting words out of context has also been applied to spam detection. For example, SpamAssassin uses rules that detect words such as 'V!agra'. The problem is similar to detecting misspellings, except that the transformations preserve certain visual qualities rather than reflecting lexical formation errors. Lee and Ng [3] detect word-level manipulations typical of spam using Hidden Markov Models. The task of oddity detection can also be considered as the task of detecting words that are "out of context," that is, surrounded by words with which they typically do not co-occur. The task of detecting typical co-occurrences of words in a specific context was considered in [5].

2. MODELS CONSIDERED
The pointwise mutual information (PMI) measure is used extensively in data mining and was introduced into text mining by Turney [7]. The intuition behind applying PMI is that if the target word is contextually appropriate (not substituted), then it should be part of some stable phrase (expression). Such a stable phrase should occur on the Web (or in a suitably large corpus) more often than random chance dictates. Consider a word of interest (target) and an adjacent region of the sentence. The pointwise mutual information of the pair "target + adjacent region" is given by:

PMI = p(target + adjacent region) / (p(target) p(adjacent region)),

where p(x) is the probability of occurrence of the phrase x verbatim and "+" denotes concatenation. Although the PMI formula uses probabilities of occurrence, it has commonly been approximated using f(x), the number of occurrences of x on the Web [7]; thus:

PMI = f(target + adjacent region) / (f(target) f(adjacent region)).

The PMI-based algorithm from [4] calculates a family of pointwise mutual information measures using all the adjacent regions that are still within the same sentence. Adjacent regions can precede or follow the word of interest (target). The algorithm computes the maximum pointwise mutual information (max PMI) over all choices of adjacent regions, because being part of at least one stable phrase is enough for the target to be cleared as not odd. For example, for the sentence "The review process must accomplish certain goal" and the target "review", PMI(review process must) = 5776, but PMI(review process) = 9135, which is a stronger indication of not being odd. This approach only provides data when the frequency of occurrence is non-zero, which is not the case for many "very odd" combinations (e.g., "flower process") or long phrases.
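The max-PMI computation described above can be sketched as follows. This is a minimal illustration only: the `hit_count` helper and its toy counts are hypothetical stand-ins for the Web frequency queries, not part of the algorithm in [4].

```python
# Sketch of the count-based max-PMI computation:
# PMI = f(target + region) / (f(target) * f(region)),
# maximized over all regions adjacent to the target in the sentence.

def hit_count(phrase: str) -> int:
    """Hypothetical stand-in for a search-engine hit-count query (toy numbers)."""
    counts = {
        "review": 1000, "review process": 200, "review process must": 5,
        "process": 800, "process must": 40, "must": 2000,
    }
    return counts.get(phrase, 0)

def adjacent_regions(words, i, max_len=3):
    """All regions adjacent to words[i] (preceding or following) within the sentence."""
    regions = []
    for k in range(1, max_len + 1):
        if i - k >= 0:
            regions.append((" ".join(words[i - k:i]), "left"))
        if i + k <= len(words) - 1:
            regions.append((" ".join(words[i + 1:i + k + 1]), "right"))
    return regions

def max_pmi(words, i):
    """Maximum count-based PMI of target words[i] over all adjacent regions."""
    target = words[i]
    best = 0.0
    for region, side in adjacent_regions(words, i):
        phrase = f"{region} {target}" if side == "left" else f"{target} {region}"
        denom = hit_count(target) * hit_count(region)
        if denom:  # skip regions with no frequency data
            best = max(best, hit_count(phrase) / denom)
    return best
```

With the toy counts above, the shorter region "process" yields a higher PMI for the target "review" than the longer "process must", mirroring the paper's review-process example.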
However, in addition, the algorithm estimates an upper bound on PMI as follows:

PMIu = 1 / (f(target) f(adjacent region)).

Using 1 is justified by the following consideration: when the probability of occurrence of a phrase is very small, its expected number of occurrences in the corpus is less than one, which is why a search engine would most likely report f(phrase) = 0. Using 1 instead of 0 therefore gives an upper bound on PMI in that case. For detection, the algorithm computes the minimum of all the upper bounds (min PMIu) across adjacent regions, because if any part of the sentence is odd, then the entire sentence is flagged as odd. Thus, the final decision about oddity follows a logical rule: the target is flagged as odd only if (max PMI > 0 and max PMI < threshold) OR (max PMI = 0 and min PMIu < threshold). The rule therefore has two distinct components, and the importance of each has not yet been investigated.

The Hidden Markov Model (HMM) is popular in speech recognition. It estimates the probability of occurrence of a word based on the preceding adjacent region. Following considerations similar to those above, we compute this probability as follows:

HMM = f(left adjacent region + target) / f(left adjacent region).

Again, we compute max HMM over all adjacent regions and check whether it is below the threshold for detection.

3. EMPIRICAL EVALUATION
We employed the same testing methodology as in [4]: the MSN search engine as the source of frequency data, and a random sample of 1400 sentences drawn from the publicly available Enron data set of email messages. To be selected, each sentence had to have its first noun present in the BNC list [2]. To simulate substitutions, we replaced that noun with an adjacent word in the BNC frequency ranking for nouns. We tested each metric described above, and also the PMI algorithm with each of its parts disabled in turn. Figure 1 shows the ROC curves obtained by varying the threshold. Since each detection approach has only one parameter (the threshold), we did not expect overfitting to be an issue and decided not to run cross-validation. The results clearly demonstrate that 1) the complete PMI metric is better than HMM; 2) both parts ("max PMI > 0" and "max PMI = 0") of the PMI algorithm are important; and 3) the performance of HMM is comparable to that of PMI with the "max PMI > 0" component only.

Figure 1. ROC curves for the tested substitution detection algorithms.

4. CONCLUSIONS, LIMITATIONS AND FUTURE RESEARCH DIRECTIONS
We have tested how word substitutions within textual communication can be detected. Our technique allows us to automatically flag suspicious messages so that they can be further investigated, either by more sophisticated data-mining techniques or manually. Since our objective was to investigate the accuracy of detection, we were not concerned with real-time performance. The complete run of our test set took several hours, with querying the underlying search engines being the bottleneck. In practice this can be resolved by 1) parallelizing across multiple systems or 2) obtaining direct access to a search engine index. For further work, it will also be interesting to investigate how correlations between substitutions can be exploited to increase accuracy, and even to guess which original words were obfuscated.

5. REFERENCES
[1] Bilmes, J.A. and Kirchhoff, K. Factored language models and generalized parallel backoff. In Proceedings of HLT/NAACL, 2003.
[2] British National Corpus (BNC), 2004. www.natcorp.ox.ac.uk.
[3] Lee, H. and Ng, A.Y. Spam deobfuscation using a Hidden Markov Model. In Proceedings of the Second Conference on Email and Anti-Spam, 2005.
[4] Fong, S., Skillicorn, D., and Roussinov, D. Measures to detect word substitution in intercepted communication. In IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006. Springer-Verlag Lecture Notes in Computer Science, LNCS 3975.
[5] Roussinov, D., Zhao, L., and Fan, W. Mining context specific similarity relationships using the World Wide Web. In Proceedings of the 2005 Conference on Human Language Technologies, 2005.
[6] Skillicorn, D.B. Beyond keyword filtering for message and conversation detection. In IEEE International Conference on Intelligence and Security Informatics (ISI 2005), pages 231-243. Springer-Verlag Lecture Notes in Computer Science, LNCS 3495, May 2005.
[7] Turney, P. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML-2001.
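For concreteness, the two-component decision rule and the HMM-style metric described in Section 2 can be sketched as follows. This is an illustrative sketch only: the `count` helper, its toy numbers, and the threshold values are hypothetical and not taken from the paper.

```python
# Sketch of the oddity decision rule from Section 2:
# flag the target as odd iff (max PMI > 0 and max PMI < threshold)
# OR (max PMI = 0 and min PMIu < threshold).

def count(phrase: str) -> int:
    """Hypothetical stand-in for a Web hit-count query (toy numbers)."""
    toy = {"flower": 500, "process": 800, "flower process": 0}
    return toy.get(phrase, 0)

def is_odd(target, regions, threshold):
    """Apply the two-component rule over regions that follow the target."""
    max_pmi = 0.0
    min_pmi_u = float("inf")
    for region in regions:
        denom = count(target) * count(region)
        if denom == 0:
            continue  # no frequency data for target or region
        joint = count(f"{target} {region}")
        if joint > 0:
            max_pmi = max(max_pmi, joint / denom)
        else:
            # f(phrase) = 0: expected occurrences < 1, so 1 is an
            # upper bound on the joint count (hence on PMI).
            min_pmi_u = min(min_pmi_u, 1.0 / denom)
    if max_pmi > 0:
        return max_pmi < threshold
    return min_pmi_u < threshold

def hmm_score(left_region: str, target: str) -> float:
    """HMM-style metric: f(left region + target) / f(left region)."""
    left = count(left_region)
    return count(f"{left_region} {target}") / left if left else 0.0
```

With the toy counts, the substituted bigram "flower process" never occurs, so only the upper-bound branch (min PMIu) fires, which is exactly the case the "max PMI = 0" component of the rule handles.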