Swordfish: An Unsupervised Ngram Based Approach to Morphological Analysis Chris Jordan Dalhousie University 6050 University Ave. Halifax, NS, Canada John Healy Dalhousie University 6050 University Ave. Halifax, NS, Canada Vlado Keselj Dalhousie University 6050 University Ave. Halifax, NS, Canada cjordan@cs.dal.ca ABSTRACT healy@cs.dal.ca vlado@cs.dal.ca Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited numb er of affixes and are language dep endent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the p ossible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can b e simply impractical. We prop ose an unsup ervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, p ossible morphemes are identified. We look at two p ossible methods for identifying candidate morphemes, one using joint probabilities b etween two ngrams, and the second based on log odds b etween prefix probabilities. Initial results indicate the joint probability approach to b e b etter for English while the prefix ratio approach is b etter for Finnish and Turkish. Categories and Subject Descriptors I.2.7 [ARTIFICIAL INTELLIGENCE]: Natural Language Processing--Language parsing and understanding content. Morphemes can also b e used in sp eech recognition due to a generally strong relation b etween how words are pronounced and sp elt. Performing morphological analysis however is not always a simple task. Rule based approaches such as Porter's algorithm [1] have exp erienced some success in extracting the root morphemes from words however are limited in their abilities by the numb er of rules they have. Highly inflective or comp ounding languages, where words can have many different affixes, can require rule based approaches to have a vast numb er of rules to accommodate all the p ossible word forms. This requirement can make them unreasonable to develop or maintain. An unsup ervised approach to morphological analysis is attractive for such languages. An ideal unsup ervised algorithm needs no training data or user input making it language indep endent. Swordfish is one such algorithm that we have develop ed and present here. Using a language model for the ngrams in a corpus we are able to extract candidate morphemes from words it contains. It should b e noted that a given algorithm's p erformance will vary dep ending up on how suited its implicit assumptions are to the language. This is clearly illustrated in Table 1. General Terms Algorithms, Languages 2. PREVIOUS WORK There has b een some work done in the area of unsup ervised morphological analysis. Linguistica [3] and Morfessor [2] are two such approaches. Linguistica discovers morphemes by constructing signatures which are templates for likely word splits. Morfessor looks for morphemes by first considering the entire vocabulary as the set of p ossible morphemes and then recursively adds new p ossible morphemes to the lexicon by finding the most likely splits in terms. Swordfish, the approach presented here, builds a lexicon of all p ossible morphemes based on ngrams and then attempts to determine the most likely term splits. Keywords Ngrams, Morphemes 1. INTRODUCTION The basic units of meaning in a word are referred to as morphemes. Morphological analysis is the action of extracting morphemes from words. For example, the morphemes in the word "pretested" are: pre+test+ed. Identifying the morphemes that a word is comp osed of is valuable as it can help determine which words have semantically similar meanings. The practice of stemming, the identification of the root morpheme or stem in a word, is p opular in document retrieval. Using morphemes instead of complete words as indexing terms helps in decreasing vector space dimensionality for document retrieval, document clustering, and classification, while attempting to preserve an equivalent semantic Copyright is held by the author/owner(s). SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008. 3. SWORDFISH There are two general steps in the Swordfish algorithm. The first step is to build a lexicon of ngrams from the corpus. This step is the most time consuming part of the algorithm. We use a modified Yamamoto­Church algorithm [5] that deals with word lists instead of regular text due to the nature of the data sets for the 2005 PASCAL Challenge Workshop on Unsup ervised segmentation [4]. We consider ngrams of all lengths, (i.e., all p ossible word substrings) and collect their corresp onding frequencies in the corpus. Using these ngram frequencies, a probabilistic model can b e 657 constructed using maximum likelihood estimates (MLE) as shown in Equation 1. freq(ngrami ) P (ngrami ) = P j freq(ngramj ) (1) Language English Algorithm Baseline Morfessor1.0 SwordfishJP Swordfish­ Prefix Baseline Morfessor1.0 SwordfishJP Swordfish­ Prefix Baseline Morfessor1.0 SwordfishJP Swordfish­ Prefix Precision 14.56 74.19 55.11 22.60 19.71 83.66 70.39 35.54 25.87 76.33 58.90 36.50 Recall 100.00 26.64 37.45 70.45 100.00 29.51 23.58 56.87 100.00 24.17 16.79 52.15 where P (ngrami ) is the probability of ngrami in the corpus probability model, calculated over all ngram sizes. This differs from the standard method of calculating ngram probabilities over a fixed size. The second step is to use this model to identify the most probable splits in a word. The resulting substrings are considered to b e the morphemes which a word is comprised of. There are multiple methods for identifying p ossible word splits. In this work we present two. The first is a joint probability approach which we will refer to as SwordfishJP. Motivated by the argument that groups of letters forming morphemes should app ear more indep endently than two arbitrary word parts, we use our model to find the highest joint probability of two complementary word parts. Under an assumption of indep endence, this gives us: (x , y ) F measure 25.42 39.20 44.60 34.22 32.92 43.63 35.33 43.74 41.11 36.71 26.13 42.95 Finnish Turkish Table 1: Comparison of algorithms We b elieve there to b e significant p otential in using ngram models as the cornerstone for p erforming unsup ervised morphological analysis. How to b est use the ngram probabilities in determining where word splits should occur requires further research. An area for p otential improvement is in constructing a morpheme lexicon. Currently, there is a significant amount of noise in the lexicon. Filtering out ngrams which are not morphemes may lead to b etter p erformance. It is imp ortant to note that the two different methods for determining where word splits occur presented in this work resulted in p erformance scores that varied across languages. This puts into question whether there exists an ideal language indep endent unsup ervised approach or if there exists different classes of languages which require different classes of algorithms. = arg max P (x)P (y ) xy = t (2) The second approach, referred to as SwordfishPrefix, examines the log odds b etween the probabilities of adjacent prefixes. In other words, we are looking at the fraction by which the probability decreases as the prefix for a term is b eing extended. The splitting condition is expressed by the log odds ratio in Equation 3. If the value calculated is larger than , a word split occurs and the smaller prefix b ecomes a candidate morpheme for the term. The remaining suffix is then treated as a term and is recursively examined for p ossible splits. For the purp oses of this work = 2.0 was used. This value was decided up on through a numb er of exp erimental runs. < arg max(log P (prefixn ) - log P (prefixn-1 )) n (3) where prefixn contains n characters. 6. ACKNOWLEDGMENTS This research was funded by the Natural Sciences and Engineering Research Council of Canada, NSERC, and Sun Microsystems. 4. RESULTS The 2005 PASCAL Challenge Workshop on Unsup ervised segmentation of words into morphemes [4] was the original motivation for Swordfish. This workshop provided participants with three corp ora each in a different language: English, Finnish, and Turkish. As well, a sample gold standard, a set of correctly morphological analyzed word forms, for each language and an evaluation script were given to aid in the development process. We rep ort here the precision, recall, and F­measure as calculated by this evaluation script over the sample gold standards in Table 1 for Swordfish, Morfessor 1.0 with no p erformance tuning parameters set, and a baseline run which shows the results of placing a split b etween every character. From initial insp ection based on the F­measure, we can see that SwordfishJP is the top p erformer for English, Morfessor 1.0 and SwordfishPrefix are for Finnish, and SwordfishPrefix is for Turkish. 7. REFERENCES [1] Baeza-Yates R. and Rib eiro-Neto B.: Modern Information Retrieval. Addison-Wesley 1999. [2] Creutz M. and Lagus K.: Unsup ervised Morpheme Segmentation and Morphology Induction from Text Corp ora Using Morfessor 1.0. Tech. rep. A81, Helsinki University of Technology, 2005. [3] Goldsmith J.: Unsup ervised learning of the morphology of a natural language. Comput. Linguist. 27 (2). 2001. 153­198 [4] Unsup ervised Segmentation of Words into Morphemes Challenge 2005. www.cis.hut.fi/morphochallenge2005. [5] Yamamoto M. and Church K.W.: Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput. Linguist. 27 (1). 2001. 1­30 5. CONCLUSIONS Swordfish is a purely unsup ervised ngram based approach to morphological analysis. It recursively splits words using ngram probabilities. As such, a ngram lexicon is constructed from the corpus where all p ossible morphemes are extracted. 658