SIGIR 2007 Proceedings Poster

A Generic Framework for Machine Transliteration

A Kumaran    Tobias Kellner
Multilingual Systems Research
Microsoft Research India
Bangalore, INDIA
a.kumaran@microsoft.com

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: CLIR, Machine Transliteration

General Terms
Algorithms

Keywords
Machine Transliteration, Cross-language Information Retrieval

1. INTRODUCTION

Machine transliteration deals with the conversion of text strings from one orthography to another, while preserving the phonetics of the strings in the two languages. Transliteration is an important problem in machine translation and cross-lingual information retrieval, as most proper names and generic iconic terms are out-of-vocabulary words and therefore need to be transliterated. Numerous methods have been explored in the literature for machine transliteration, ranging from rule-based techniques to statistical learning techniques. Here we focus our attention on language-independent techniques that can potentially scale well to a large number of languages.

In this paper, we present a modular, statistical learning framework that lends itself to easy experimentation with transliteration tasks between a variety of different languages, in a language-independent manner. The workbench includes a variety of components (algorithms, data sets and transliteration scripts) for quick assembly of an effective transliteration system across languages. We believe that such workbenches will be important in an increasingly multilingual world, for building systems that span a number of languages quickly and effectively.

2. TRANSLITERATION FRAMEWORK

We present our machine transliteration framework based on a core algorithm modeled as a noisy channel [3]; the model poses the transliteration problem as a noisy channel in which the source string gets garbled into the target string.
The transliteration is learned by estimating the parameters of the distribution that maximize the likelihood of observing the garbling seen in the training data. Subsequently, given a target language string t, we decode a posteriori the most probable source language string s that gave rise to t, as:

    s* = argmax_s P(s|t) = argmax_s P(t|s) P(s)

The transliteration model P(t|s) is learnt from the training corpus, and P(s) is the language model for the source language strings. We segment the source and target strings as S = {s1 s2 ... sn} and T = {t1 t2 ... tm}, where the si and ti are source and target language segments respectively, and estimate P(t|s) approximately as a product over aligned segment pairs:

    P(t|s) ≈ ∏_i P(ti|si)

Clearly, choosing the right granularity for si and ti automatically, without language-specific tools or information, is a major challenge. Hence, we explore an expectation-maximization approach that exploits the only information we know about the alignment: some prefix (or suffix) of the source string must map to some prefix (or suffix, respectively) of the target string, in each of the paired strings in the training set. We extract substrings of up to a specified length from both ends of each of the paired strings as the initial hypothesis for our segment alignment; the frequencies of such alignments are used to estimate their alignment probabilities. Subsequently, we use the Viterbi algorithm to find the optimal alignment for each of the paired strings using the estimated alignment probabilities from the first step; the segment alignment probabilities are recalculated in each re-training step, based on the counts of segment pairs in the optimal alignments obtained in that step. At each re-training step, a test set is used to compute the transliteration accuracy, and the training is continued until the point when the transliteration accuracy starts decreasing due to over-fitting.
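The Viterbi re-alignment step above can be sketched as a dynamic program over segmentation boundaries. The following is a minimal illustration, not the paper's implementation: the function name `viterbi_align`, the `seg_prob` dictionary of segment-pair probabilities, and the smoothing constant for unseen pairs are all our own assumptions.

```python
import math

def viterbi_align(s, t, seg_prob, max_len=3):
    """Find the most probable segmentation of the pair (s, t) into
    aligned substring pairs, given current segment-pair probabilities.
    seg_prob maps (source_segment, target_segment) -> probability."""
    # best[(i, j)] = (log-prob, backpointer) for aligning s[:i] with t[:j]
    best = {(0, 0): (0.0, None)}
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if (i, j) not in best:
                continue  # state unreachable so far
            base, _ = best[(i, j)]
            # extend by one more aligned segment pair of bounded length
            for di in range(1, max_len + 1):
                for dj in range(1, max_len + 1):
                    if i + di > len(s) or j + dj > len(t):
                        continue
                    pair = (s[i:i + di], t[j:j + dj])
                    p = seg_prob.get(pair, 1e-9)  # smoothing for unseen pairs
                    score = base + math.log(p)
                    cur = best.get((i + di, j + dj))
                    if cur is None or score > cur[0]:
                        best[(i + di, j + dj)] = (score, (i, j, pair))
    # backtrack from the full alignment (len(s), len(t))
    path, state = [], (len(s), len(t))
    while best[state][1] is not None:
        i, j, pair = best[state][1]
        path.append(pair)
        state = (i, j)
    return list(reversed(path))
```

In a re-training loop, the segment pairs on each optimal path would be counted across the corpus and renormalized into a new `seg_prob`, repeating until held-out transliteration accuracy stops improving, as described above.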
The resulting transliteration model is subsequently used for that specific language pair.

Related Work

Our work parallels the language-independent approaches discussed in [5, 1, 4]. These works are primarily based on source-channel models, with results shown on specific language pairs; in some cases the transliteration was modeled through the phonemic domain. Our framework is closest to that presented in [4], where the transliteration task was modeled as a source channel [3] based on the orthographic n-grams of the source and target languages. However, we take an additional step, making the framework modular and scalable, together with a scripting language for modeling training procedures.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

[Figure 1: Transliteration on a Language-pair]

3. IMPLEMENTATION AND RESULTS

We implemented a generic transliteration framework in which different algorithmic modules or any pre- or post-processing routines may be integrated easily. In addition, we defined a scripting language that can be used for defining a transliteration task as a script. The languages to be handled and the relevant resource and data files may be read in dynamically. A script engine executes a user-defined script and, as directed, records the statistics during the run. The quality of a transliteration, measured as the fraction of correctly transliterated strings, is recorded, parameterized by training size, exact or fuzzy matching (the reference and result strings match fuzzily when the edit distance between them is at most 20% of the reference string's length), recall in the top-n most probable transliterated strings, etc.

Single Language Pairs

Figure 1 provides a sample output of a transliteration task from English to Arabic.
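The fuzzy-match criterion above can be made concrete in a few lines. This is a sketch under our reading of the criterion (edit distance at most 20% of the reference length); the function names are ours, not the workbench's.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(result, reference):
    """Result matches fuzzily when the edit distance is within
    20% of the reference string's length."""
    return edit_distance(result, reference) <= 0.2 * len(reference)
```

For example, two romanizations differing in a single character out of eight would match fuzzily, while completely different strings would not.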
The script is written to record the results for various parameters, such as training data size, exact vs. fuzzy matching, top-n results, etc. From a result such as the above, several trends may be observed easily: first, the quality of transliteration is good (about 80% of the English names matched fuzzily with one of the top-10 results, with a training set of about 20,000 name pairs); second, the transliteration quality improves with the size of the training data, but only asymptotically above 5,000 pairs; third, fuzzy matching improves quality significantly, but introduces ambiguity in the results. However, a post-processing module that can resolve the ambiguity may be employed to boost the accuracy of the results (for example, we tried accessing the index of a search engine with fuzzy strings, which resulted in the correct identification of the transliterated string in the vast majority of the cases).

Diverse Set of Languages

Figure 2 provides the results of transliteration experiments on a set of languages, namely English, Arabic, Japanese and two Indic languages (Hindi and Tamil). In this experiment, we experimented with a specific transliteration algorithm (as presented in Section 2) on transliteration tasks from English to each of the four target languages, and in the reverse direction. A template script is used for each task, and the results are generated for the exact and fuzzy matches in the top-10 results. The objective of this experiment is to profile the transliteration performance of a given algorithm on a diverse set of language pairs. The results indicate that the algorithm exhibited a varying degree of quality in each language pair: for example, transliteration accuracy from English to Arabic was 90%, to the Indic languages 70-80%, and to Japanese 42%.
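The top-n accuracy figures reported here can be computed from per-query ranked candidate lists as follows. This is an illustrative sketch, not the workbench's recording code; the function name and parameters are our own, and the `match` predicate can be exact equality or a fuzzy criterion.

```python
def top_n_accuracy(results, references, n=10, match=None):
    """Fraction of test strings whose reference transliteration appears
    among the top-n candidates. `results` is a list of ranked candidate
    lists, one per test string; `match` defaults to exact equality."""
    if match is None:
        match = lambda cand, ref: cand == ref
    hits = sum(
        any(match(cand, ref) for cand in cands[:n])
        for cands, ref in zip(results, references)
    )
    return hits / len(references)
```

Passing a fuzzy predicate as `match` instead of exact equality yields the fuzzy top-n numbers, so one metric function covers both columns of the reported results.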
[Figure 2: Transliteration among a Set of Languages]

Manual analysis indicates that the low quality of transliteration into the Indic languages was due to improper segmentation, and that into Japanese was due to a weakly learned model, as most Japanese strings were concatenated strings (even when the corresponding English string is, say, Sir Isaac Newton). We introduced a pre-processing module to force the alignment boundaries for Indic scripts to a pre-specified set (exploiting the fact that no alignment boundary can occur between a base character and a combining diacritic), which resulted in a significant improvement in the transliteration accuracy. Though such additions are not generic, the workbench allows us to experiment with language-specific modules to boost transliteration accuracy in specific cases.

4. CONCLUSIONS

In this paper we outlined a generic framework for experimenting with transliteration tasks in a modular manner. While our system displays a reasonable quality of transliteration, a shade below the state of the art for specific language pairs, its strength lies in the fact that new languages may be added easily, and an effective transliteration mechanism may be developed quickly by experimentation with reusable components, a scripting language and language-specific modules. We plan to release this system as a workbench for pedagogical purposes, to let users explore various algorithms and training processes in a systematic manner on a wide variety of languages. In addition, it may provide a generic transliteration system for resource-poor languages.

5. REFERENCES

[1] AbdulJaleel, N. and Larkey, L. Statistical transliteration for English-Arabic CLIR. CIKM, 2003.
[2] Al-Onaizan, Y. and Knight, K. Machine transliteration of names in Arabic text. Comp. Approaches for Semitic Languages, 2002.
[3] Brown, P. F. et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
[4] Haizhou, L., Min, Z. and Jian, S. A joint source-channel model for machine transliteration. 42nd Meeting of ACL, 2004.
[5] Virga, P. and Khudanpur, S. Transliteration of proper names in cross-lingual information retrieval. Multiling. and Mixed-lang. Named Entity Recognition, 2003.