SIGIR 2007 Proceedings

Demonstration

Babel: A Machine Transliteration Workbench
A. Kumaran
Multilingual Systems Research, Microsoft Research India, Bangalore, INDIA

Tobias Kellner

a.kumaran@microsoft.com

Categories and Subject Descriptors: H.3.3 [Information Search & Retrieval]: Machine Transliteration, CLIR
General Terms: Algorithms
Keywords: Machine Transliteration, Cross-language Information Retrieval

1. INTRODUCTION

Machine transliteration deals with the conversion of text strings from one orthography to another, while preserving the phonetics of the strings in the two languages. Transliteration is an important problem in machine translation and cross-lingual information retrieval, as most proper names and generic iconic terms are out-of-vocabulary words, and therefore need to be transliterated. In this demo, we present Babel, a transliteration workbench with generic statistical learning algorithms and a scripting engine to model the transliteration process. We demonstrate quick assembly of the necessary components (algorithmic modules and training scripts) for systematic experimentation with transliteration tasks in a given pair of languages.

2. TRANSLITERATION FRAMEWORK
The transliteration problem is modeled as a noisy channel in Babel, as in the popular IBM Source-Channel models [1, 2, 3] for machine translation, with two specific changes: first, it uses graphemes instead of words, and second, it uses monotonic alignment algorithms on the training data. We bootstrap with an initial training pass in which alignment probabilities are estimated from the matching prefixes and suffixes of the paired strings in the training data, using an expectation-maximization approach. Subsequently, we use the Viterbi algorithm to find the optimal alignments and iteratively refine the estimated model. Babel allows different algorithmic modules to be added, and currently includes a scripting engine to model the transliteration tasks. The transliteration quality measures are computed on the top-n most probable transliterations returned by Babel, with respect to parameters such as the training data size and exact vs. fuzzy matching. In addition, user-defined pre- or post-processing routines may be integrated easily for any language-specific tasks. A simple front-end is provided for online transliteration tasks.

In the figure, we show a sample output of a transliteration task from English to Arabic. The results highlight several trends, such as the quality of transliteration with respect to the training size, the effective number of candidate transliterations needed for a reasonable recall, and the effect of allowing fuzzy matches.
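The Viterbi alignment and refinement steps described in this section can be illustrated with a minimal sketch. This is not Babel's implementation, which the paper does not detail. The function names `viterbi_align` and `reestimate`, the parameters `max_len` and `floor`, and the toy probability table are all assumptions made for illustration, and a romanized target string stands in for a real target script. The sketch aligns a source and a target string monotonically into grapheme chunks of one or two characters, then re-estimates chunk probabilities from the best alignments (a hard-EM style update).

```python
import math
from collections import Counter


def viterbi_align(src, tgt, prob, max_len=2, floor=1e-6):
    """Return the best monotonic chunk alignment of src and tgt and its log-probability.

    prob maps (src_chunk, tgt_chunk) to a probability; unseen pairs get `floor`.
    Returns (None, -inf) when no alignment exists under the chunk-length limit.
    """
    n, m = len(src), len(tgt)
    neg_inf = float("-inf")
    # best[i][j] is the best log-probability of aligning src[:i] with tgt[:j].
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            # Extend with a chunk pair; both indices only move forward,
            # which is what makes the alignment monotonic.
            for a in range(1, max_len + 1):
                for b in range(1, max_len + 1):
                    ni, nj = i + a, j + b
                    if ni > n or nj > m:
                        continue
                    pair = (src[i:ni], tgt[j:nj])
                    score = best[i][j] + math.log(prob.get(pair, floor))
                    if score > best[ni][nj]:
                        best[ni][nj] = score
                        back[ni][nj] = (i, j)
    if best[n][m] == neg_inf:
        return None, neg_inf
    # Follow back-pointers from the end to recover the chunk pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs, best[n][m]


def reestimate(corpus, prob):
    """One refinement pass: count chunk pairs on the best alignments and
    renormalize them into P(tgt_chunk | src_chunk)."""
    counts = Counter()
    for src, tgt in corpus:
        pairs, _ = viterbi_align(src, tgt, prob)
        if pairs:
            counts.update(pairs)
    totals = Counter()
    for (s, _t), c in counts.items():
        totals[s] += c
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}


# Toy, made-up probabilities (illustrative only, not taken from the paper).
toy_prob = {("k", "k"): 0.9, ("u", "oo"): 0.6, ("m", "m"): 0.9,
            ("a", "a"): 0.8, ("r", "r"): 0.9}
pairs, log_p = viterbi_align("kumar", "koomar", toy_prob)
refined = reestimate([("kumar", "koomar")], toy_prob)
```

Calling `reestimate` repeatedly, feeding each result back in as `prob`, mirrors the iterative refinement described above. A full system would also need the bootstrapped initial probabilities and a decoder that produces the top-n candidates, which this sketch omits.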

3. CONCLUSION
In this demo, we show Babel, a generic workbench for experimenting with transliteration tasks in a modular manner. In such a framework, an effective transliteration system between a given pair of languages may be put together quickly from existing reusable components.
References

[1] Brown, P. F. et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
[2] Al-Onaizan, Y. and Knight, K. Machine transliteration of names in Arabic text. Comp. Approaches for Semitic Languages, 2002.
[3] Li, H., Zhang, M. and Su, J. A joint source-channel model for machine transliteration. 42nd Meeting of ACL, 2004.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.
