Invited Applications Paper

Climbing the Tower of Babel: Unsupervised Multilingual Learning

Benjamin Snyder bsnyder@csail.mit.edu
Regina Barzilay regina@csail.mit.edu
MIT Computer Science & Artificial Intelligence Lab, 32 Vassar Street, Cambridge, MA 02139 USA

Abstract

For centuries, scholars have explored the deep links among human languages. In this paper, we present a class of probabilistic models that use these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for the computational decipherment of lost languages.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Overview

Electronic text is currently being produced at a vast and unprecedented scale across the languages of the world. Natural Language Processing (NLP) holds out the promise of automatically analyzing this growing body of text. However, over the last several decades, NLP research efforts have focused on the English language, often neglecting the thousands of other languages of the world (Bender, 2009). Most of these languages are currently beyond the reach of NLP technology due to several factors. One of these is simply the lack of the kinds of hand-annotated linguistic resources that have helped propel the performance of English-language systems. For complex tasks of linguistic analysis, hand-annotated corpora can be prohibitively time-consuming and expensive to produce. For example, the most widely used annotated corpus in the English language, the Penn Treebank (Marcus et al., 1994), took years for a team of professional linguists to produce. It is unrealistic to expect such resources to ever exist for the majority of the world's languages.

Another difficulty for multilingual NLP is that languages exhibit wide variation in their underlying linguistic structure. A model that has been developed for one language may not account for the kinds of structure found in others. In fact, there exists an entire academic discipline devoted to studying and describing systematic cross-lingual variations in language structure, known as linguistic typology (Comrie, 1989).

At first glance, it may seem that linguistic diversity would make developing intelligent text-processing tools for the world's languages a very daunting task. However, we argue that it is in fact possible to harness systematic linguistic diversity and use it to our advantage, utilizing a framework which we call multilingual learning. The goal of this enterprise is two-fold:

· To induce more accurate models of individual language structure without any human annotation.

· To induce accurate models of the relationships between languages.

The multilingual learning framework is based on the hypothesis that cross-lingual variations in linguistic structure correspond to variations in ambiguity. As an example, consider the syntactically ambiguous English sentence: "I ate pasta with cheese." The prepositional phrase "with cheese" can be interpreted as attaching to the noun "pasta" (meaning the pasta had cheese), or as attaching to the verb "ate" (meaning, perhaps, that the pasta was eaten by means of a cheese-based utensil). As humans, we know that the first of these is the only plausible interpretation, but there is nothing in the sentence itself to indicate the correct parse. In contrast, the parallel sentence in Japanese uses an explicit genitive marker to mark the fact that the word for "pasta" is being modified. This example is an instance of a more general phenomenon: what one language leaves implicit, and thus ambiguous for computers or humans, another will express directly through overt linguistic forms. In the framework of multilingual learning, we treat these variations as a form of naturally occurring supervision.
2. Part-of-Speech Tagging

... We find that the number of induced values ranges from 11 (for a pair of languages) to 17 (for eight languages).

We evaluate our model on a parallel corpus of eight languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Serbian, and Slovene (Erjavec, 2004). We perform inference using Markov chain Monte Carlo sampling, and always test on held-out monolingual data for each language. We ran our inference algorithm over all 255 (= 2^8 − 1) non-empty subsets of the eight languages in our corpus, so that we could examine the average change in performance as the number of languages increases.

In the monolingual scenario, our model reduces to the Bayesian HMM of Goldwater & Griffiths (2007). When a complete part-of-speech dictionary¹ is available and our model is trained using eight languages, average tag prediction accuracy increases from 91.1% for monolingual models to 95%. In more realistic cases, where the tag dictionary is restricted to only frequently occurring words, we see even larger gaps between monolingual and multilingual performance. In one such scenario, where dictionary entries are available only for words occurring more than five times in the corpus, average multilingual performance increases to 82.8% from the monolingual baseline of 74.8%. As seen in Figure 3, accuracy increases steadily as languages are added to the mix.

¹ I.e., entries indicating the set of potential parts-of-speech for each word.
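In the monolingual special case, then, learning amounts to posterior inference in a Dirichlet-smoothed HMM, which can be carried out by collapsed Gibbs sampling. The following is a minimal sketch of that monolingual baseline in the style of Goldwater & Griffiths (2007), not of the multilingual model itself, and it omits the tag-dictionary constraint; the toy corpus, the tag-set size, and the hyperparameter values are our own illustrative assumptions.

```python
# Minimal collapsed Gibbs sampler for a Bayesian HMM tagger in the style of
# Goldwater & Griffiths (2007). The corpus, tag count, and hyperparameters
# below are toy assumptions for illustration only.
import random
from collections import defaultdict

random.seed(0)
ALPHA, BETA = 0.1, 0.1     # symmetric Dirichlet priors: transitions, emissions
T = 3                      # size of the latent tag set (assumed)
BOUND = T                  # boundary state serving as both BOS and EOS
STATES = T + 1             # possible transition targets: T tags + boundary

corpus = [["the", "dog", "runs"], ["the", "cat", "sleeps"]]
V = len({w for s in corpus for w in s})

tags = [[random.randrange(T) for _ in s] for s in corpus]
trans = defaultdict(int)   # trans[(t_prev, t_next)]
emit = defaultdict(int)    # emit[(tag, word)]
n_out = defaultdict(int)   # outgoing transitions from each state
n_tok = defaultdict(int)   # tokens emitted by each tag

for s, ts in zip(corpus, tags):            # collect initial counts
    prev = BOUND
    for w, t in zip(s, ts):
        trans[(prev, t)] += 1; n_out[prev] += 1
        emit[(t, w)] += 1; n_tok[t] += 1
        prev = t
    trans[(prev, BOUND)] += 1; n_out[prev] += 1

for _ in range(100):                       # Gibbs sweeps
    for s, ts in zip(corpus, tags):
        for i, w in enumerate(s):
            l = ts[i - 1] if i > 0 else BOUND
            r = ts[i + 1] if i + 1 < len(s) else BOUND
            old = ts[i]                    # remove token i's counts
            trans[(l, old)] -= 1; trans[(old, r)] -= 1
            n_out[l] -= 1; n_out[old] -= 1
            emit[(old, w)] -= 1; n_tok[old] -= 1
            weights = []
            for t in range(T):
                same = 1 if l == t else 0  # count corrections when l == t (== r)
                p = (emit[(t, w)] + BETA) / (n_tok[t] + V * BETA)
                p *= (trans[(l, t)] + ALPHA) / (n_out[l] + STATES * ALPHA)
                p *= (trans[(t, r)] + ALPHA + same * (t == r)) \
                     / (n_out[t] + STATES * ALPHA + same)
                weights.append(p)
            new = random.choices(range(T), weights)[0]
            ts[i] = new                    # add the new tag's counts back
            trans[(l, new)] += 1; trans[(new, r)] += 1
            n_out[l] += 1; n_out[new] += 1
            emit[(new, w)] += 1; n_tok[new] += 1

print(tags)
```

In the multilingual model, the sampled tags of aligned words in different languages additionally interact through cross-lingual parameters, which this monolingual sketch leaves out.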
3. Morphological Segmentation

In the task of morphological analysis, the goal is to segment words into morphemes, the smallest units of meaning (e.g., "misunderstanding" segments into three morphemes: "mis understand ing"). While the morphology of English is fairly simple, many languages exhibit a richer and more productive set of morphological patterns. In the unsupervised setting, morphological segmentation consists of finding recurrent prefix and suffix patterns that allow a more compact representation of the many possible derived word forms.

Our multilingual model for this task automatically induces a segmentation and morpheme alignment from a multilingual (unannotated) corpus of short parallel phrases. For example, given parallel phrases meaning in my land in English, Arabic, Hebrew, and Aramaic, we wish to segment and align morphemes as shown in Figure 4.

[Figure 4. Morphological segmentation and alignment of the phrase "in my land" in English, Arabic, Hebrew, and Aramaic.]

This example illustrates the potential benefits of unsupervised multilingual morphological analysis. The three Semitic languages use cognates (words derived from a common ancestor) to represent the word land. They also use an identical suffix (-y) to represent the first person possessive pronoun (my). These similarities in form should guide the model by constraining the space of joint segmentations and alignments. The corresponding English phrase lacks this resemblance to its Semitic counterparts. However, in this as in many cases, no segmentation is required for English, as all the morphemes are expressed as individual words. For this reason, English should provide a strong source of disambiguation for highly inflected languages, such as Arabic and Hebrew. More generally speaking, our model exploits the fact that each language distributes morphemes across words in a unique pattern. Note that morphemes expressed in one language often have no counterpart at all in some other languages, so morphemes must be allowed to remain unaligned.

The technical difficulty, when compared to the part-of-speech model of Section 2, is that the units of alignment now depend on the results of the model's segmentation predictions. Whereas before we could treat word-level alignments as fixed and observed (the result of preprocessing with standard NLP word-alignment tools), we must now fold alignment uncertainty into the morphology model itself.

We start with a sketch of the probabilistic process posited by our model for the generation of short bilingual phrases (see Figure 5 for an accompanying example). First, the numbers of unaligned language-specific morphemes (m and n) and the number of aligned morpheme pairs (k) are drawn from a Poisson distribution. These are the morphemes that will ultimately compose the bilingual parallel phrase. Next, the morphemes are drawn from the appropriate distributions: m and n morphemes are respectively drawn from language-specific morpheme distributions E and F, and k bilingual morpheme pairs are drawn from A. The resulting morphemes for each language are finally ordered and fused into words. As in the previous section, the scope of cross-lingual connections (now in the form of aligned morpheme pairs) is not known a priori. Indeed, even the number of morphemes in each language is not known in advance.
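To make this generative story concrete, here is a toy forward sampler for a single bilingual phrase pair. The Poisson rates, the morpheme inventories standing in for E, F, and A, and the ordering and fusion rules are all illustrative assumptions; in the actual model these distributions are learned from the parallel corpus rather than fixed by hand.

```python
# Toy forward sampler for the bilingual morphology model's generative story:
# draw morpheme counts (m, n, k) from Poisson distributions, draw morphemes
# from E, F, and the bilingual pair distribution A, then order and fuse them
# into words. All rates and inventories below are illustrative assumptions.
import math
import random

random.seed(1)

def poisson(rate):
    """Knuth's method; adequate for a toy sketch."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

E = ["fy", "w"]                      # language-1-only morphemes (assumed)
F = ["in", "and"]                    # language-2-only morphemes (assumed)
A = [("ard", "land"), ("y", "my")]   # aligned bilingual morpheme pairs (assumed)

m, n, k = poisson(0.5), poisson(0.5), poisson(2.0)
lang1 = [random.choice(E) for _ in range(m)]
lang2 = [random.choice(F) for _ in range(n)]
for _ in range(k):                   # aligned pairs contribute to both sides
    a, b = random.choice(A)
    lang1.append(a)
    lang2.append(b)

random.shuffle(lang1)                # "ordered": here, an arbitrary order
random.shuffle(lang2)
word1 = "-".join(lang1)              # "fused": morphemes joined into one word
phrase2 = " ".join(lang2)            # English-like: one morpheme per word
print(word1, "|||", phrase2)
```

Inference inverts this process: given only the surface phrases, the model must recover the segmentations and alignments that best explain the parallel data.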
4. Syntactic Parsing

... through a similar comparison to the English sentence. However, even in this simplest of sentence pairs, we notice syntactic divergence. While the English sentence uses the simple transitive verb "climbed" to express the fact that John completed his climb of Everest, the verb in the Hindi/Urdu sentence takes the postpositional argument "Everest on." The syntactic divergence in real-life examples becomes only more severe.

The key challenge, then, is representational. We need to parse both sentences with possibly quite divergent trees, while recognizing shared syntactic structure. In effect, we seek to produce two loosely bound trees: node-to-node alignments need only be used where repeated bilingual patterns can be discerned in the data.

We achieve this loose binding of trees by adapting unordered tree alignment (Jiang et al., 1995) to a probabilistic setting. Under this formalism, any two trees can be aligned using an alignment tree. The alignment tree embeds the original two trees within it: each node is labeled by a pair (x, y), (λ, y), or (x, λ), where x is a node from the first tree, y is a node from the second tree, and λ is an empty space. The individual structure of each tree must be preserved under the embedding, with the exception of sibling order (to allow variations in phrase and word order).

The flexibility of this formalism can be demonstrated by two extreme cases: (1) an alignment between two trees may actually align none of their individual nodes, instead inserting an empty space for each of the original two trees' nodes; (2) if the original trees are isomorphic to one another, the alignment may match their nodes exactly, without inserting any empty spaces. See Figure 6 for an example.

[Figure 6. A pair of trees (i) and two possible alignment trees. In (ii), no empty spaces are inserted, but the order of one of the original tree's siblings has been reversed. In (iii), only two pairs of nodes have been aligned (indicated by arrows) and many empty spaces inserted.]

An additional benefit of this formalism is computational: the marginalized probability over all possible alignments of any two trees can be efficiently computed with a dynamic program, in time bilinear in the sizes of the two trees.

We formulated a generative Bayesian model which seeks to explain sentence- and word-aligned parallel sentences through a combination of bilingual and monolingual syntactic parameters. Our model views each bilingual pair of sentences as having been probabilistically generated as follows: First, an alignment tree is drawn uniformly from the set of all such trees. This alignment tree specifies the structure of each of the two individual trees, as well as which pairs of nodes are aligned and which are not (i.e., paired with a λ). For each pair of aligned nodes, a corresponding pair of sentence constituents is jointly drawn from a bilingual distribution. For each unaligned node (i.e., a node paired with a λ in the alignment tree), a single sentence constituent is drawn, in this case from a language-specific distribution. Finally, word-level alignments are drawn based on the structure of the alignment tree.

To perform inference under this model, we use a Metropolis-Hastings within-Gibbs sampler. We sample pairs of trees and then compute marginalized probabilities over all possible alignments using dynamic programming.

We tested the effectiveness of our bilingual grammar induction model on three corpora of parallel text: English-Korean, English-Urdu, and English-Chinese. The model is trained using bilingual data with automatically induced word-level alignments, but is tested on purely monolingual data for each language. In all cases, our model outperforms a state-of-the-art baseline, the Constituent Context Model (CCM) (Klein & Manning, 2002), sometimes by substantial margins. On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error ...
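As a concrete, highly simplified illustration of this generative story, the sketch below draws a toy sentence pair from a single hand-built alignment tree. The tree shape, the constituent inventories, and the Hindi-like tokens are all invented for illustration; in the actual model the alignment tree is drawn uniformly at random, the distributions are learned, and inference marginalizes over all alignments with the bilinear-time dynamic program described above.

```python
# Toy forward sampler for the bilingual grammar model's generative story.
# A hand-built alignment tree pairs nodes of two parse trees; aligned leaves
# jointly draw a constituent pair from a bilingual distribution, while
# unaligned (lambda-paired) leaves draw from one language's own distribution.
# The tree and all inventories below are invented for illustration.
import random

random.seed(2)

BILINGUAL = [("John", "John"), ("climbed", "chadha"), ("Everest", "Everest")]
MONO_1 = ["the"]        # constituents that exist only in language 1 (assumed)
MONO_2 = ["par"]        # language-2-only, e.g. the postposition "on" (assumed)

# Alignment tree: (mark, children); mark is "pair", "left", or "right".
ALIGN_TREE = ("pair", [
    ("pair", []),               # subject: aligned in both trees
    ("pair", [
        ("pair", []),           # verb: aligned
        ("right", []),          # postposition: present only in language 2
    ]),
])

def sample(node, yield1, yield2):
    mark, children = node
    if not children:            # leaf: emit constituent(s)
        if mark == "pair":      # aligned: joint draw from the bilingual dist.
            w1, w2 = random.choice(BILINGUAL)
            yield1.append(w1)
            yield2.append(w2)
        elif mark == "left":    # lambda-paired: draw from language 1 only
            yield1.append(random.choice(MONO_1))
        else:                   # lambda-paired: draw from language 2 only
            yield2.append(random.choice(MONO_2))
    for child in children:
        sample(child, yield1, yield2)

y1, y2 = [], []
sample(ALIGN_TREE, y1, y2)
print(" ".join(y1), "|||", " ".join(y2))
```

The "right"-marked leaf mirrors the "Everest on" divergence discussed above: the postposition has no counterpart in the English tree, so it is generated from the second language's distribution alone.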
6. Conclusions

... monolingual test data. We believe this to be a realistic scenario for a large number of the world's languages, as parallel texts are widely available. Finally, in Section 5, we considered the special case of lost language decipherment, where parallel text is not present, but information about a closely related language is available.

For future work, we pose the following two questions: (i) Can multilingual learning be used to triangulate the information content of sentences in multiple languages? (ii) Can knowledge of linguistic typology (and universal features of language) be used to induce more accurate unsupervised models, even without the use of parallel text?

References

Bender, Emily M. Linguistically naïve != language independent: why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pp. 26–32, 2009.

Charniak, Eugene and Carroll, Glen. Two experiments on learning probabilistic dependency grammars from corpora. In Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques, pp. 1–13, 1992.

Comrie, Bernard. Language Universals and Linguistic Typology: Syntax and Morphology. Oxford: Blackwell, 1989.

Erjavec, T. MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In Fourth International Conference on Language Resources and Evaluation (LREC), volume 4, pp. 1535–1538, 2004.

Ferguson, T. S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the ACL, pp. 744–751, 2007.

Jiang, T., Wang, L., and Zhang, K. Alignment of trees – an alternative to tree edit. Theoretical Computer Science, 143(1):137–148, 1995.

Klein, Dan and Manning, Christopher D. A generative constituent-context model for improved grammar induction. In Proceedings of the ACL, pp. 128–135, 2002.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1994.

Merialdo, Bernard. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994.

Naseem, Tahira, Snyder, Benjamin, Eisenstein, Jacob, and Barzilay, Regina. Multilingual part-of-speech tagging: two unsupervised approaches. Journal of Artificial Intelligence Research, 36(1):341–385, 2009.

Och, Franz Josef and Ney, Hermann. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.

Robinson, Andrew. Lost Languages: The Enigma of the World's Undeciphered Scripts. McGraw-Hill, 2002.

Snyder, Benjamin and Barzilay, Regina. Unsupervised multilingual learning for morphological segmentation. In Proceedings of the ACL/HLT, pp. 737–745, 2008a.

Snyder, Benjamin and Barzilay, Regina. Cross-lingual propagation for morphological analysis. In Proceedings of the AAAI, pp. 848–854, 2008b.

Snyder, Benjamin, Naseem, Tahira, Eisenstein, Jacob, and Barzilay, Regina. Unsupervised multilingual learning for POS tagging. In Proceedings of EMNLP, pp. 1041–1050, 2008.

Snyder, Benjamin, Naseem, Tahira, and Barzilay, Regina. Unsupervised multilingual grammar induction. In Proceedings of the ACL, pp. 73–81, 2009a.

Snyder, Benjamin, Naseem, Tahira, Eisenstein, Jacob, and Barzilay, Regina. Adding more languages improves unsupervised multilingual part-of-speech tagging: a Bayesian non-parametric approach. In Proceedings of the NAACL, pp. 83–91, 2009b.