Assignment 2: Translation Probabilities
Due: Thursday 3/3 (11:59pm)
Machine translation is the general term for attempt to turn text in one language (e.g. English) into another (e.g. French). Statistical machine translation uses large collections of parallel documents to learn statistical rules about how to turn one language into another.
This course is not about machine translation, but this assignment will compute probabilities that are used in machine translation systems. If you're interested in learning more, you can take a look at these slides; we're going to be doing the M-step of IBM Model One (if that's gibberish to you, that's okay; we'll explain exactly what we mean).
Lexical translation probabilities
Translation systems are built on top of parallel texts, where you have the exact same text written in multiple languages. An example of such a corpus are the transcripts of the European Union parliament, which by law have to be translated in many languages.
Machine translation systems need to learn the probability of translating one word to another. We can learn the probability of translations by looking at how many times a word in the source language is translated into various words in the target language.
In the files provided here, we have the English words aligned with French words in the Europarl corpus. (You might be asking how we get the alignments - it's a chicken and egg problem that we're ignoring for the purpos of this assignment).
What the data look like
Each record in the data located in /umd-lin/jbg/data/europarl/ looks like:
englishword1::frenchword1 englishword2::frenchword2 englishword3::frenchword3
as you can see, the alignment is already done for you. Your goal is to compute the probability of computing an English word into every French word.
To complete this assignment, you will need to:
- Fill in the missing pieces of the two classes:
- (Optional, but encouraged) Make sure that these classes pass the unit tests located in the same directory. The tests may make slightly different assumptions than you did; that's okay. You don't have to pass all the tests, you may want to change the tests to match your assumptions and make sure you pass those tests. It's much easier to debug on unit tests than on the cluster.
- Write a driver to set the mapper and reducer. Consider if you need other classes (e.g. partitioner, combiner). Run the algorithm locally on a small dataset to make sure it works.
- Run the algorithm on the Europarl dataset. It should take no longer than 10 minutes; if it takes longer than that, you've likely done something inefficient.
When you're done with these, answer the following questions about your results:
- How many unique English words are there?
- For the English word "house," how many French words is it translated into?
- What is the probability of "house" being translated as:
- What are examples of non-identical, unambiguous words that appear more than once? (e.g. "whites" is translated as "blancs" every time it appears, so "whites" is an example. "airways" is always translated as "airways", but it's the same in English and French, so it is not an example.)
- What's an example of a word that is translated a different way every time it appears? (e.g. "hogwash" (nonsense) is translated as both "aberration" and "transparente" the two times it appears).
This assignment is due by 11:59pm, Thursday 3/3. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 2]" as the subject. In the body of the email put answers to the questions above and attach the any source code that you modified or created.
Note: The Google/IBM cluster is a shared resource accessible by many. Any impropriety on the cluster will be taken very seriously. This includes tampering or attempting to tamper with another student's results, attempting to pass another student's result as one's own, etc. See the Code of Academic Integrity or the Student Honor Council for more information.
- Don't ignore the unit tests. They aren't just busywork.
- Think about what keys go to which reducers and in what order. Build checks (e.g. by throwing an assumption) that enforce your assumptions to make sure they hold.
- Your output should be probabilities and should have all the properties of probabilities (e.g. non-negative, sum to 1). If they don't, you have problems.
- The "order inversion" paradigm will likely be helpful.
- For the questions, there's no reason you can't write seperate programs to calculate the answers.