CS 723 Project 3: Multilingual Lexical Semantics

In contrast to P1, where most of the info was here, this doc is now just a schematic and the main details are in the relevant .py files. We're doing parsing :).

The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files (including this description) as a tar archive (this one is kind of big due to the large amount of included data).

Files you'll edit:
  model1.py    Where you'll implement part 1: some word alignment code for model 1.
  wsd.py       Where you'll implement part 2: multilingual word sense disambiguation.

Files you might want to look at:
  wsddata.py   I/O functions for wsd.
  util.py      A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.

Model 1 (50%)

The starting point for multilingual sense disambiguation is good word alignments. Although we'll use better word alignments later for the rest of the project, it's a good exercise to make your own. A partial implementation of model 1 for word alignment (yes, it was eons ago) via EM is in model1.py. You just need to complete the implementation of alignSentencePair (the E step) and addFractionalCounts (the M step). You can test your implementation on simpleTestCorpus by:

>>> ttable = uniformTTableInitialization(simpleTestCorpus)

>>> alignSentencePair(ttable, simpleTestCorpus[0][0], simpleTestCorpus[0][1])
{0: {0: 0.4285714285714286, 1: 0.4285714285714286}, 1: {0: 0.57142857142857151, 1: 0.57142857142857151}}

>>> ttable = runEM(simpleTestCorpus)
initializing ttable
iteration  1
iteration  2
iteration  3
iteration  4
iteration  5

>>> ttable
{'maison': {'blue': 0.040600760360324728, 'house': 0.65471204567223051, 'the': 0.30468719396744476}, 'bleue': {'blue': 0.75282245129000946, 'house': 0.15850320441808033, 'the': 0.088674344291910373}, 'fleur': {'the': 0.21965510839974961, 'flower': 0.78034489160025033}, 'la': {'blue': 0.012053407685164087, 'house': 0.19436855696394276, 'the': 0.77915593275151462, 'flower': 0.014422102599378655}}
Once you've done this, you can try it on a larger dataset (this will take a bit of time):

>>> corpus = readCorpus('Science.en', 'Science.fr', truncateWordLength=5)
>>> ttable = runEM(corpus)
initializing ttable
iteration  1
iteration  2
iteration  3
iteration  4
iteration  5

>>> printTTable(ttable, 'ttable')
The learned ttable is now in the file called 'ttable'. To make things faster and take less memory (and actually often improve alignment results), all words have been truncated to their first five characters, and rare words have been replaced with a special token. Nonetheless, you should be able to find good translations, especially if you sort by probability:
% cat ttable | sort -k3,3gr | head -n25
>       >       0.838112087855
4       4       0.823374681637
14      14      0.820770215094
therm   therm   0.816144226466
m.      m.      0.814681758648
sp.     sp.     0.81435868871
%       %       0.812722919592
+       +       0.810530878559
ampli   ampli   0.810411006822
two     deux    0.808976740271
x       x       0.80830483925
chrom   chrom   0.806986813053
±       ±       0.801427543776
biolo   biolo   0.79999616645
tetra   tétr    0.797754300239
group   group   0.797657487426
weak    faibl   0.796534397246
hydra   hydra   0.795408016543
criti   criti   0.795115143764
molec   molé    0.794800891851
tripl   tripl   0.794478665507
days    jours   0.794338171627
Some of these are "obvious" because of spelling similarities, but remember that model1 doesn't use spelling! Plus others (two/deux, weak/faibl, days/jours) are not from spelling. Good job!!!
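
If the EM loop itself feels abstract, here is a minimal, self-contained sketch of Model 1 on a toy corpus. It is only an illustration under my own assumptions: the function name model1_em_sketch, the toy corpus, and the lack of a NULL word are all made up for this example, and the signatures do not match alignSentencePair or addFractionalCounts in model1.py, so don't expect it to reproduce the numbers above.

```python
from collections import defaultdict


def model1_em_sketch(corpus, iterations=5):
    """Toy Model 1 EM (hypothetical helper, not the model1.py interface).

    corpus is a list of (english_tokens, french_tokens) pairs.
    Returns t where t[f][e] approximates p(e | f).
    """
    # Uniform initialization over word pairs that co-occur in some sentence pair.
    t = defaultdict(dict)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            for e in e_sent:
                t[f][e] = 1.0
    for f, row in t.items():
        n = len(row)
        for e in row:
            row[e] = 1.0 / n

    for _ in range(iterations):
        # Collect expected (fractional) counts under the current table.
        counts = defaultdict(lambda: defaultdict(float))
        for e_sent, f_sent in corpus:
            for e in e_sent:
                # Posterior that each French word generated this English word.
                z = sum(t[f][e] for f in f_sent)
                for f in f_sent:
                    counts[f][e] += t[f][e] / z
        # Renormalize the counts into a new probability table.
        for f, row in counts.items():
            total = sum(row.values())
            t[f] = {e: c / total for e, c in row.items()}
    return t


# Made-up two-sentence corpus, just to exercise the loop.
toy_corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
]
toy_ttable = model1_em_sketch(toy_corpus)
```

After a few iterations, toy_ttable["maison"] should prefer "house" over "the", even though nothing in the code looks at spelling: "the" gets explained away by "la", which appears in both sentences.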

Multilingual WSD (50%)

Your second task is to do translation-sense disambiguation by implementing features that will be useful for this task. If you look at "Science.ambig" you'll see some ambiguous French words for which you need to pick the correct translation given the context. Note that some of these are boring morphological distinctions, while some are interesting WSD distinctions (like alteration being either adjustment or damage, depending on context).

The basic setup is that you will be given a French word in context (the entire document context) from this list of French words and you need to pick the correct translation. To do so, you can define features of the French context that you believe will help disambiguate (for instance, the word "money" in the context of "bank" might help disambiguate what type of bank you're talking about). You can also define features of the English translation (perhaps its suffix, so you know what part of speech you're translating into). Finally, you can define features of the pair of the French word and English word. For instance, string edit distance might be useful to pick up cognates (e.g., ajustement/adjusting or homologue/homologous).
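
To make the pair and English-side feature ideas concrete, here is a small sketch. The function names (edit_distance, pair_features_sketch, english_features_sketch) and the feature names are hypothetical, and the plain name-to-value dictionaries are an assumption about representation; check the simple*Features functions in wsd.py for the interface the provided code actually expects.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def pair_features_sketch(f_word, e_word):
    """Hypothetical features of a (French word, English candidate) pair."""
    longest = max(len(f_word), len(e_word), 1)
    return {
        # Small values suggest a cognate pair.
        "norm_edit_dist": float(edit_distance(f_word, e_word)) / longest,
        "same_prefix3": float(f_word[:3] == e_word[:3]),
    }


def english_features_sketch(e_word):
    """Hypothetical English-side feature: the last three characters."""
    return {"suffix_" + e_word[-3:]: 1.0}
```

For homologue/homologous the distance is 2 (insert "o", change "e" to "s"), so the normalized distance is 0.2. Normalizing by length keeps long cognates from looking worse than short unrelated words.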

To get you started, I have some very stupid features implemented in simpleEFeatures, simpleFFeatures and simplePairFeatures. Before implementing anything else, you should be able to get about 70% accuracy at this task (don't be impressed: you get this more or less just by choosing the most frequent sense, which is actually a VERY HARD baseline to beat!):

% python wsd.py
reading data from  Science.tr
reading data from  Science.de
collecting translation table
generating classification data
executing:  /home/hal/bin/vw -k -c --passes 10 -q st --power_t 0.5 --csoaa_ldf m -d wsd_vw.tr -f wsd_vw.model --quiet
executing:  /home/hal/bin/vw -t -q st --csoaa_ldf m -d wsd_vw.tr -i wsd_vw.model -r wsd_vw.tr.rawpredictions --quiet
executing:  /home/hal/bin/vw -t -q st --csoaa_ldf m -d wsd_vw.te -i wsd_vw.model -r wsd_vw.te.rawpredictions --quiet
training accuracy = 0.705672039753
testing  accuracy = 0.701263393571
In order for this to work, you need to be sure that VW is installed on your system, and that it's pointed to correctly at the top of the file wsddata.py.
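
For intuition about the most-frequent-sense baseline mentioned above, here is a rough sketch that ignores context entirely and always predicts the translation seen most often in training. The function names and the toy (French word, English translation) pairs are made up for illustration; this is not how wsd.py or VW computes the numbers shown.

```python
from collections import Counter, defaultdict


def train_most_frequent_sense(pairs):
    """Map each French word to its most frequent English translation."""
    counts = defaultdict(Counter)
    for f_word, e_word in pairs:
        counts[f_word][e_word] += 1
    return {f: c.most_common(1)[0][0] for f, c in counts.items()}


def accuracy(model, pairs):
    """Fraction of pairs whose translation matches the model's prediction."""
    correct = sum(1 for f_word, e_word in pairs if model.get(f_word) == e_word)
    return float(correct) / len(pairs)


# Made-up toy data: one ambiguous word with two senses.
toy_pairs = [
    ("alteration", "damage"),
    ("alteration", "damage"),
    ("alteration", "adjustment"),
]
mfs_model = train_most_frequent_sense(toy_pairs)
mfs_accuracy = accuracy(mfs_model, toy_pairs)
```

On this toy data the baseline always answers "damage" and gets two of three right. When one sense dominates, context features have to be quite reliable before they improve on that.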

Once you implement the suggested features I gave you, you should do markedly better:

% python wsd.py
training accuracy = 0.863248899123
testing  accuracy = 0.75707660323
30% of your grade on this part is based on being able to replicate that. The rest, as you probably expect by now, is to beat this baseline by coming up with, and implementing, useful features. For every half a percent that you get over my baseline, you'll get one percent toward the remaining 20%. In addition, the top team will get 5%, the second 4%, third 3%, fourth 2% and fifth 1%. If it turns out no one gets full credit, I'll adjust these thresholds in your favor (but never against your favor).

As usual, you should develop on the dev data and then hand in your results on the test data. Implement your fancy features in the complex functions, and then run on Science.te (by changing the main definition) and upload the output stored in wsd_output. You can view your progress on the leaderboard.

Extra credit: If you're feeling adventurous, you can also work on movie subtitles data instead of scientific articles (but don't get mad at me that they swear a lot in that data). You should download the Subtitles data separately, and you can hand in your output on subs for extra credit. This is purely optional. If you beat my baseline by at least 1%, you get 5% extra credit. First place team gets 5% more, second 3% and third 2%. Note that this data is much larger, and the documents are much longer (they are whole movies!) so you might run into scaling problems.