The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files (including this description) as a tar archive (this one is kind of big due to the large amount of included data).
Files you'll edit: | |
model1.py |
Where you'll implement part 1: some word alignment code for model 1. |
wsd.py |
Where you'll implement part 2: multilingual word sense disambiguation |
Files you might want to look at: | |
wsddata.py |
I/O functions for wsd. |
util.py |
A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look! |
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.
>>> ttable = uniformTTableInitialization(simpleTestCorpus) >>> alignSentencePair(ttable, simpleTestCorpus[0][0], simpleTestCorpus[0][1]) {0: {0: 0.4285714285714286, 1: 0.4285714285714286}, 1: {0: 0.57142857142857151, 1: 0.57142857142857151}} >>> ttable = runEM(simpleTestCorpus) initializing ttable iteration 1 iteration 2 iteration 3 iteration 4 iteration 5 >>> ttable {'maison': {'blue': 0.040600760360324728, 'house': 0.65471204567223051, 'the': 0.30468719396744476}, 'bleue': {'blue': 0.75282245129000946, 'house': 0.15850320441808033, 'the': 0.088674344291910373}, 'fleur': {'the': 0.21965510839974961, 'flower': 0.78034489160025033}, 'la': {'blue': 0.012053407685164087, 'house': 0.19436855696394276, 'the': 0.77915593275151462, 'flower': 0.014422102599378655}}Once you've done this, you can try it on a larger dataset (this will take a bit of time):
>>> corpus = readCorpus('Science.en', 'Science.fr', truncateWordLength=5) >>> ttable = runEM(corpus) initializing ttable iteration 1 iteration 2 iteration 3 iteration 4 iteration 5 >>> printTTable(ttable, 'ttable')The learned ttable is now in the file called 'ttable'. To make things faster and take less memory (and actually often improve alignment results), all words have been truncated to their first five characters, and rare words have been replaced with a special token. Nonetheless, you should be able to find good translations, especially if you sort by probability:
% cat ttable | sort -k3,3gr | head -n25 > > 0.838112087855 4 4 0.823374681637 14 14 0.820770215094 therm therm 0.816144226466 m. m. 0.814681758648 sp. sp. 0.81435868871 % % 0.812722919592 + + 0.810530878559 ampli ampli 0.810411006822 two deux 0.808976740271 x x 0.80830483925 chrom chrom 0.806986813053 ± ± 0.801427543776 biolo biolo 0.79999616645 tetra tétr 0.797754300239 group group 0.797657487426 weak faibl 0.796534397246 hydra hydra 0.795408016543 criti criti 0.795115143764 molec molé 0.794800891851 tripl tripl 0.794478665507 days jours 0.794338171627 ...Some of these are "obvious" because of spelling similarities, but remember that model1 doesn't use spelling! Plus others (two/deux, weak/faibl, days/jours) are not from spelling. Good job!!!
% python wsd.py reading data from Science.tr reading data from Science.de collecting translation table generating classification data executing: /home/hal/bin/vw -k -c --passes 10 -q st --power_t 0.5 --csoaa_ldf m -d wsd_vw.tr -f wsd_vw.model --quiet executing: /home/hal/bin/vw -t -q st --csoaa_ldf m -d wsd_vw.tr -i wsd_vw.model -r wsd_vw.tr.rawpredictions --quiet executing: /home/hal/bin/vw -t -q st --csoaa_ldf m -d wsd_vw.te -i wsd_vw.model -r wsd_vw.te.rawpredictions --quiet training accuracy = 0.705672039753 testing accuracy = 0.701263393571In order for this to work, you need to be sure that VW is installed on your system, and that it's pointed to correctly at the top of the file wsddata.py. Once you implement the suggested features I gave you, you should do markedly better:
% python wsd.py ... training accuracy = 0.863248899123 testing accuracy = 0.7570766032330% of your grade on this part is based on being able to replicate that. The rest, as you probably expect by now, is to beat this baseline by coming up with, and implementing, useful features. For every half a percent that you get over my baseline, you'll get one percent toward the remaining 20%. In addition, the top team will get 5%, the second 4%, third 3%, fourth 2% and fifth 1%. If it turns out no one gets full credit, I'll adjust these thresholds in your favor (but never against your favor). As usual, you should develop on the dev data and then hand in your results on the test data. Implement your fancy features in the complex functions, and then run on Science.te (by changing the main definition) and upload the output stored in wsd_output. You can view your progress on the leaderboard. Extra credit: If you're feeling adventurous, you can also work on movie subtitles data instead of scientific articles (but don't get mad at me that they swear a lot in that data). You should download the Subtitles data separately and you can hand in (optionally) your output on subs for extra credit. This is purely optional. If you beat my baseline by at least 1%, you get 5% extra credit. First place team gets 5% more, second 3% and third 2%. Note that this data is much larger, and the documents are much longer (they are whole movies!) so you might run in to scaling problems.