The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files (including this description) as a tar archive.
Files you'll edit:

  generate.py   The place where you'll put your code for Part I (generation).
  analyze.py    The place where you'll put your code for Part II (analysis).
  bengali.py    The place where you'll put your code for Part III (segmenting an unknown language).

Files you might want to look at:

  FSM.py        Code for interfacing Python with Carmel.
  util.py      A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.
>>> runTest()
panic -> panic
panic+ed -> paniced
panic+ing -> panicing
panic+s -> panics
picnic -> picnic
picnic+ed -> picniced
picnic+ing -> picnicing
picnic+s -> picnics
...
frolic -> frolic
frolic+ed -> froliced
frolic+ing -> frolicing
frolic+s -> frolics

If you need help with regular expression syntax in Python, see here. Note that you will be graded not just on the words in this list, but possibly on other words as well. (We will not test you on other suffixes, though.)
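The suffix-attachment behavior shown above can be sanity-checked in a few lines of plain Python. This is a hedged sketch only: `attach` is a hypothetical helper name, not part of the provided code, and it assumes one common spelling rule (drop a root-final 'e' before a vowel-initial suffix) plus plain concatenation.

```python
import re

def attach(word):
    """Hypothetical helper (not from the provided code): drop a root-final
    'e' before a suffix starting with a vowel, then delete the '+'."""
    word = re.sub(r'e\+([aeiou])', r'\1', word)
    return word.replace('+', '')

print(attach('panic+ed'))   # -> paniced (matches the runTest output above)
print(attach('frolic+s'))   # -> frolics
print(attach('ace+ed'))     # -> aced
```

This is only a reference point for the behavior your regular expressions should produce, not a substitute for them.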
>>> fst = FSM(isTransducer=True)
>>> fst.addEdge('start', 'start', '.', '.')
>>> fst.addEdge('start', 'start', None, 'a')
>>> fst.addEdge('start', 'start', None, 'e')
>>> fst.addEdge('start', 'start', None, 'i')
>>> fst.addEdge('start', 'start', None, 'o')
>>> fst.addEdge('start', 'start', None, 'u')
>>> fst.setInitialState('start')
>>> fst.setFinalState('start')
>>> runFST([fst], ['dragon'], maxNumPaths=10)
executing: ~hal/bin/carmel -rIQEb -k 10 .tmp.fst.0 .tmp.fst.strings > .tmp.output
Input line 1: "d" "r" "a" "g" "o" "n" (7 states / 8 arcs)
Derivations found for all 1 inputs
Viterbi (best path) product of probs=1, probability=2^0 per-input-symbol-perplexity(N=6)=2^-0 per-line-perplexity(N=1)=2^-0
[['drgn', 'drgon', 'dragn', 'dragon']]

We first create an FSM, telling it that it's a transducer. We then add a number of edges. The first edge is from a state named "start" to itself; the "'.', '.'" is syntactic sugar for "accept any character (a-z) and emit the same character." There are then edges for each vowel, where we read nothing (None means "epsilon") and produce that vowel. Finally, we set the initial and final states and then run it on the string "dragon", returning at most 10 paths. In this case, the paths are exactly what we want. You can ignore the rest of the Carmel output :). One thing to keep in mind is the order of composition: the strings will be applied on the RIGHT of the FST, which is why we map epsilon TO vowel, rather than the other way around.

We can also make an acceptor that accepts only two strings: "dragn" and "drgon" (just for fun). We could do this by hand, but that would be tedious, so there's a bit of helper code:
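To build intuition for what this epsilon-to-vowel transducer computes, here is a pure-Python sketch (not part of the provided code): since the input string is applied on the right, the set of outputs for "dragon" is exactly the set of strings obtainable by deleting some subset of its vowels.

```python
from itertools import product

def devowelings(word, vowels='aeiou'):
    # Hedged sketch (not from the provided code): enumerate every string
    # obtainable by optionally deleting each vowel of `word` -- the set
    # the epsilon->vowel FST maps onto `word`.
    options = [(ch, '') if ch in vowels else (ch,) for ch in word]
    return sorted({''.join(p) for p in product(*options)})

print(devowelings('dragon'))  # -> ['dragn', 'dragon', 'drgn', 'drgon']
```

The order differs from Carmel's path order, but the set of strings is the same.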
>>> fsa = FSM()
>>> fsa.addEdgeSequence('start', 'end', 'dragn')
>>> fsa.addEdgeSequence('start', 'end', 'drgon')
>>> fsa.setInitialState('start')
>>> fsa.setFinalState('end')
>>> runFST([fsa], ['drgn', 'dragn', 'drgon', 'dragon'], maxNumPaths=10)
executing: ~hal/bin/carmel -rIQEb -k 10 .tmp.fst.0 .tmp.fst.strings > .tmp.output
Input line 1: "d" "r" "g" "n" (0 states / 0 arcs)
Empty or invalid result of composition with transducer ".tmp.fst.0".
Input line 2: "d" "r" "a" "g" "n" (8 states / 7 arcs reduce-> 6/5)
Input line 3: "d" "r" "g" "o" "n" (8 states / 7 arcs reduce-> 6/5)
Input line 4: "d" "r" "a" "g" "o" "n" (0 states / 0 arcs)
Empty or invalid result of composition with transducer ".tmp.fst.0".
No derivations found for 2 of 4 inputs
Viterbi (best path) product of probs=1, probability=2^0 per-input-symbol-perplexity(N=20)=2^-0 per-line-perplexity(N=2)=2^-0, excluding 2 0 probabilities (i.e. real ppx is infinite).
[[], ['dragn'], ['drgon'], []]

The final value shows that it accepted only the middle two strings. We can run the two machines together (composition) as:
>>> runFST([fsa, fst], ['dragon'], maxNumPaths=10)
executing: ~hal/bin/carmel -rIQEb -k 10 .tmp.fst.0 .tmp.fst.1 .tmp.fst.strings > .tmp.output
Input line 1: "d" "r" "a" "g" "o" "n" (7 states / 8 arcs) (14 states / 14 arcs reduce-> 12/12)
Derivations found for all 1 inputs
Viterbi (best path) product of probs=1, probability=2^0 per-input-symbol-perplexity(N=6)=2^-0 per-line-perplexity(N=1)=2^-0
[['drgon', 'dragn']]

Voila! If you want to see what the FSMs look like, you can open the .tmp.fst.# files.

Now, on to the real fun. Your job is to implement, in analyze.py, finite state machines that mimic what you did using regular expressions. Most of this is written for you. The two things you have to do are (1) correctly build the source model and (2), just like before, implement the "ck" case in the channel model. Once you've done this, you should be able to run the simpleTest function (Carmel output has been removed for clarity):
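Since the first machine here is an unweighted acceptor, this composition behaves like set intersection: the result is the FST's candidate set restricted to the acceptor's language. A minimal sketch (the function name is illustrative, not from the provided code):

```python
def compose_outputs(acceptor_language, fst_candidates):
    # Composing with an unweighted acceptor keeps only those candidates it
    # accepts; the surviving strings may come out in a different order
    # than Carmel reports them.
    return sorted(set(fst_candidates) & set(acceptor_language))

print(compose_outputs({'dragn', 'drgon'},
                      ['drgn', 'drgon', 'dragn', 'dragon']))  # -> ['dragn', 'drgon']
```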
>>> simpleTest()
==== Trying source model on strings 'ace+ed' ====
==== Result: [['ace+ed']] ====
==== Trying source model on strings 'panic+ing' ====
==== Result: [['panic+ing']] ====
==== Generating random paths for 'aced', using only channel model ====
==== Result: [['a+ced', 'ace+ed', 'ac+ed', 'a+ced', 'ace+ed', 'aced', 'ac+ed', 'a+ced', 'ace+d', 'a+ced']] ====
==== Disambiguating a few phrases: aced, panicked, paniced, sprucing ====
==== Result: [['ace+ed'], ['panic+ed'], ['panic+ed'], ['spruce+ing']] ====

Again, you'll be tested on more words, but not more suffixes.
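The disambiguation step above is noisy-channel decoding: the channel proposes underlying forms that could surface as the observed word, and the source model chooses among them. A minimal sketch, assuming the same e-deletion spelling rule and a toy scoring function (none of these names come from the provided code, and the real models are weighted FSMs, not Python functions):

```python
import re

def surface_form(analysis):
    # Assumed spelling rule: drop a root-final 'e' before a vowel-initial
    # suffix, then remove the '+' boundary marker.
    return re.sub(r'e\+([aeiou])', r'\1', analysis).replace('+', '')

def disambiguate(surface, analyses, source_score):
    # Keep only analyses the channel could map to `surface`, then return
    # the one the source model scores highest.
    viable = [a for a in analyses if surface_form(a) == surface]
    return max(viable, key=source_score) if viable else None

# Toy source model that just prefers longer underlying forms:
print(disambiguate('aced', ['ace+ed', 'ac+ed', 'a+ced'], len))  # -> ace+ed
```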
>>> out = runTest(channel=buildSegmentChannelModel)
before training, P/R/F = (0.5191740412979351, 0.8321513002364066, 0.63941871026339681)
... (16 iterations) ...
after training, P/R/F = (1.0, 0.0, 0.0)
>>> out = runTest(source=bigramSourceModel)
before training, P/R/F = (0.89473684210526316, 0.040189125295508277, 0.076923076923076927)
... (1 iteration) ...
after training, P/R/F = (0.76000000000000001, 0.40425531914893614, 0.52777777777777779)
>>> out = runTest(channel=buildSegmentChannelModel, source=bigramSourceModel)
before training, P/R/F = (0.63418803418803416, 0.87706855791962179, 0.73611111111111116)
... (16 iterations) ...
after training, P/R/F = (0.76000000000000001, 0.40425531914893614, 0.52777777777777779)

What this suggests is that the segment channel model isn't all it's cracked up to be.

This part of the assignment is worth 50% of your grade: 40% is for correctly implementing these two functions, and the last 10% is for how well you do on the test data. You are welcome to do whatever you want to build a finite state machine for solving this problem. You should implement these in fancySourceModel and fancyChannelModel. You can run them by saying:
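The idea behind a bigram source model can be sketched outside the FSM framework: estimate P(next character | previous character) by relative frequency over the training words, with boundary markers at the start and end. (A hedged sketch; the function below is not from the provided code, and your actual model must be expressed as a weighted FSM.)

```python
from collections import defaultdict

def bigram_probs(words):
    # Count character bigrams with '^' / '$' as start / end markers,
    # then normalize each row into a conditional distribution.
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = ['^'] + list(w) + ['$']
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

probs = bigram_probs(['aba', 'ab'])
print(probs['a'])  # P('b'|'a') = 2/3, P(end|'a') = 1/3
```

In practice you would also want smoothing so that unseen bigrams don't get zero probability.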
>>> output = runTest(source=fancySourceModel, channel=fancyChannelModel)

Once you're happy, you can try running on the test data and saving your output to disk:
>>> output = runTest(devFile="bengali.test", source=fancySourceModel, channel=fancyChannelModel)
>>> saveOutput('bengali.test.predictions', output)

This test predictions file is what you should hand in. Every night at midnight, we'll evaluate your test predictions and post your score on the leaderboard, along with a baseline submission that uses the segment channel model and bigram source model. For every point of F-score by which you beat this baseline, you'll get 1% added to your grade. Additionally, the first-place team will get another 5%, the second-place team another 4%, the third-place team another 3%, and anyone who beats the baseline by at least 0.01 will get an almost-free 2%.