- Introduction
- Gradient Descent and Linear Classification *[20%]*
- Warm Up with ML Tools *[10%]*
- Reductions for Multiclass Classification *[30%]*
- Ranking or Collective Classification *[40%]*

**Files you'll edit:**

| File | Description |
| --- | --- |
| `gd.py` | Where you will put your gradient descent implementation. |
| `linear.py` | This is where your generic "regularized linear classifier" implementation will go. |

**Files you might want to look at:**

| File | Description |
| --- | --- |
| `binary.py` | Our generic interface for binary classifiers (actually works for regression and other types of classification, too). |
| `cfdata.py` | Includes (in Python format) all the collaborative filtering (course recommendation) data. |
| `datasets.py` | Where a handful of test data sets are stored. |
| `fileMaker.py` | Main helper code for generating files for input to megam, fastdt and libsvm. |
| `mlGraphics.py` | A few useful plotting commands. |
| `pixelExtractor.py` | Basic pixel extraction code. |
| `runClassifier.py` | A few wrappers for doing useful things with classifiers, like training them, generating learning curves, etc. |
| `util.py` | A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look! |
| `wordExtractor.py` | Basic bag-of-words extraction code. |
| `data/*` | All the datasets that we'll use. |

**What to submit:** You will hand in all of the Python files listed above under "Files you'll edit", as well as a `partners.txt` file that lists the **names** and **uids** (first four digits) of all members in your team. Finally, you'll hand in a `writeup.pdf` file that answers all the written questions in this assignment (denoted by **WU#:** in this `.html` file).

**Evaluation:** Your code will be autograded for
technical correctness. Please *do not* change the names of any
provided functions or classes within the code, or you will wreak havoc
on the autograder. However, the correctness of your implementation --
not the autograder's output -- will be the final judge of your score.
If necessary, we will review and grade assignments individually to
ensure that you receive due credit for your work.

**Academic Dishonesty:** We will be checking your code
against other submissions in the class for logical redundancy. If you
copy someone else's code and submit it with minor changes, we will
know. These cheat detectors are quite hard to fool, so please don't
try. We trust you all to submit your own work only; *please*
don't let us down. If you do, we will pursue the strongest
consequences available to us.

**Getting Help:** You are not alone! If you find
yourself stuck on something, contact the course staff for help.
Office hours, class time, and the mailing list are there for your
support; please use them. If you can't make our office hours, let us
know and we will schedule more. We want these projects to be
rewarding and instructional, not frustrating and demoralizing. But,
we don't know when or how to help unless you ask. One more piece of
advice: if you don't know what a variable is, print it out.

In each iteration of gradient descent, we will compute the gradient
and take a step in that direction, with step size `eta`. We
will have an *adaptive* step size, where `eta` is computed
as `stepSize` divided by the square root of the iteration
number (counting from one).
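As a concrete sketch, the loop described above might look like the following. This is a hypothetical implementation, not the official solution; your `gd.py` stub defines the actual interface, but the adaptive step size computation is exactly as described:

```python
import numpy as np

def gd(f, df, x0, numIter, stepSize):
    """Minimize f starting at x0, taking numIter gradient steps.

    Returns the final x and the trajectory of objective values
    (the initial value plus one entry per iteration).
    """
    x = x0
    trajectory = [f(x)]
    for t in range(1, numIter + 1):
        eta = stepSize / np.sqrt(t)   # adaptive step size, counting from one
        x = x - eta * df(x)           # step in the negative gradient direction
        trajectory.append(f(x))
    return x, np.array(trajectory)
```

Tracing the first step on `x^2` from `x0 = 10` with `stepSize = 0.2`: `eta = 0.2`, so `x` becomes `10 - 0.2 * 20 = 6` and the objective drops from 100 to 36, matching the transcript below.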

Once you have an implementation running, we can check it on a simple
example of minimizing the function `x^2`:

```
>>> gd.gd(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
(1.0034641051795872, array([ 100.        ,   36.        ,   18.5153247 ,
         10.95094653,    7.00860578,    4.72540613,    3.30810578,
          2.38344246,    1.75697198,    1.31968118,    1.00694021]))
```

You can see that the "solution" found is about 1, which is not great (it should be zero!), but it's better than the initial value of ten! If yours is going up rather than going down, you probably have a sign error somewhere!

We can let it run longer and plot the trajectory:

```
>>> x, trajectory = gd.gd(lambda x: x**2, lambda x: 2*x, 10, 100, 0.2)
>>> x
0.003645900464603937
>>> plot(trajectory)
```

It's now found a value close to zero, and you can see from the plot that the objective is decreasing.

**WU1:** Find a few values of step size where it converges and a
few values where it diverges. Where does the threshold seem to
be?

**WU2:** Come up with a *non-convex* univariate optimization
problem. Plot the function you're trying to minimize and show two
runs of `gd`, one where it gets caught in a local minimum and
one where it manages to make it to a global minimum. (Use different
starting points to accomplish this.)

If you implemented it well, this should work in multiple dimensions, too:

```
>>> x, trajectory = gd.gd(lambda x: linalg.norm(x)**2, lambda x: 2*x, array([10,5]), 100, 0.2)
>>> x
array([ 0.0036459 ,  0.00182295])
>>> plot(trajectory)
```

Our generic linear classifier implementation is in `linear.py`.

There are three loss function stubs: `SquaredLoss` (which is implemented for you!), `LogisticLoss`, and `HingeLoss` (both of which you'll have to implement). My suggestion is to hold off implementing the other two until you have the linear classifier working.
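For reference, here are the two losses and their derivatives with respect to the prediction, written as plain functions. The function names below are illustrative only; the actual stubs in `linear.py` define their own class interface, so treat this as a reminder of the math rather than code to paste in:

```python
import numpy as np

def logistic_loss(y, yhat):
    # log(1 + exp(-y * yhat)): smooth surrogate for 0/1 loss
    return np.log(1.0 + np.exp(-y * yhat))

def logistic_loss_grad(y, yhat):
    # derivative of the above with respect to yhat
    return -y / (1.0 + np.exp(y * yhat))

def hinge_loss(y, yhat):
    # max(0, 1 - y * yhat): zero once the margin exceeds one
    return np.maximum(0.0, 1.0 - y * yhat)

def hinge_loss_grad(y, yhat):
    # subgradient: -y inside the margin, zero outside
    return np.where(y * yhat < 1.0, -y, 0.0)
```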
The `LinearClassifier` class is a stub implementation of a
generic linear classifier with an l2 regularizer. It
is *unbiased*, so all you have to take care of are the weights.
Your implementation should go in `train`, which has a handful
of stubs. The idea is to just pass appropriate functions
to `gd` and have it do all the work. See the comments inline
in the code for more information.
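The "pass appropriate functions to `gd`" idea can be sketched as a pair of closures over the training data. The names here (`X`, `Y`, `lossFn`, and the `loss`/`lossGradient` methods) are assumptions for illustration, not necessarily the stub's exact interface; check the inline comments in `linear.py` for the real names:

```python
import numpy as np

def make_objective(X, Y, lossFn, lambda_):
    """Build the regularized objective and its gradient for gd.

    X is the (n, d) data matrix, Y the (n,) label vector, lossFn an
    object exposing loss(Y, Yhat) and lossGradient(Y, Yhat), and
    lambda_ the l2 regularization strength.
    """
    def f(w):
        # total loss on the data plus the l2 regularizer
        return np.sum(lossFn.loss(Y, X.dot(w))) + 0.5 * lambda_ * w.dot(w)

    def df(w):
        # chain rule: X^T times d(loss)/d(yhat), plus lambda * w
        return X.T.dot(lossFn.lossGradient(Y, X.dot(w))) + lambda_ * w

    return f, df
```

With this in hand, `train` is essentially one call: `w, trajectory = gd(f, df, zeros(d), numIter, stepSize)`.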

Once you've implemented the function evaluation and gradient, we can
test this. We'll begin with a very simple 2D example data set so that
we can plot the solutions. We'll also start with *no
regularizer* to help you figure out where errors might be if you
have them. (You'll have to import `mlGraphics` to make this
work.)

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 0, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.91, test accuracy 0.86
>>> h
w=array([ 2.73466371, -0.29563932])
>>> mlGraphics.plotLinearClassifier(h, datasets.TwoDAxisAligned.X, datasets.TwoDAxisAligned.Y)
```

Note that even though this data is clearly linearly separable, the unbiased classifier cannot separate it perfectly: without a bias term, the decision boundary is forced to pass through the origin.

If we change the regularizer, we'll get a slightly different solution:

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.9, test accuracy 0.86
>>> h
w=array([ 1.30221546, -0.06764756])
```

As expected, the weights are smaller in magnitude under the stronger regularizer.

Now, we can try different loss functions. Implement logistic loss and hinge loss. Here are some simple test cases:

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.33864367,  1.28110942])
```

```
>>> h = linear.LinearClassifier({'lossFunction': linear.LogisticLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.84385774,  3.13132617])
```

```
>>> h = linear.LinearClassifier({'lossFunction': linear.HingeLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 1.17110065,  4.67288657])
```

We've provided a simple feature extractor (`wordExtractor.py`)
for the text that first lower-cases everything, removes all
non-alphabetic characters (except spaces), and then treats each word as
a feature. To generate data for megam to distinguish between
comp.graphics and comp.windows.x, run:

```
% python wordExtractor.py megam data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.megam
```

We can do the same to generate test data:

```
% python wordExtractor.py megam data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.megam
```

Here, the arguments are the desired file output type, the data for class -1 and the data for class +1.
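The normalization that `wordExtractor.py` is described as performing can be illustrated in a few lines. This is a sketch of the idea under the description above (lower-case, keep only letters and spaces, one feature per word), not the script's actual code:

```python
import re

def extract_words(text):
    """Turn raw text into a list of word features."""
    text = text.lower()                    # lower-case everything
    text = re.sub(r'[^a-z ]', '', text)    # drop non-alphabetic chars (keep spaces)
    return text.split()                    # each remaining word is a feature

# extract_words("X-Windows, Graphics!") -> ['xwindows', 'graphics']
```

Note that punctuation is deleted rather than replaced by a space, so hyphenated tokens like "X-Windows" collapse into a single word.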

We can now train our classifier:

```
% megam -fvals binary train.megam > model.megam
```

The `-fvals` argument tells it that each feature has a
corresponding feature value (if not given, it assumes features are
binary). The resulting weights are stored in `model.megam`.
It should have run for one hundred iterations and achieved a training
error of 0.00085. We can now make predictions on the test data:

```
% megam -fvals -predict model.megam binary test.megam > predictions.megam
```

You should get a test error rate of 17.7%.

If you inspect the weights file, you should find a bias of 0.329 and different weights for the different words. For instance, "graphics" should have a weight of about 1.09 and "windows" should have a weight of -0.079.
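When inspecting the weights, a tiny helper like the following can save some squinting. It assumes the model file contains one whitespace-separated "featureName weight" pair per line; check your `model.megam` file's actual layout first, since this is a guess at the format rather than a documented parser:

```python
def read_weights(path):
    """Read 'name weight' pairs from a plain-text weights file into a dict.

    Lines that don't split into exactly two fields are skipped, so a
    header or comment line won't crash the parse.
    """
    weights = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                name, w = parts
                weights[name] = float(w)
    return weights
```

Sorting `weights.items()` by value then answers questions like WU4 directly: the first few entries are the most negative weights and the last few are the most positive.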

**WU4:** What are the five features with largest positive weight
and what are the five features with largest negative weight? Do these
seem "right" based on the task?

Next, we'll do the same for decision trees:

```
% python wordExtractor.py fastdt data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.fastdt
% python wordExtractor.py fastdt data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.fastdt
```

And train:

```
% FastDT -maxd 3 train.fastdt > model.fastdt
% FastDT -load model.fastdt test.fastdt > predictions.fastdt
```

Here, you should get a test error rate of 21.5%. If you inspect the `model.fastdt` file, you can see the tree printed in a format quite similar to ours from P1.

**WU5:** Draw the tree. How do the selected features compare to
the features from the logistic regression model? Which features seem
"better" and why? If you use a depth 10 tree, how well do you do on
test data?

Finally, we'll do support vector machines. It's pretty much the same as before:

```
% python wordExtractor.py libsvm data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.libsvm
% python wordExtractor.py libsvm data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.libsvm
```

Now we can train our SVM:

```
% svm-train -t 0 train.libsvm model.libsvm
% svm-predict test.libsvm model.libsvm predictions.libsvm
```

We should be informed that we got an accuracy of about 78.5%.

**Warning:** One of the really annoying things about libsvm
is that features have to be numeric, rather than strings. This means
we maintain a dictionary (stored in

All of the models we looked at in the warm up have different
hyperparameters. For megam, the hyperparameter is the regularization
coefficient, set by "`-lambda ###`" just like in P1. For
FastDT, the hyperparameter is the depth of the tree, set by
"`-maxd`". For libSVM, it is the value of "C", set by "`-c
###`".

**WU6:** Using comp.graphics versus comp.windows.x, plot training
and test error curves for each of the algorithms. For megam, use
lambda values of 2^x for x in -5, -4, ..., 4, 5. For FastDT, use
depths 1 through 20. For libsvm, use C values of 2^x for x in -5, -4,
..., 4, 5. Before actually running these experiments, what do you
expect to happen? What actually does happen?

Next, let's switch to some digits data. We have three digits: 1, 2 and 3, stored in the obviously-named files in the data directory, which just contain pixel values. Use `pixelExtractor.py` to make training data, for example:

```
% python pixelExtractor.py megam data/train.digit1 data/train.digit2 > train.megam
% python pixelExtractor.py megam data/test.digit1 data/test.digit2 > test.megam
% megam -fvals binary train.megam > model.megam
% megam -fvals -predict model.megam binary test.megam > predictions.megam
```

You should get a 7% error rate.

**WU7:** Comparing the performance of the three different
algorithms on the two tasks (text categorization versus digit
recognition), which one(s) perform best on one and which on the other?
Why?

For comparison: libSVM's built-in multiclass implementation (used if you give it multiclass data) is AVA, so you can compare your reduction against that.

**WU8:** For each of the three reductions, run your classifier on
the text classification problem with four classes. For the tree
reduction, make the first split {graphics,windows} versus
{baseball,hockey}. Tune your hyperparameters as well as you can and
report the best results you can for each of the three. Which one
wins? Which one was easiest to tune hyperparameters for?

**WU9:** Change the structure of the tree classifier so that the
first split is {graphics,baseball} versus {windows,hockey}. (Thus,
the hard decision is first, and the easy decisions come second.)
Retune your hyperparameters well. Does this work better or worse than the
previous split, and why?

Hand in your code in `multiclass.tgz`.

No matter which one you do, you may reduce to any system you want of
the ones used in this project (`linear.py`, libSVM, megam,
FastDT). Some of these have features that might be useful for the
different tasks. For instance, libSVM and megam support multiclass
classification internally, so for collective classification, if the
labels on each node in the graph are multiclass, then you can just
reduce to multiclass rather than all the way down to binary (which you
may find easier). Additionally, megam supports different weights
(costs) on examples (search for "`$$$WEIGHT`" in the
documentation), which is useful for ranking. So choose wisely or
you'll end up with lots of extra work!

Some dataset repositories that may or may not have appropriate data:

- http://data.gov/
- http://thedatahub.org/
- http://richard.cyganiak.de/2007/10/lod/
- http://archive.ics.uci.edu/ml/

**WU10a:** You've chosen ranking! First, implement the naive
ranking algorithm (Algs 16 and 17) from the book. Then, implement the
more complex ranking algorithm (Algs 18 and 19) from the book.
Compare their performance. (Note that if your ranking problem isn't
bipartite, you'll have to force it to be bipartite to make the naive
algorithm work: just do something that you think is reasonable to do
this.) How have you defined the cost function (omega) in the complex
model? In all cases, measure your performance according to whatever
metric you like the best, but it should *not* be zero/one loss:
it should be something more appropriate for ranking (F-measure, area
under the curve, etc.). Report on your experience.

**WU10b:** You've chosen collective classification! Implement the
stacking algorithm (Algs 20 and 21) from the book. Apply this to your
problem, and plot the accuracy of your classifier as a function of the
number of levels in the stack. Do you observe that stacking helps?
I.e., does some layer >1 perform better than layer 1? If not, perhaps
you're not using sufficiently helpful features between the layers.
Does the stack ever overfit? Plot your training error versus your
test error as a function of the number of layers, and if you observe
massive overfitting, you might need to do cross-validation to
attenuate this. Report on your experience.

For both of these, I expect about a 1-2 page writeup, including appropriate figures.