CMSC 726 Project 2: Complex Classification

Introduction

This project has three main components. The first is, in a sense, a continuation of P1: building linear classifiers under various loss functions using (sub)gradient descent. The second part is about using external tools to solve complex classification problems (multiclass classification, ranking and collective classification). The purpose of this project is to help you understand the trade-offs between expressive features, model complexity (and regularization) and the learning model. You will work with three classification libraries: one for logistic regression (megam), one for decision trees (fastdt) and one for support vector machines (libsvm). These are all installed on the junkfood machines in ~hal/bin, or you can install them on your own computer if you'd like. You may/should download all the P2 files here.

Files you'll edit:
gd.py Where you will put your gradient descent implementation.
linear.py This is where your generic "regularized linear classifier" implementation will go.
Files you might want to look at:
binary.py Our generic interface for binary classifiers (actually works for regression and other types of classification, too).
cfdata.py Includes (in python format) all the collaborative filtering (course recommendation) data.
datasets.py Where a handful of test data sets are stored.
fileMaker.py Main helper code for generating files for input to megam, fastdt and libsvm.
mlGraphics.py A few useful plotting commands.
pixelExtractor.py Basic pixel extraction code.
runClassifier.py A few wrappers for doing useful things with classifiers, like training them, generating learning curves, etc.
util.py A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!
wordExtractor.py Basic bag of words extraction code.
data/* All the datasets that we'll use.

What to submit: You will hand in all of the python files listed above under "Files you'll edit" as well as a partners.txt file that lists the names and uids (first four digits) of all members in your team. Finally, you'll hand in a writeup.pdf file that answers all the written questions in this assignment (denoted by WU#: in this .html file).

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and the mailing list are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.

Gradient Descent and Linear Classification [20%]

To get started with linear models, we will implement a generic gradient descent method. This should go in gd.py, which contains a single (short) function, gd. It takes five parameters: the function we're optimizing, its gradient, an initial position, the number of iterations to run, and an initial step size.

In each iteration of gradient descent, we will compute the gradient and take a step in the opposite (downhill) direction, with step size eta. We will use an adaptive step size, where eta is computed as stepSize divided by the square root of the iteration number (counting from one).
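For concreteness, here is a minimal sketch of what such a loop might look like, assuming gd returns the final point together with an array of objective values (the starting value plus one per iteration), as in the examples below; treat the stub in gd.py as the authoritative interface.

import numpy as np

def gd(func, grad, x0, numIter, stepSize):
    # minimal sketch; argument names are assumptions, so follow the stub in gd.py
    x = x0
    trajectory = [func(x)]              # objective value at the starting point
    for t in range(1, numIter + 1):     # iterations counted from one
        eta = stepSize / np.sqrt(t)     # adaptive step size
        x = x - eta * grad(x)           # step in the negative gradient direction
        trajectory.append(func(x))
    return x, np.array(trajectory)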

Once you have an implementation running, we can check it on a simple example of minimizing the function x^2:

>>> gd.gd(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
(1.0034641051795872, array([ 100.        ,   36.        ,   18.5153247 ,   10.95094653,
          7.00860578,    4.72540613,    3.30810578,    2.38344246,
          1.75697198,    1.31968118,    1.00694021]))
You can see that the "solution" found is about 1, which is not great (it should be zero!), but it's better than the initial value of ten! If yours is going up rather than going down, you probably have a sign error somewhere!

We can let it run longer and plot the trajectory:

>>> x, trajectory = gd.gd(lambda x: x**2, lambda x: 2*x, 10, 100, 0.2)
>>> x
0.003645900464603937
>>> plot(trajectory)
It has now found a value close to zero, and you can see from the plot that the objective is decreasing.

WU1: Find a few values of step size where it converges and a few values where it diverges. Where does the threshold seem to be?

WU2: Come up with a non-convex univariate optimization problem. Plot the function you're trying to minimize and show two runs of gd, one where it gets caught in a local minimum and one where it manages to make it to a global minimum. (Use different starting points to accomplish this.)

If you implemented it well, this should work in multiple dimensions, too:

>>> x, trajectory = gd.gd(lambda x: linalg.norm(x)**2, lambda x: 2*x, array([10,5]), 100, 0.2)
>>> x
array([ 0.0036459 ,  0.00182295])
>>> plot(trajectory)
Our generic linear classifier implementation is in linear.py. The way this works is as follows. We have an interface LossFunction that we want to minimize. It must be able to compute the loss for a pair Y and Yhat, where the former is the truth and the latter are the predictions. It must also be able to compute a gradient when additionally given the data X. This interface is all you need for this part.
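To make the interface concrete, here is roughly what a squared-loss instance could look like (the real one is already provided in linear.py; the method names loss and lossGradient here are assumptions, so follow whatever the stubs declare). For squared loss, the loss is 0.5 ||Y - Yhat||^2, and with Yhat = Xw its gradient with respect to the weights is -X^T (Y - Yhat).

import numpy as np

class SquaredLoss:
    def loss(self, Y, Yhat):
        # 0.5 times the sum of squared differences between truth and predictions
        return 0.5 * np.sum((Y - Yhat) ** 2)

    def lossGradient(self, X, Y, Yhat):
        # gradient with respect to the weights, assuming Yhat = X w
        return -np.dot(X.T, Y - Yhat)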

There are three loss function stubs: SquaredLoss (which is implemented for you!), LogisticLoss and HingeLoss (both of which you'll have to implement). My suggestion is to hold off implementing the other two until you have the linear classifier working with the provided SquaredLoss.

The LinearClassifier class is a stub implementation of a generic linear classifier with an l2 regularizer. It is unbiased, so all you have to take care of are the weights. Your implementation should go in train, which has a handful of stubs. The idea is to pass appropriate functions to gd and have it do all the work; see the inline comments in the code for more information.
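Concretely, the two functions you hand to gd could be built roughly like this (a sketch under the assumed loss interface above; lam stands in for the 'lambda' regularization coefficient from the options):

import numpy as np

def makeObjectiveAndGradient(lossFn, X, Y, lam):
    # objective(w) = loss(Y, Xw) + (lam / 2) * ||w||^2, together with its gradient
    def obj(w):
        return lossFn.loss(Y, np.dot(X, w)) + 0.5 * lam * np.dot(w, w)
    def grad(w):
        return lossFn.lossGradient(X, Y, np.dot(X, w)) + lam * w
    return obj, grad

train could then call gd(obj, grad, w0, numIter, stepSize) with some initial weight vector w0 (e.g., all zeros) and store the resulting weights.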

Once you've implemented the function evaluation and gradient, we can test this. We'll begin with a very simple 2D example data set so that we can plot the solutions. We'll also start with no regularizer to help you figure out where errors might be if you have them. (You'll have to import mlGraphics to make this work.)

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 0, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.91, test accuracy 0.86
>>> h
w=array([ 2.73466371, -0.29563932])
>>> mlGraphics.plotLinearClassifier(h, datasets.TwoDAxisAligned.X, datasets.TwoDAxisAligned.Y)
Note that even though this data is clearly linearly separable, the unbiased classifier is unable to perfectly separate it.

If we change the regularizer, we'll get a slightly different solution:

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.9, test accuracy 0.86
>>> h
w=array([ 1.30221546, -0.06764756])
As expected, the weights are smaller.

Now, we can try different loss functions. Implement logistic loss and hinge loss; their standard definitions are sketched below, followed by some simple test cases.
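For reference, with y in {-1, +1} and prediction yhat, the standard definitions are logistic loss = log(1 + exp(-y * yhat)) and hinge loss = max(0, 1 - y * yhat), summed over examples. A quick sketch of the loss values only (the gradients, and wiring these into the LogisticLoss and HingeLoss stubs, are up to you):

import numpy as np

def logisticLossValue(Y, Yhat):
    # sum over examples of log(1 + exp(-y * yhat))
    return np.sum(np.log(1.0 + np.exp(-Y * Yhat)))

def hingeLossValue(Y, Yhat):
    # sum over examples of max(0, 1 - y * yhat)
    return np.sum(np.maximum(0.0, 1.0 - Y * Yhat))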

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.33864367,  1.28110942])
>>> h = linear.LinearClassifier({'lossFunction': linear.HingeLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 1.17110065,  4.67288657])

WU3: Why does the logistic classifier produce much larger weights than the others, even though they all get basically the same classification performance?

Warm Up with ML Tools [10%]

Our first task is to ensure that you are able to successfully run the three classifiers and to make sure you understand the appropriate file formats. We'll start with a text classification example. This data is drawn from the twenty newsgroups data set, but we'll only look at four newsgroups: comp.graphics, comp.windows.x, rec.sports.baseball and rec.sports.hockey. These are stored as train/test text files, where each line corresponds to a post and all new-lines have been replaced with tabs.

We've provided a simple feature extractor (wordExtractor.py) for the text that first lower-cases everything, removes all non-alphabetic characters (except spaces) and then treats each word as a feature. To generate data for megam to distinguish between comp.graphics and comp.windows.x, run:

% python wordExtractor.py megam data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.megam

We can do the same to generate test data:

% python wordExtractor.py megam data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.megam

Here, the arguments are the desired file output type, the data for class -1 and the data for class +1.
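For intuition, the preprocessing wordExtractor.py performs amounts to roughly the following (a sketch of the idea, not its exact code; whether the feature value is a count or something else is up to the extractor):

import re
from collections import Counter

def bagOfWords(post):
    # lower-case, turn anything that isn't a letter into a space, then
    # count how often each remaining word appears
    text = re.sub('[^a-z]', ' ', post.lower())
    return Counter(text.split())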

We can now train our classifier:

% megam -fvals binary train.megam > model.megam

The -fvals argument tells it that each feature has a corresponding feature value (if not given, it assumes features are binary). The resulting weights are stored in model.megam. It should have run for one hundred iterations and achieved a training error of 0.00085. We can now make predictions on the test data:

% megam -fvals -predict model.megam binary test.megam > predictions.megam

You should get a test error rate of 17.7%.

If you inspect the weights file, you should find a bias of 0.329 and different weights for the different words. For instance, "graphics" should have a weight of about 1.09 and "windows" should have a weight of -0.079.

WU4: What are the five features with largest positive weight and what are the five features with largest negative weight? Do these seem "right" based on the task?

Next, we'll do the same for decision trees:

% python wordExtractor.py fastdt data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.fastdt
% python wordExtractor.py fastdt data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.fastdt

And train:

% FastDT -maxd 3 train.fastdt > model.fastdt
% FastDT -load model.fastdt test.fastdt > predictions.fastdt

Here, you should get a test error rate of 21.5%. If you inspect the model.fastdt file, you can see the tree printed in a format quite similar to ours from P1.

WU5: Draw the tree. How do the selected features compare to the features from the logistic regression model? Which features seem "better" and why? If you use a depth 10 tree, how well do you do on test data?

Finally, we'll do support vector machines. It's pretty much the same as before:

% python wordExtractor.py libsvm data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.libsvm
% python wordExtractor.py libsvm data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.libsvm

Now, we can train our SVM:

% svm-train -t 0 train.libsvm model.libsvm
% svm-predict test.libsvm model.libsvm predictions.libsvm

We should be informed that we got an accuracy of about 78.5%.

Warning: One of the really annoying things about libsvm is that features have to be numeric, rather than strings. This means we maintain a dictionary (stored in libsvm-dictionary) that maps string features to numeric ids. This is automatically created and read whenever you generate libsvm data. However, when you switch between tasks, or change your feature representations or whatever, you'll probably want to delete this dictionary file. In general, if you follow the rubric of "delete file, then generate train data, then generate test data", you'll be safe. If you want to interpret the libsvm models, you'll need to look at the dictionary to figure out what the different features are.

All of the models we looked at in the warm up have different hyperparameters. For megam, the hyperparameter is the regularization coefficient, set by "-lambda ###" just like in P1. For FastDT, the hyperparameter is the depth of the tree, set by "-maxd". For libSVM, it is the value of "C", set by "-c ###".

WU6: Using comp.graphics versus comp.windows.x, plot training and test error curves for each of the algorithms. For megam, use lambda values of 2^x for x in -5, -4, ..., 4, 5. For FastDT, use depths 1 through 20. For libsvm, use C values of 2^x for x in -5, -4, ..., 4, 5. Before actually running these experiments, what do you expect to happen? What actually does happen?

Next, let's switch to some digits data. We have three digits: 1, 2 and 3, in the obviously-named files in the data directory. These just store pixel values. Use pixelExtractor.py to make training data, for example:

% python pixelExtractor.py megam data/train.digit1 data/train.digit2 > train.megam
% python pixelExtractor.py megam data/test.digit1 data/test.digit2 > test.megam
% megam -fvals binary train.megam > model.megam
% megam -fvals -predict model.megam binary test.megam  > predictions.megam

You should get a 7% error rate.

WU7: Comparing the performance of the three different algorithms on the two tasks (text categorization versus digit recognition), which one(s) perform best on one and which on the other? Why?

Reductions for Multiclass Classification [30%]

In this section, you will explore the differences between three multiclass-to-binary reductions: one-versus-all (OVA), all-versus-all (AVA) and a tree-based reduction (TREE). You may implement these in whatever language you want, but your implementation should reduce either to one of the binary classifiers used above (libSVM, megam or FastDT) or to the Python BinaryClassifier class. If you do the latter, obviously you should do it in Python :).
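As a rough illustration of the OVA idea (AVA and TREE follow a similar pattern of building several binary problems), here is a sketch against a hypothetical binary learner with train(X, Y) and margin(x) methods; this is not necessarily binary.py's actual interface, so adapt it to whichever backend you choose.

import numpy as np

class OVA:
    def __init__(self, numClasses, makeBinaryLearner):
        self.numClasses = numClasses
        self.makeBinaryLearner = makeBinaryLearner   # factory that returns a fresh binary learner
        self.learners = []

    def train(self, X, Y):
        # one binary problem per class: class k (positive) versus the rest (negative)
        self.learners = []
        for k in range(self.numClasses):
            binY = np.where(Y == k, 1.0, -1.0)
            h = self.makeBinaryLearner()
            h.train(X, binY)
            self.learners.append(h)

    def predict(self, x):
        # predict the class whose binary learner is most confident
        scores = [h.margin(x) for h in self.learners]
        return int(np.argmax(scores))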

For comparison, libSVM's built-in multiclass support (what you get if you hand it multiclass data) is AVA, so you can compare your implementation to that.

WU8: For each of the three reductions, run your classifier on the text classification problem with four classes. For the tree reduction, make the first split {graphics,windows} versus {baseball,hockey}. Tune your hyperparameters as well as you can and report the best results you can for each of the three. Which one wins? Which one was easiest to tune hyperparameters for?

WU9: Change the structure of the tree classifier so that the first split is {graphics,baseball} versus {windows,hockey}. (Thus, the hard decision is first, and the easy decisions come second.) Retune your hyperparameters as well as you can. Does this work better or worse than the previous split, and why?

Hand in your code in multiclass.tgz.

Ranking or Collective Classification [40%]

In this part, you may choose to either solve a ranking problem (using the ranking algorithm from the book) or a collective classification problem (again, using the algorithm from the book). However, it is up to you to find a data set on which to run this algorithm. So either pick something you've used before that fits into one of these two problems, or pick something from one of the data set repositories listed below, or talk to Hal if you need some hints (talk to him soon!). If you choose the ranking task, please answer WU10a; if you choose the collective classification task, please answer WU10b. Submit all your code in complex.tgz.

No matter which one you do, you may reduce to any of the systems used in this project (linear.py, libSVM, megam, FastDT). Some of these have features that might be useful for the different tasks. For instance, libSVM and megam support multiclass classification internally, so for collective classification, if the labels on the nodes in the graph are multiclass, you can reduce to multiclass rather than all the way down to binary (which you may find easier). Additionally, megam supports different weights (costs) on examples (search for "$$$WEIGHT" in its documentation), which is useful for ranking. So choose wisely or you'll end up with lots of extra work!

Some dataset repositories that may or may not have appropriate data:

For both tasks, you will probably have to create some of your own features. You should do enough of this that you can get reasonable performance, but don't kill yourself trying to get the best performance imaginable.

WU10a: You've chosen ranking! First, implement the naive ranking algorithm (Algs 16 and 17) from the book. Then, implement the more complex ranking algorithm (Algs 18 and 19) from the book. Compare their performance. (Note that if your ranking problem isn't bipartite, you'll have to force it to be bipartite to make the naive algorithm work: just do something that you think is reasonable to do this.) How have you defined the cost function (omega) in the complex model? In all cases, measure your performance according to whatever metric you like the best, but it should not be zero/one loss: it should be something more appropriate for ranking (F-measure, area under the curve, etc.). Report on your experience.

WU10b: You've chosen collective classification! Implement the stacking algorithm (Algs 20 and 21) from the book. Apply this to your problem, and plot the accuracy of your classifier as a function of the number of levels in the stack. Do you observe that stacking helps? I.e., does some layer >1 perform better than layer 1? If not, perhaps you're not using sufficiently helpful features between the layers. Does the stack ever overfit? Plot your training error versus your test error as a function of the number of layers, and if you observe massive overfitting, you might need to do cross-validation to attenuate this. Report on your experience.

For both of these, I expect about a 1-2 page writeup, including appropriate figures.