CS 422 Project 2: Multiclass Classification and Linear Models

Introduction

This project has two main components. The first part is about using reductions to solve complex classification problems (multiclass classification). The second is, in a sense, a continuation of P1: building linear classifiers under various loss functions using (sub)gradient descent. You may/should download all the P2 files here.

Files you'll edit:
gd.py Where you will put your gradient descent implementation.
linear.py This is where your generic "regularized linear classifier" implementation will go.
multiclass.py This is where your multiclass-to-binary reduction implementations (OVA, AVA, and TREE) will go.
Files you might want to look at:
binary.py Our generic interface for binary classifiers (actually works for regression and other types of classification, too).
datasets.py Where a handful of test data sets are stored.
mlGraphics.py A few useful plotting commands.
runClassifier.py A few wrappers for doing useful things with classifiers, like training them, generating learning curves, etc.
util.py A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!
data/* All the datasets that we'll use.

What to submit: You will hand in all of the python files listed above under "Files you'll edit", as well as a partners.txt file that lists the names and uids (first four digits) of all members in your team. Finally, you'll hand in a writeup.pdf file that answers all the written questions in this assignment (denoted by WU#: in this .html file).

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and the mailing list are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.

Judging Classifier Goodness [20%]

This section is based on LAB4.

WU1 (10%): Hand in your answers to C, D, and E from LAB4. Please include plots for the questions that ask for them.

WU2 (10%): On the sentiment data, use FastDT to train decision trees of every depth from 1 to 20. Use the development data to choose the optimal depth; call it d*. What development error do you get for d*? Which other depths are not statistically significantly worse than d*? Use a paired t-test at the 95% confidence level to answer this question. Please write a couple of sentences describing what you did to evaluate this, as well as what your answer is.
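If it helps, a minimal sketch of the significance test using scipy is below; the per-example 0/1 error vectors (errsA, errsB) are hypothetical inputs you would compute from the development predictions, and LAB4 may prescribe a slightly different test.

from scipy.stats import ttest_rel

def notSignificantlyWorse(errsA, errsB, alpha=0.05):
    # paired t-test over per-example 0/1 dev errors for two depths; True means
    # we fail to reject "same mean error" at the (1 - alpha) confidence level
    tstat, pval = ttest_rel(errsA, errsB)
    return pval > alpha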

Reductions for Multiclass Classification [40%]

In this section, you will explore the differences between three multiclass-to-binary reductions: one-versus-all (OVA), all-versus-all (AVA) and a tree-based reduction (TREE). This is largely based on LAB5, and the evaluation will be on the wine data (note that there's actually a famous dataset called "wine"... the one we're using has nothing to do with that!).

First, you must implement AVA (the multiclass.py file that comes with this project is identical to the one from the lab, except for the existence of the extra class for trees); see the lab for test cases. Second, you must implement the tree-based reduction. Most of train is given to you, but you must write predict entirely on your own; a sketch of the prediction-time traversal follows the transcript below. I've provided a tree class to help you:

>>> t = makeBalancedTree(range(6))
>>> t
[[0 [1 2]] [3 [4 5]]]
>>> t.isLeaf
False
>>> t.getLeft()
[0 [1 2]]
>>> t.getLeft().getLeft()
0
>>> t.getLeft().getLeft().isLeaf
True
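Here is a minimal sketch of how predict might walk such a tree; the dict nodeClf (mapping each internal node to its trained binary classifier), the getLabel accessor, and getRight are all assumptions, so match them to what the provided tree class and your train actually use.

def treePredict(node, x, nodeClf):
    # descend from the root, letting each internal node's binary classifier
    # choose the left or right subtree, until we hit a leaf (a class label)
    if node.isLeaf:
        return node.getLabel()   # hypothetical accessor for the leaf's label
    if nodeClf[node].predict(x) > 0:
        return treePredict(node.getLeft(), x, nodeClf)
    else:
        return treePredict(node.getRight(), x, nodeClf)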
WU3 (10%): From LAB5, answer questions A, B, C, and D.

WU4 (10%): Using decision trees of a fixed depth for each underlying binary classifier (choose the depth as well as you can!), train AVA, OVA, and TREE (using balanced trees) on the wine data. Which does best?

WEC (5%): Build a better tree (any way you want) than the balanced binary tree. Fill in your code for this in getMyTreeForWine, which defaults to a balanced tree.

Gradient Descent and Linear Classification [40%]

To get started with linear models, we will implement a generic gradient descent method. This should go in gd.py, which contains a single (short) function, gd. It takes five parameters: the function we're optimizing, its gradient, an initial position, the number of iterations to run, and an initial step size.

In each iteration of gradient descent, we will compute the gradient and take a step in that direction, with step size eta. We will have an adaptive step size, where eta is computed as stepSize divided by the square root of the iteration number (counting from one).
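Concretely, here is a minimal sketch of gd consistent with the transcripts below (your bookkeeping for the trajectory may differ; the stub in gd.py is the authority on the exact interface):

from numpy import sqrt, zeros

def gd(func, grad, x0, numIter, stepSize):
    # minimize func starting from x0; return the final point and the
    # objective value at every step (including the starting point)
    x = x0
    trajectory = zeros(numIter + 1)
    trajectory[0] = func(x)
    for t in range(1, numIter + 1):
        eta = stepSize / sqrt(t)      # adaptive step size, t counting from one
        x = x - eta * grad(x)         # step in the negative gradient direction
        trajectory[t] = func(x)
    return x, trajectory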

Once you have an implementation running, we can check it on a simple example of minimizing the function x^2:

>>> gd.gd(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
(1.0034641051795872, array([ 100.        ,   36.        ,   18.5153247 ,   10.95094653,
          7.00860578,    4.72540613,    3.30810578,    2.38344246,
          1.75697198,    1.31968118,    1.00694021]))
You can see that the "solution" found is about 1, which is not great (it should be zero!), but it's better than the initial value of ten! If yours is going up rather than going down, you probably have a sign error somewhere!

We can let it run longer and plot the trajectory:

>>> x, trajectory = gd.gd(lambda x: x**2, lambda x: 2*x, 10, 100, 0.2)
>>> x
0.003645900464603937
>>> plot(trajectory)
It's now found a value close to zero and you can see that the objective is decreasing by looking at the plot.

WU5 (5%): Find a few values of step size where it converges and a few values where it diverges. Where does the threshold seem to be?

WU6 (5%): Come up with a non-convex univariate optimization problem. Plot the function you're trying to minimize and show two runs of gd, one where it gets caught in a local minimum and one where it manages to make it to a global minimum. (Use different starting points to accomplish this.)
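If you want a concrete starting point for WU6, here is one possible setup (hedged: the function, starting points, and step size below are just one choice among many that work):

import gd

# f has a local minimum near x = 1.13 and a global minimum near x = -1.30
f  = lambda x: x**4 - 3*x**2 + x
df = lambda x: 4*x**3 - 6*x + 1

xLocal,  trajLocal  = gd.gd(f, df,  2.0, 100, 0.05)  # gets caught near x =  1.13
xGlobal, trajGlobal = gd.gd(f, df, -2.0, 100, 0.05)  # reaches the global minimum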

If you implemented it well, this should work in multiple dimensions, too:

>>> x, trajectory = gd.gd(lambda x: linalg.norm(x)**2, lambda x: 2*x, array([10,5]), 100, 0.2)
>>> x
array([ 0.0036459 ,  0.00182295])
>>> plot(trajectory)
Our generic linear classifier implementation is in linear.py. It works as follows. We have an interface LossFunction that we want to minimize. A loss function must be able to compute the loss for a pair Y and Yhat, where the former is the truth and the latter is the predictions. It must also be able to compute a gradient when additionally given the data X. These two operations are all you should need.
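To make the interface shape concrete, here is what squared loss looks like in this form (SquaredLoss is already provided in linear.py; the method names below are assumptions, so mirror whatever signatures the provided SquaredLoss actually uses):

from numpy import dot

class MySquaredLoss:
    def loss(self, Y, Yhat):
        # squared error between truth and predictions: 0.5 * ||Yhat - Y||^2
        return 0.5 * dot(Yhat - Y, Yhat - Y)

    def lossGradient(self, X, Y, Yhat):
        # gradient with respect to the weights w, where Yhat = X w: X^T (Yhat - Y)
        return dot(X.T, Yhat - Y)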

There are three loss function stubs: SquaredLoss (which is implemented for you!), LogisticLoss, and HingeLoss (both of which you'll have to implement). My suggestion is to hold off implementing the other two until you have the linear classifier working.

The LinearClassifier class is a stub implementation of a generic linear classifier with an l2 regularizer. It is unbiased, so all you have to take care of are the weights. Your implementation should go in train, which has a handful of stubs. The idea is to just pass appropriate functions to gd and have it do all the work; see the sketch below and the comments inline in the code for more information.
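Here is a hedged sketch of that idea (the names lossFn, lam, and so on are illustrative, not the exact stubs in linear.py):

from numpy import zeros
import gd

def trainSketch(X, Y, lossFn, lam, numIter, stepSize):
    # regularized objective: loss(Y, Xw) + (lam / 2) * ||w||^2
    def objective(w):
        return lossFn.loss(Y, X.dot(w)) + 0.5 * lam * w.dot(w)

    # its gradient: the loss gradient plus lam * w from the regularizer
    def gradient(w):
        return lossFn.lossGradient(X, Y, X.dot(w)) + lam * w

    w, trajectory = gd.gd(objective, gradient, zeros(X.shape[1]), numIter, stepSize)
    return w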

Once you've implemented the function evaluation and gradient, we can test this. We'll begin with a very simple 2D example data set so that we can plot the solutions. We'll also start with no regularizer to help you figure out where errors might be if you have them. (You'll have to import mlGraphics to make this work.)

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 0, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.91, test accuracy 0.86
>>> h
w=array([ 2.73466371, -0.29563932])
>>> mlGraphics.plotLinearClassifier(h, datasets.TwoDAxisAligned.X, datasets.TwoDAxisAligned.Y)
Note that even though this data is clearly linearly separable, the unbiased classifier is unable to perfectly separate it: without a bias term, the decision boundary is forced to pass through the origin.

If we change the regularizer, we'll get a slightly different solution:

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.9, test accuracy 0.86
>>> h
w=array([ 1.30221546, -0.06764756])
As expected, the weights are smaller: the l2 regularizer penalizes their magnitude.

Now, we can try different loss functions. Implement logistic loss and hinge loss. Here are some simple test cases:

>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.33864367,  1.28110942])

>>> h =  linear.LinearClassifier({'lossFunction': linear.HingeLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 1.17110065,  4.67288657])
WU7 (5%): Why does the hinge classifier produce much larger weights than squared loss, even though they all get the same classification performance?

WU8 (5%): For each of the loss functions, train a model on the binary version of the wine data (called WineDataBinary) and evaluate it on the test data. You should use lambda=1 in all cases. Which works best? For that best model, look at the learned weights. Find the words corresponding to the weights with the greatest positive value and those with the greatest negative value (this is like LAB3). Hint: look at WineDataBinary.words to get the id-to-word mapping. List the top 5 positive and top 5 negative and explain.
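If it's useful, here is one way to pull out those words once you have a trained classifier h (hedged: the weights attribute name on the classifier is an assumption; check linear.py):

from numpy import argsort
import datasets

order = argsort(h.weights)   # indices sorted by weight, most negative first
print([datasets.WineDataBinary.words[i] for i in order[:5]])    # top 5 negative
print([datasets.WineDataBinary.words[i] for i in order[-5:]])   # top 5 positive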

Extra Credit: Collective Classification [EC: 20%]

TBA soon!