- Introduction
- Gradient Descent and Linear Classification *[20%]*
- Warm Up with ML Tools *[10%]*
- Reductions for Multiclass Classification *[30%]*
- Ranking or Collective Classification *[40%]*

**Files you'll edit:**

| File | Description |
| --- | --- |
| `gd.py` | Where you will put your gradient descent implementation. |
| `linear.py` | This is where your generic "regularized linear classifier" implementation will go. |

**Files you might want to look at:**

| File | Description |
| --- | --- |
| `binary.py` | Our generic interface for binary classifiers (actually works for regression and other types of classification, too). |
| `cfdata.py` | Includes (in Python format) all the collaborative filtering (course recommendation) data. |
| `datasets.py` | Where a handful of test data sets are stored. |
| `fileMaker.py` | Main helper code for generating files for input to megam, fastdt and libsvm. |
| `mlGraphics.py` | A few useful plotting commands. |
| `pixelExtractor.py` | Basic pixel extraction code. |
| `runClassifier.py` | A few wrappers for doing useful things with classifiers, like training them, generating learning curves, etc. |
| `util.py` | A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look! |
| `wordExtractor.py` | Basic bag-of-words extraction code. |
| `data/*` | All the datasets that we'll use. |

**What to submit:** You will hand in all of the Python files listed above under "Files you'll edit", as well as a `partners.txt` file that lists the **names** and **uids** (first four digits) of all members in your team. Finally, you'll hand in a `writeup.pdf` file that answers all the written questions in this assignment (denoted by **WU#:** in this `.html` file).

**Evaluation:** Your code will be autograded for
technical correctness. Please *do not* change the names of any
provided functions or classes within the code, or you will wreak havoc
on the autograder. However, the correctness of your implementation --
not the autograder's output -- will be the final judge of your score.
If necessary, we will review and grade assignments individually to
ensure that you receive due credit for your work.

**Academic Dishonesty:** We will be checking your code
against other submissions in the class for logical redundancy. If you
copy someone else's code and submit it with minor changes, we will
know. These cheat detectors are quite hard to fool, so please don't
try. We trust you all to submit your own work only; *please*
don't let us down. If you do, we will pursue the strongest
consequences available to us.

**Getting Help:** You are not alone! If you find
yourself stuck on something, contact the course staff for help.
Office hours, class time, and the mailing list are there for your
support; please use them. If you can't make our office hours, let us
know and we will schedule more. We want these projects to be
rewarding and instructional, not frustrating and demoralizing. But,
we don't know when or how to help unless you ask. One more piece of
advice: if you don't know what a variable is, print it out.

In each iteration of gradient descent, we will compute the gradient
and take a step in that direction, with step size `eta`. We
will have an *adaptive* step size, where `eta` is computed
as `stepSize` divided by the square root of the iteration
number (counting from one).
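As a concrete sketch, the loop described above might look like the following. This is a hypothetical implementation, not the official solution; your `gd.py` stub defines the actual interface, but the adaptive step size computation is exactly as described:

```python
import numpy as np

def gd(f, df, x0, numIter, stepSize):
    """Minimize f starting at x0, taking numIter gradient steps.

    Returns the final x and the trajectory of objective values
    (the initial value plus one entry per iteration).
    """
    x = x0
    trajectory = [f(x)]
    for t in range(1, numIter + 1):
        eta = stepSize / np.sqrt(t)   # adaptive step size, counting from one
        x = x - eta * df(x)           # step in the negative gradient direction
        trajectory.append(f(x))
    return x, np.array(trajectory)
```

Tracing the first step on `x^2` from `x0 = 10` with `stepSize = 0.2`: `eta = 0.2`, so `x` becomes `10 - 0.2 * 20 = 6` and the objective drops from 100 to 36, matching the transcript below.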

Once you have an implementation running, we can check it on a simple
example of minimizing the function `x^2`:

```
>>> gd.gd(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
(1.0034641051795872, array([ 100.        ,   36.        ,   18.5153247 ,
         10.95094653,    7.00860578,    4.72540613,    3.30810578,
          2.38344246,    1.75697198,    1.31968118,    1.00694021]))
```

You can see that the "solution" found is about 1, which is not great (it should be zero!), but it's better than the initial value of ten! If yours is going up rather than going down, you probably have a sign error somewhere!

We can let it run longer and plot the trajectory:

```
>>> x, trajectory = gd.gd(lambda x: x**2, lambda x: 2*x, 10, 100, 0.2)
>>> x
0.003645900464603937
>>> plot(trajectory)
```

It's now found a value close to zero, and you can see from the plot that the objective is decreasing.

**WU1:** Find a few values of step size where it converges and a
few values where it diverges. Where does the threshold seem to
be?

**WU2:** Come up with a *non-convex* univariate optimization
problem. Plot the function you're trying to minimize and show two
runs of `gd`, one where it gets caught in a local minimum and
one where it manages to make it to a global minimum. (Use different
starting points to accomplish this.)

If you implemented it well, this should work in multiple dimensions, too:

```
>>> x, trajectory = gd.gd(lambda x: linalg.norm(x)**2, lambda x: 2*x, array([10,5]), 100, 0.2)
>>> x
array([ 0.0036459 ,  0.00182295])
>>> plot(trajectory)
```

Our generic linear classifier implementation is in `linear.py`.

There are three loss function stubs: `SquaredLoss` (which is implemented for you!), `LogisticLoss`, and `HingeLoss` (both of which you'll have to implement). My suggestion is to hold off implementing the other two until you have the linear classifier working.
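For reference, here are the two losses and their derivatives with respect to the prediction, written as plain functions. The function names below are illustrative only; the actual stubs in `linear.py` define their own class interface, so treat this as a reminder of the math rather than code to paste in:

```python
import numpy as np

def logistic_loss(y, yhat):
    # log(1 + exp(-y * yhat)): smooth surrogate for 0/1 loss
    return np.log(1.0 + np.exp(-y * yhat))

def logistic_loss_grad(y, yhat):
    # derivative of the above with respect to yhat
    return -y / (1.0 + np.exp(y * yhat))

def hinge_loss(y, yhat):
    # max(0, 1 - y * yhat): zero once the margin exceeds one
    return np.maximum(0.0, 1.0 - y * yhat)

def hinge_loss_grad(y, yhat):
    # subgradient: -y inside the margin, zero outside
    return np.where(y * yhat < 1.0, -y, 0.0)
```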
The `LinearClassifier` class is a stub implementation of a
generic linear classifier with an l2 regularizer. It
is *unbiased*, so all you have to take care of are the weights.
Your implementation should go in `train`, which has a handful
of stubs. The idea is to just pass appropriate functions
to `gd` and have it do all the work. See the comments inline
in the code for more information.
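The "pass appropriate functions to `gd`" idea can be sketched as a pair of closures over the training data. The names here (`X`, `Y`, `lossFn`, and the `loss`/`lossGradient` methods) are assumptions for illustration, not necessarily the stub's exact interface; check the inline comments in `linear.py` for the real names:

```python
import numpy as np

def make_objective(X, Y, lossFn, lambda_):
    """Build the regularized objective and its gradient for gd.

    X is the (n, d) data matrix, Y the (n,) label vector, lossFn an
    object exposing loss(Y, Yhat) and lossGradient(Y, Yhat), and
    lambda_ the l2 regularization strength.
    """
    def f(w):
        # total loss on the data plus the l2 regularizer
        return np.sum(lossFn.loss(Y, X.dot(w))) + 0.5 * lambda_ * w.dot(w)

    def df(w):
        # chain rule: X^T times d(loss)/d(yhat), plus lambda * w
        return X.T.dot(lossFn.lossGradient(Y, X.dot(w))) + lambda_ * w

    return f, df
```

With this in hand, `train` is essentially one call: `w, trajectory = gd(f, df, zeros(d), numIter, stepSize)`.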

Once you've implemented the function evaluation and gradient, we can
test this. We'll begin with a very simple 2D example data set so that
we can plot the solutions. We'll also start with *no
regularizer* to help you figure out where errors might be if you
have them. (You'll have to import `mlGraphics` to make this
work.)

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 0, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.91, test accuracy 0.86
>>> h
w=array([ 2.73466371, -0.29563932])
>>> mlGraphics.plotLinearClassifier(h, datasets.TwoDAxisAligned.X, datasets.TwoDAxisAligned.Y)
```

Note that even though this data is clearly linearly separable, the unbiased classifier cannot separate it perfectly: without a bias term, the decision boundary is forced to pass through the origin.

If we change the regularizer, we'll get a slightly different solution:

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDAxisAligned)
Training accuracy 0.9, test accuracy 0.86
>>> h
w=array([ 1.30221546, -0.06764756])
```

As expected, the weights are smaller in magnitude under the stronger regularizer.

Now, we can try different loss functions. Implement logistic loss and hinge loss. Here are some simple test cases:

```
>>> h = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.33864367,  1.28110942])
```

```
>>> h = linear.LinearClassifier({'lossFunction': linear.LogisticLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 0.84385774,  3.13132617])
```

```
>>> h = linear.LinearClassifier({'lossFunction': linear.HingeLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
>>> runClassifier.trainTestSet(h, datasets.TwoDDiagonal)
Training accuracy 0.98, test accuracy 0.86
>>> h
w=array([ 1.17110065,  4.67288657])
```

We've provided a simple feature extractor (`wordExtractor.py`)
for the text that first lower-cases everything, removes all
non-alphabetic characters (except spaces), and then treats each word as
a feature. To generate data for megam to distinguish between
comp.graphics and comp.windows.x, run:

```
% python wordExtractor.py megam data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.megam
```

We can do the same to generate test data:

```
% python wordExtractor.py megam data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.megam
```

Here, the arguments are the desired file output type, the data for class -1 and the data for class +1.
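The normalization that `wordExtractor.py` is described as performing can be illustrated in a few lines. This is a sketch of the idea under the description above (lower-case, keep only letters and spaces, one feature per word), not the script's actual code:

```python
import re

def extract_words(text):
    """Turn raw text into a list of word features."""
    text = text.lower()                    # lower-case everything
    text = re.sub(r'[^a-z ]', '', text)    # drop non-alphabetic chars (keep spaces)
    return text.split()                    # each remaining word is a feature

# extract_words("X-Windows, Graphics!") -> ['xwindows', 'graphics']
```

Note that punctuation is deleted rather than replaced by a space, so hyphenated tokens like "X-Windows" collapse into a single word.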

We can now train our classifier:

```
% megam -fvals binary train.megam > model.megam
```

The `-fvals` argument tells it that each feature has a
corresponding feature value (if not given, it assumes features are
binary). The resulting weights are stored in `model.megam`.
It should have run for one hundred iterations and achieved a training
error of 0.00085. We can now make predictions on the test data:

```
% megam -fvals -predict model.megam binary test.megam > predictions.megam
```

You should get a test error rate of 17.7%.

If you inspect the weights file, you should find a bias of 0.329 and different weights for the different words. For instance, "graphics" should have a weight of about 1.09 and "windows" should have a weight of -0.079.
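When inspecting the weights, a tiny helper like the following can save some squinting. It assumes the model file contains one whitespace-separated "featureName weight" pair per line; check your `model.megam` file's actual layout first, since this is a guess at the format rather than a documented parser:

```python
def read_weights(path):
    """Read 'name weight' pairs from a plain-text weights file into a dict.

    Lines that don't split into exactly two fields are skipped, so a
    header or comment line won't crash the parse.
    """
    weights = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                name, w = parts
                weights[name] = float(w)
    return weights
```

Sorting `weights.items()` by value then answers questions like WU4 directly: the first few entries are the most negative weights and the last few are the most positive.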

**WU4:** What are the five features with largest positive weight
and what are the five features with largest negative weight? Do these
seem "right" based on the task?

Next, we'll do the same for decision trees:

```
% python wordExtractor.py fastdt data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.fastdt
% python wordExtractor.py fastdt data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.fastdt
```

And train:

```
% FastDT -maxd 3 train.fastdt > model.fastdt
% FastDT -load model.fastdt test.fastdt > predictions.fastdt
```

Here, you should get a test error rate of 21.5%. If you inspect the `model.fastdt` file, you can see the tree printed in a format quite similar to ours from P1.

**WU5:** Draw the tree. How do the selected features compare to
the features from the logistic regression model? Which features seem
"better" and why? If you use a depth 10 tree, how well do you do on
test data?

Finally, we'll do support vector machines. It's pretty much the same as before:

```
% python wordExtractor.py libsvm data/train.comp.graphics.txt data/train.comp.windows.x.txt > train.libsvm
% python wordExtractor.py libsvm data/test.comp.graphics.txt data/test.comp.windows.x.txt > test.libsvm
```

Now we can train our SVM:

```
% svm-train -t 0 train.libsvm model.libsvm
% svm-predict test.libsvm model.libsvm predictions.libsvm
```

We should be informed that we got an accuracy of about 78.5%.

**Warning:** One of the really annoying things about libsvm
is that features have to be numeric, rather than strings. This means
we maintain a dictionary (stored in

All of the models we looked at in the warm up have different
hyperparameters. For megam, the hyperparameter is the regularization
coefficient, set by "`-lambda ###`" just like in P1. For
FastDT, the hyperparameter is the depth of the tree, set by
"`-maxd`". For libSVM, it is the value of "C", set by "`-c
###`".

**WU6:** Using comp.graphics versus comp.windows.x, plot training
and test error curves for each of the algorithms. For megam, use
lambda values of 2^x for x in -5, -4, ..., 4, 5. For FastDT, use
depths 1 through 20. For libsvm, use C values of 2^x for x in -5, -4,
..., 4, 5. Before actually running these experiments, what do you
expect to happen? What actually does happen?

Next, let's switch to some digits data. We have three digits: 1, 2 and 3, stored in the obviously-named files in the data directory, which just contain pixel values. Use `pixelExtractor.py` to make training data, for example:

```
% python pixelExtractor.py megam data/train.digit1 data/train.digit2 > train.megam
% python pixelExtractor.py megam data/test.digit1 data/test.digit2 > test.megam
% megam -fvals binary train.megam > model.megam
% megam -fvals -predict model.megam binary test.megam > predictions.megam
```

You should get a 7% error rate.

**WU7:** Comparing the performance of the three different
algorithms on the two tasks (text categorization versus digit
recognition), which one(s) perform best on one and which on the other?
Why?

For comparison: libSVM's built-in multiclass implementation (used if you give it multiclass data) is AVA, so you can compare your reduction against that.

**WU8:** For each of the three reductions, run your classifier on
the text classification problem with four classes. For the tree
reduction, make the first split {graphics,windows} versus
{baseball,hockey}. Tune your hyperparameters as well as you can and
report the best results you can for each of the three. Which one
wins? Which one was easiest to tune hyperparameters for?

**WU9:** Change the structure of the tree classifier so that the
first split is {graphics,baseball} versus {windows,hockey}. (Thus,
the hard decision is first, and the easy decisions come second.)
Retune your hyperparameters well. Does this work better or worse than the
previous split, and why?

Hand in your code in `multiclass.tgz`.

No matter which one you do, you may reduce to any system you want of
the ones used in this project (`linear.py`, libSVM, megam,
FastDT). Some of these have features that might be useful for the
different tasks. For instance, libSVM and megam support multiclass
classification internally, so for collective classification, if the
labels on each node in the graph are multiclass, then you can just
reduce to multiclass rather than all the way down to binary (which you
may find easier). Additionally, megam supports different weights
(costs) on examples (search for "`$$$WEIGHT`" in the
documentation), which is useful for ranking. So choose wisely or
you'll end up with lots of extra work!

Some dataset repositories that may or may not have appropriate data:

- http://data.gov/
- http://thedatahub.org/
- http://richard.cyganiak.de/2007/10/lod/
- http://archive.ics.uci.edu/ml/

**WU10a:** You've chosen ranking! First, implement the naive
ranking algorithm (Algs 16 and 17) from the book. Then, implement the
more complex ranking algorithm (Algs 18 and 19) from the book.
Compare their performance. (Note that if your ranking problem isn't
bipartite, you'll have to force it to be bipartite to make the naive
algorithm work: just do something that you think is reasonable to do
this.) How have you defined the cost function (omega) in the complex
model? In all cases, measure your performance according to whatever
metric you like the best, but it should *not* be zero/one loss:
it should be something more appropriate for ranking (F-measure, area
under the curve, etc.). Report on your experience.

**WU10b:** You've chosen collective classification! Implement the
stacking algorithm (Algs 20 and 21) from the book. Apply this to your
problem, and plot the accuracy of your classifier as a function of the
number of levels in the stack. Do you observe that stacking helps?
I.e., does some layer >1 perform better than layer 1? If not, perhaps
you're not using sufficiently helpful features between the layers.
Does the stack ever overfit? Plot your training error versus your
test error as a function of the number of layers, and if you observe
massive overfitting, you might need to do cross-validation to
attenuate this. Report on your experience.

For both of these, I expect about a 1-2 page writeup, including appropriate figures.