CS 726 Project 1: Basic classification


Introduction

In this project, we will learn to predict whether students are likely to take the Artificial Intelligence course (CS 421) or the Computer Graphics course (CS 427) at UMD. We will begin by writing some very simple (and crummy!) predictors, just to help familiarize you with our environment (which will be reused many times throughout the semester, so it will be good to get used to it now!). We will then move on to slightly more complex prediction models, such as decision trees, nearest-neighbor classifiers, and the perceptron. We will also look at how these models fare on different types of problems, but there will be much more of that in Project 2.

The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files (including this description) as a tar archive.

Files you'll edit:
dumbClassifiers.py: This contains a handful of "warm-up" classifiers to get you used to our classification framework.
dt.py: Will be your simple implementation of a decision tree classifier.
knn.py: This is where your nearest-neighbor classifier implementation will go.
perceptron.py: This is where your perceptron classifier implementation will go.
Files you might want to look at:
binary.py: Our generic interface for binary classifiers (actually works for regression and other types of classification, too).
datasets.py: Where a handful of test data sets are stored.
util.py: A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!
runClassifier.py: A few wrappers for doing useful things with classifiers, like training them, generating learning curves, etc.
mlGraphics.py: A few useful plotting commands.

What to submit: You will hand in all of the Python files listed above under "Files you'll edit", as well as a partners.txt file that lists the names and UIDs of all members of your team. Finally, you'll hand in a writeup.pdf file that answers all the written questions in this assignment (denoted by WU#: in this .html file).

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.

Warming Up to Classifiers (10%)

Let's begin our foray into classification by looking at some very simple classifiers. There are three classifiers in dumbClassifiers.py: one is implemented for you; the other two you will need to fill in appropriately.

The already implemented one is AlwaysPredictOne, a classifier that (as its name suggests) always predicts the positive class. We're going to use the TennisData dataset from datasets.py as a running example. So let's start up Python and see how well this classifier does on this data. You should begin by importing util, datasets, binary and dumbClassifiers. Also, be sure you always have from numpy import * and from pylab import * in effect.
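For example (assuming you launch Python from the directory containing the project code), a session might begin with:

>>> from numpy import *
>>> from pylab import *
>>> import util
>>> import datasets
>>> import binary
>>> import dumbClassifiers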

>>> h = dumbClassifiers.AlwaysPredictOne({})
>>> h
AlwaysPredictOne
>>> h.train(datasets.TennisData.X, datasets.TennisData.Y)
>>> h.predictAll(datasets.TennisData.X)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
Indeed, it looks like it's always predicting one!

Now, let's compare these predictions to the truth. Here's a very clever way to compute accuracies (WU1: why is this computation equivalent to computing classification accuracy?):

>>> mean((datasets.TennisData.Y > 0) == (h.predictAll(datasets.TennisData.X) > 0))
0.6428571428571429
That's training accuracy; let's check test accuracy:
>>> mean((datasets.TennisData.Yte > 0) == (h.predictAll(datasets.TennisData.Xte) > 0))
0.5
Okay, so it does pretty badly. That's not surprising; it's really not learning anything!

Now, let's use some of the built-in functionality to help do some of the grunt work for us. You'll need to import runClassifier.

>>> runClassifier.trainTestSet(h, datasets.TennisData)
Training accuracy 0.642857, test accuracy 0.5
Very convenient!

Now, your first implementation task is to fill in the missing functionality in AlwaysPredictMostFrequent. This classifier actually will "learn" something simple. Upon receiving training data, it will simply remember whether +1 is more common or -1 is more common. It will then always predict this label for future data. Once you've implemented this, you can test it:

>>> h = dumbClassifiers.AlwaysPredictMostFrequent({})
>>> runClassifier.trainTestSet(h, datasets.TennisData)
Training accuracy 0.642857, test accuracy 0.5
>>> h
AlwaysPredictMostFrequent(1)
Okay, so it does the same as AlwaysPredictOne, but that's because +1 is more common in that training data. We can see a difference if we change to a different dataset: CFTookAI is a classification problem where we try to predict whether a student has taken AI based on the other classes they've taken.
>>> runClassifier.trainTestSet(dumbClassifiers.AlwaysPredictOne({}), datasets.CFTookAI)
Training accuracy 0.515, test accuracy 0.42
>>> runClassifier.trainTestSet(dumbClassifiers.AlwaysPredictMostFrequent({}), datasets.CFTookAI)
Training accuracy 0.515, test accuracy 0.42
Since the majority class is "1", these do the same here.
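If the counting step in AlwaysPredictMostFrequent isn't obvious, here is a minimal standalone sketch of the idea. This is not the dumbClassifiers.py stub: the function name is made up for illustration, and in your implementation the result would be computed and stored during training, then returned at prediction time.

from numpy import *

def mostFrequentLabel(Y):
    # Y is a numpy array of +1/-1 labels; return whichever label
    # occurs more often (ties broken in favor of +1 here)
    return 1. if sum(Y >= 0) >= sum(Y < 0) else -1.

# e.g., mostFrequentLabel(array([1., -1., -1., 1., -1.])) returns -1.0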

The last dumb classifier we'll implement is FirstFeatureClassifier. This actually does something slightly non-trivial. It looks at the first feature (i.e., X[0]) and uses this to make a prediction. Based on the training data, it figures out the most common class for the case when X[0] > 0 and the most common class for the case when X[0] <= 0. Upon receiving a test point, it checks the value of X[0] and returns the corresponding class. Once you've implemented this, you can check its performance:

>>> runClassifier.trainTestSet(dumbClassifiers.FirstFeatureClassifier({}), datasets.TennisData)
Training accuracy 0.714286, test accuracy 0.666667
>>> runClassifier.trainTestSet(dumbClassifiers.FirstFeatureClassifier({}), datasets.CFTookAI)
Training accuracy 0.515, test accuracy 0.42
>>> runClassifier.trainTestSet(dumbClassifiers.FirstFeatureClassifier({}), datasets.CFTookCG)
Training accuracy 0.545, test accuracy 0.49
(Here, CFTookCG is like CFTookAI but for computer graphics rather than artificial intelligence.)

As we can see, this does better again on TennisData, but doesn't really help on AI.
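In case it helps, here is a rough standalone sketch of the FirstFeatureClassifier logic in plain numpy. Again, this is not the actual stub: the function names are illustrative, and in dumbClassifiers.py the two majority labels would be computed in train, stored on the object, and looked up at prediction time.

from numpy import *

def trainFirstFeature(X, Y):
    # X is an N x D feature matrix, Y an array of +1/-1 labels;
    # compute the majority label on each side of the X[:,0] > 0 split
    # (empty sides and ties default to +1 here)
    posSide, negSide = Y[X[:, 0] > 0], Y[X[:, 0] <= 0]
    labelPos = 1. if sum(posSide >= 0) >= sum(posSide < 0) else -1.
    labelNeg = 1. if sum(negSide >= 0) >= sum(negSide < 0) else -1.
    return labelPos, labelNeg

def predictFirstFeature(x, labelPos, labelNeg):
    # x is a single feature vector: branch on its first feature
    return labelPos if x[0] > 0 else labelNeg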

Decision Trees (40%)

Our next task is to implement a decision tree classifier. There is stub code in dt.py that you should edit. Decision trees are stored as simple data structures. Each node in the tree has a .isLeaf boolean that tells us if this node is a leaf (as opposed to an internal node). Leaf nodes have a .label field that says what class to return at this leaf. Internal nodes have: a .feature value that tells us what feature to split on; a .left tree that tells us what to do when the feature value is less than 0.5; and a .right tree that tells us what to do when the feature value is at least 0.5. To get a sense of how the data structure works, look at the displayTree function that prints out a tree.
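To make the data structure concrete, here is a tiny hand-built example using a bare stand-in class (this is not the class defined in dt.py; it just mirrors the fields described above), together with a recursive function that evaluates such a tree on a feature vector x:

class FakeNode:
    pass

# hand-build a depth-one tree (a "stump") that splits on feature 6
left = FakeNode();  left.isLeaf = True;  left.label = 1.0
right = FakeNode(); right.isLeaf = True; right.label = -1.0
root = FakeNode();  root.isLeaf = False; root.feature = 6
root.left, root.right = left, right

def classify(node, x):
    # walk down the tree until we reach a leaf, then return its label
    if node.isLeaf:
        return node.label
    if x[node.feature] < 0.5:
        return classify(node.left, x)
    else:
        return classify(node.right, x)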

Your first task is to implement the training procedure for decision trees. We've provided a fair amount of the code, which should help you guard against corner cases. (Hint: take a look at util.py for some useful functions for implementing training.) Once you've implemented the training function, we can test it on simple data:

>>> h = dt.DT({'maxDepth': 1})
>>> h
Leaf 1

>>> h.train(datasets.TennisData.X, datasets.TennisData.Y)
>>> h
Branch 6
  Leaf 1.0
  Leaf -1.0
This is for a simple depth-one decision tree (aka a decision stump). If we let it get deeper, we get things like:
>>> h = dt.DT({'maxDepth': 2})
>>> h.train(datasets.TennisData.X, datasets.TennisData.Y)
>>> h
Branch 6
  Branch 7
    Leaf 1.0
    Leaf 1.0
  Branch 1
    Leaf -1.0
    Leaf 1.0

>>> h = dt.DT({'maxDepth': 5})
>>> h.train(datasets.TennisData.X, datasets.TennisData.Y)
>>> h
Branch 6
  Branch 7
    Leaf 1.0
    Branch 2
      Leaf 1.0
      Leaf -1.0
  Branch 1
    Branch 7
      Branch 2
        Leaf -1.0
        Leaf 1.0
      Leaf -1.0
    Leaf 1.0
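If the recursive structure of training is giving you trouble, here is a very rough, generic sketch of the idea. This is deliberately not the dt.py stub: it represents nodes as plain dictionaries, ignores the corner-case handling we've provided, and scores candidate splits by raw classification error. You'll need to adapt the idea to the fields and helper code in the stub.

from numpy import *

def majorityLabel(Y):
    # majority of +1/-1 labels, breaking ties toward +1
    return 1. if sum(Y >= 0) >= sum(Y < 0) else -1.

def trainDT(X, Y, maxDepth):
    # stop when we hit the depth limit or the labels are already pure
    if maxDepth <= 0 or all(Y >= 0) or all(Y < 0):
        return {'isLeaf': True, 'label': majorityLabel(Y)}
    bestFeature, bestError = None, inf
    for f in range(X.shape[1]):
        leftIdx, rightIdx = X[:, f] < 0.5, X[:, f] >= 0.5
        if not any(leftIdx) or not any(rightIdx):
            continue   # this split doesn't separate anything
        # error if each side simply predicts its own majority label
        err = sum(Y[leftIdx] != majorityLabel(Y[leftIdx])) \
            + sum(Y[rightIdx] != majorityLabel(Y[rightIdx]))
        if err < bestError:
            bestFeature, bestError = f, err
    if bestFeature is None:   # no useful split: return a leaf
        return {'isLeaf': True, 'label': majorityLabel(Y)}
    leftIdx, rightIdx = X[:, bestFeature] < 0.5, X[:, bestFeature] >= 0.5
    return {'isLeaf': False, 'feature': bestFeature,
            'left':  trainDT(X[leftIdx],  Y[leftIdx],  maxDepth - 1),
            'right': trainDT(X[rightIdx], Y[rightIdx], maxDepth - 1)}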
Now, you should go implement prediction. This should be easier than training! We can test it by:
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 1}), datasets.TennisData)
Training accuracy 0.714286, test accuracy 1
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 2}), datasets.TennisData)
Training accuracy 0.857143, test accuracy 1
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 3}), datasets.TennisData)
Training accuracy 0.928571, test accuracy 1
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 5}), datasets.TennisData)
Training accuracy 1, test accuracy 1
Now, let's see how well this does on our (computer graphics) recommender data:
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 1}), datasets.CFTookCG)
Training accuracy 0.56, test accuracy 0.48
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 3}), datasets.CFTookCG)
Training accuracy 0.6325, test accuracy 0.5
>>> runClassifier.trainTestSet(dt.DT({'maxDepth': 5}), datasets.CFTookCG)
Training accuracy 0.7475, test accuracy 0.6
Looks like it does better than the dumb classifiers on training data, as well as on test data! Hopefully we can do even better in the future!

We can use more runClassifier functions to generate learning curves and hyperparameter curves:

>>> curve = runClassifier.learningCurveSet(dt.DT({'maxDepth': 5}), datasets.CFTookAI)
[snip]
>>> runClassifier.plotCurve('DT on AI', curve)
This plots training and test accuracy as a function of the number of data points (x-axis) used for training. WU2: We should see training accuracy (roughly) going down and test accuracy (roughly) going up. Why does training accuracy tend to go down? Why is test accuracy not monotonically increasing?

We can also generate similar curves by changing the maximum depth hyperparameter:

>>> curve = runClassifier.hyperparamCurveSet(dt.DT({'maxDepth': 5}), 'maxDepth', [1,2,3,4,5,6,7,8,9,10], datasets.CFTookAI)
[snip]
>>> runClassifier.plotCurve('DT on AI (hyperparameter)', curve)
Now, the x-axis is the value of the maximum depth.

WU3: You should see training accuracy monotonically increasing and test accuracy making a (wavy) hill. Which of these is guaranteed to happen and which is just something we might expect to happen? Why?

WU4: Train a decision tree on the CG data with a maximum depth of 3. If you look in datasets.CFTookCG.courseIds and .courseNames you'll find the corresponding course for each feature. The first feature is a constant-one "bias" feature. Draw out the decision tree for this classifier, but put in the actual course names/ids as the features. Interpret this tree: do these courses seem like they are actually indicative of whether someone might take CG?

Nearest Neighbors (30%)

To get started with geometry-based classification, we will implement a nearest neighbor classifier that supports both KNN classification and epsilon-ball classification. This should go in knn.py. The only function here that you have to do anything about is the predict function, which does all the work.
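If you'd like a starting point, here is a standalone sketch of both prediction rules for a single test point, written with plain numpy. It is only illustrative: the function name and argument list are made up, and your predict in knn.py will need to use whatever training data and hyperparameters the class actually stores.

from numpy import *

def predictOne(trainX, trainY, x, isKNN, K=3, eps=1.0):
    # Euclidean distance from the test point x to every training point
    dists = sqrt(sum((trainX - x) ** 2, axis=1))
    if isKNN:
        # sum the labels of the K nearest training points
        vote = sum(trainY[argsort(dists)[:K]])
    else:
        # sum the labels of all training points within distance eps
        vote = sum(trainY[dists <= eps])
    # ties (and an empty epsilon-ball) default to +1 here
    return 1. if vote >= 0 else -1.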

In order to test your implementation, here are some outputs (I suggest implementing epsilon-balls first, since they're slightly easier):

>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 0.5}), datasets.TennisData)
Training accuracy 1, test accuracy 1
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 1.0}), datasets.TennisData)
Training accuracy 0.857143, test accuracy 0.833333
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 2.0}), datasets.TennisData)
Training accuracy 0.642857, test accuracy 0.5

>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 1}), datasets.TennisData)
Training accuracy 1, test accuracy 1
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 3}), datasets.TennisData)
Training accuracy 0.785714, test accuracy 0.833333
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 5}), datasets.TennisData)
Training accuracy 0.857143, test accuracy 0.833333
You can also try it on the course-recommender data:
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 1.0}), datasets.CFTookAI)
Training accuracy 1, test accuracy 0.42
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 5.0}), datasets.CFTookAI)
Training accuracy 0.5475, test accuracy 0.55
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': False, 'eps': 10.0}), datasets.CFTookAI)
Training accuracy 0.515, test accuracy 0.42

>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 1}), datasets.CFTookAI)
Training accuracy 1, test accuracy 0.51
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 3}), datasets.CFTookAI)
Training accuracy 0.765, test accuracy 0.57
>>> runClassifier.trainTestSet(knn.KNN({'isKNN': True, 'K': 5}), datasets.CFTookAI)
Training accuracy 0.6825, test accuracy 0.61
WU5: For the course recommender data, generate train/test curves for varying values of K and epsilon (you get to figure out good ranges this time). Include those curves: do you see evidence of overfitting and underfitting? Next, using K=5, generate learning curves for this data.

Perceptron (20%)

The last classifier you have to implement is the perceptron; see perceptron.py, where you will have to implement part of the nextExample function to make a perceptron-style update.
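For reference, the core of a perceptron-style update is only a few lines. Here is a sketch, assuming the example is a feature vector x with +1/-1 label y and that the current parameters are a weight vector w and bias b (check perceptron.py for what these are actually called, and note that the stub updates stored parameters rather than returning new ones):

from numpy import *

def perceptronUpdate(w, b, x, y):
    # one perceptron-style update on a single example
    if y * (dot(w, x) + b) <= 0:   # mistake (or zero margin): update
        w = w + y * x              # move the weights toward (or away from) x
        b = b + y                  # shift the bias in the same direction
    return w, b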

Once you've implemented this, the magic in the Binary class will handle training on datasets for you, as long as you specify the number of epochs (passes over the training data) to run:

>>> runClassifier.trainTestSet(perceptron.Perceptron({'numEpoch': 1}), datasets.TennisData)
Training accuracy 0.642857, test accuracy 0.666667
>>> runClassifier.trainTestSet(perceptron.Perceptron({'numEpoch': 2}), datasets.TennisData)
Training accuracy 0.857143, test accuracy 1
You can view its predictions on the two-dimensional data sets:
>>> runClassifier.plotData(datasets.TwoDDiagonal.X, datasets.TwoDDiagonal.Y)
>>> h = perceptron.Perceptron({'numEpoch': 200})
>>> h.train(datasets.TwoDDiagonal.X, datasets.TwoDDiagonal.Y)
>>> h
w=array([  7.3,  18.9]), b=0.0
>>> runClassifier.plotClassifier(array([ 7.3, 18.9]), 0.0)
You should see a linear separator that does a pretty good (but not perfect!) job classifying this data.

Finally, we can try it on the AI data:

>>> runClassifier.trainTestSet(perceptron.Perceptron({'numEpoch': 1}), datasets.CFTookAI)
Training accuracy 0.585, test accuracy 0.59
>>> runClassifier.trainTestSet(perceptron.Perceptron({'numEpoch': 2}), datasets.CFTookAI)
Training accuracy 0.6125, test accuracy 0.5
WU6: Take the best perceptron you've been able to find so far on the AI data. Look at the top five positive weights (those with the highest values) and the top five negative weights (those with the lowest values). Which features do these correspond to? Can you explain why the perceptron might treat these features as the "most indicative"? Why is it hard to interpret "large weight" as "most indicative"? How do these heavily weighted features compare to the features selected by the decision tree?