CMSC 726 Final Project
The purpose of the final project is for you to demonstrate to me that
you learned something this semester. Think of it as being analogous
to a depth-oriented final exam, where you choose the topic that you
want to cover in depth. Here are some guidelines:
- Groups of any size are okay. However, the larger the
group, the more I expect you to do; the smaller the group, the lower
my expectations. In general, I'd like to think that a project plus
two months of extra work could turn into a paper.
- Double-counting projects is okay. So long as you have
permission from other instructors involved, it's fine with me if you
use the same work for projects in more than one class. However, you
must still demonstrate to me that you learned something in this
class. I'm happy if you learned stuff in other classes, but
that's not what I'm evaluating.
- Projects related to your research are okay. But don't
feel obligated to do this: you should do something that you think is
fun and interesting. Hopefully your research is so, but...
- Negative results are okay. It's okay if you try something
and it doesn't work. But like P2, I want you to tell me what you
tried and why you think it didn't work. You can't give up too
easily: most things in research don't work the first time.
- Reimplementing papers is okay. If there's a machine
learning paper you find really interesting and you want to
reimplement their system and try to replicate their results, that's
great. If you can find their code online, then applying it to
something new is fine, modulo the caveat on the next point.
- Using off-the-shelf tools is okay. Again, like P2, I
still want you to demonstrate that you learned something: like how
to use those tools, interpret their results, and use this to build
better models.
What to hand in: You should hand in a PDF write-up that's
approximately four pages in the homework LaTeX format. It's hard for
me to imagine a sufficiently large project leading to a write-up that's
less than two pages, and I don't want to read something that's longer
than eight pages. Somewhere in the four plus-or-minus two range is
probably about right. You may also hand in your code if you want, but
chances are I won't look at it.
Grading: The final project is worth 25% of your grade. 15%
will be based on your write-up and what you did. 5% will be based on
your presentation at the final exam party (see below). 5% will be
based on showing up on time to the final exam party, staying until the
end, and signing the sign-in sheet.
Presentation: You will have roughly 5 minutes to tell the rest
of the class what you did. You should have precisely three slides (no
title slide necessary). You will have to hand in these three slides
ahead of time (in PDF format) so I can merge them into one big PDF for
the presentations. Your slides should be: (1) what is the problem;
(2) how do you solve it; (3) how well did it work and what problems
did you run into.
Due dates: Everything (presentation slides and write-up) is
"technically" due the last day of class (Dec 13, 11am). However, you
get "free late days" for the presentation up until 1 hour before the
final exam date. And your write-up gets "free late days" until the
end of the exam period: Dec 20, 9pm.
Project topic: You're more than welcome to come up with your
own topic. If you want to run it by me ahead of time so that I can
help assure you about what you'll need to do to satisfy the prime
directive of "convince me you learned something," please send me a
brief email with a description of what you want to do, or talk to me
at some point after class or during office hours or whenever you can
catch me. However, if you don't want to come up with your own topic,
I have some canned projects listed below that I find interesting.
(It's perfectly okay for multiple teams to select the same canned
projects.) If you're having trouble choosing, talk to me!
- Comparison of loss functions and regularizers
- Perceptron/linear models with few active features
- New optimization algorithms for linear models
- Optimizing kernel combinations by gradients
- Explaining the predictions of linear models/SVMs
- Boosting to obtain deep neural networks
- PCA versus JL for dimensionality reduction
- Active learning across learning algorithms
- Active learning against big pools
- Domain adaptation with continuous domains
Comparison of loss functions and regularizers: Several people seem to
think that hinge loss leads to better empirical
performance than logistic loss, but that it might be more sensitive to
good regularization. Explicit comparisons are hard because different
tools optimize these models in different ways. Your job would be to
explore this claim empirically. (You could try theoretically, but I
think it's going to be really hard.) What you'll want to do is take a
bunch of classification datasets (e.g., a large subset of the UCI
repository) and train linear models for them with lots of values of a
regularization parameter. The question is: how well do they do for an
"oracle" selection of the hyperparameter, and how sensitive are they
to the selection of this value? For instance, how far from this
optimal value can you go and still be within one percent of optimal
performance (or some other such measure)? You should run all of these
experiments carefully, using the same optimizer throughout (e.g.,
your, or my, project implementation).
(Good for teams of 1-2 people, with reasonable computing resources
and good data analysis minds. Would make a really interesting blog
post, tech report or workshop paper.)
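To make the experimental loop concrete, here is a minimal sketch, assuming synthetic data in place of the UCI datasets and plain full-batch (sub)gradient descent as the shared optimizer. The function names, the regularization grid, and the "within one percentage point of the oracle" sensitivity measure are all illustrative assumptions, not a prescribed protocol.

```python
import numpy as np

def loss_grad(w, X, y, loss):
    """Average (sub)gradient of the loss; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    if loss == "hinge":
        # d/dm max(0, 1 - m) = -1 when m < 1, else 0
        coef = -(margins < 1).astype(float)
    elif loss == "logistic":
        # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); clip to avoid overflow
        coef = -1.0 / (1.0 + np.exp(np.clip(margins, -50, 50)))
    else:
        raise ValueError(f"unknown loss: {loss}")
    return (X * (coef * y)[:, None]).mean(axis=0)

def train(X, y, loss, lam, epochs=200, lr=0.1):
    """Same optimizer for both losses: (sub)gradient descent with an L2 penalty."""
    w = np.zeros(X.shape[1])
    for t in range(epochs):
        g = loss_grad(w, X, y, loss) + lam * w
        w -= lr / np.sqrt(t + 1) * g
    return w

def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))

def sweep(X_tr, y_tr, X_te, y_te, lams):
    """For each loss: oracle accuracy and how many grid values come close to it."""
    results = {}
    for loss in ("hinge", "logistic"):
        accs = np.array([accuracy(train(X_tr, y_tr, loss, lam), X_te, y_te)
                         for lam in lams])
        best = accs.max()
        results[loss] = {
            "accs": accs,
            "oracle_acc": float(best),
            # assumed sensitivity measure: grid points within 0.01 of the oracle
            "n_within_1pct": int(np.sum(accs >= best - 0.01)),
        }
    return results

# Synthetic stand-in for a real dataset (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
w_true = rng.normal(size=10)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=400))
lams = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
results = sweep(X[:300], y[:300], X[300:], y[300:], lams)
for loss, r in results.items():
    print(loss, r["oracle_acc"], r["n_within_1pct"])
```

The point of sharing `train` between the two losses is that any difference in the curves then comes from the loss and the regularizer, not from the optimizer.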
Perceptron/linear models with few active features: You can achieve
sparsity in linear models using an L1 regularizer.
But this sparsity simply means the weight vector will be sparse. It
doesn't say anything about how many features will be used at prediction
time. Suppose that you want to learn a linear model that behaves as
follows. When a new example comes in, it computes w_{d}x_{d} for each
feature d. But instead of the prediction being the sum of all of these,
it's just the sum of the top 5 in absolute value. This would make it
much easier to explain the behavior of your linear model, because it
really would only use 5 features to make a classification decision! You
should study
(a) what happens if you learn a regular linear model, for instance
with a perceptron, and then simply apply this heuristic at test time;
(b) try to modify the perceptron algorithm so that it makes updates
according to this rule, perhaps by a subgradient-like argument, and
see if you can do better. If you're up for it, maybe prove a
perceptron convergence theorem, though I suspect this is tricky.
(Good for teams of 1-2 people. Could be an interesting paper,
but that would hinge on being able to get some non-trivial
theoretical results out.)
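Part (a) above can be sketched as follows: train an ordinary perceptron, then predict with only the k largest-magnitude products w_{d}x_{d}. This is a minimal sketch on synthetic data; the names `topk_scores` and `train_perceptron` are hypothetical, and nothing here addresses part (b).

```python
import numpy as np

def topk_scores(w, X, k=5):
    """For each row x of X, sum the k products w_d * x_d largest in absolute value."""
    contrib = X * w                                   # shape (n, d)
    idx = np.argsort(-np.abs(contrib), axis=1)[:, :k]  # top-k columns per row
    return np.take_along_axis(contrib, idx, axis=1).sum(axis=1)

def train_perceptron(X, y, epochs=10):
    """Standard perceptron with labels in {-1, +1}; uses all features."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i
    return w

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
w_true = rng.normal(size=20)
y = np.sign(X @ w_true)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

w = train_perceptron(X_tr, y_tr)
acc_full = np.mean(np.sign(X_te @ w) == y_te)
acc_top5 = np.mean(np.sign(topk_scores(w, X_te, k=5)) == y_te)
print("all features:", acc_full, " top-5 only:", acc_top5)
```

For part (b), one natural starting point would be to run the mistake check inside the training loop on `topk_scores` rather than on the full dot product; how to choose the update from there is the open part of the project.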
New optimization algorithms for linear models: For those of you in my
Optimization in ML seminar, you now know about a bunch of cool
optimization
algorithms. I'm particularly thinking of Barzilai and Borwein,
Nesterov 2009, Nemirovski 2009 and Bertsekas 2009. Implement some
subset of these algorithms and compare to vanilla (sub)gradient
descent or stochastic (sub)gradient descent. Are the improved
convergence rates that you see in the optimization theory borne out
in practice? Can you convince me to stop using simple stochastic
(sub)gradient descent in all my implementations?
(Good for teams of K-many people where you are implementing K
of these algorithms. Could be an interesting tech-report or
workshop paper.)
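As a starting point for such a comparison, here is a minimal sketch of the Barzilai-Borwein step size (the s's / s'y variant) against fixed-step gradient descent. The test problem, an ill-conditioned ridge regression on synthetic data, and all function names are assumptions chosen so that the example is smooth and easy to check; the real project would use the classification losses from class.

```python
import numpy as np

def make_problem(n=200, d=20, lam=1e-2, seed=0):
    """Ridge regression with badly scaled columns (synthetic, for illustration)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d)) * np.logspace(0, 1, d)
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def f(w):
        r = X @ w - y
        return 0.5 * (r @ r) / n + 0.5 * lam * (w @ w)

    def grad(w):
        return X.T @ (X @ w - y) / n + lam * w

    # Lipschitz constant of the gradient: largest Hessian eigenvalue.
    L = np.linalg.eigvalsh(X.T @ X / n).max() + lam
    return f, grad, L, d

def gradient_descent(f, grad, w0, step, iters):
    w, history = w0.copy(), [f(w0)]
    for _ in range(iters):
        w = w - step * grad(w)
        history.append(f(w))
    return w, history

def barzilai_borwein(f, grad, w0, step0, iters):
    w, g, step = w0.copy(), grad(w0), step0
    history = [f(w)]
    for _ in range(iters):
        w_new = w - step * g
        g_new = grad(w_new)
        s, yv = w_new - w, g_new - g
        sy = s @ yv
        if sy > 1e-12:              # keep the old step if curvature is unusable
            step = (s @ s) / sy     # BB step size: s's / s'y
        w, g = w_new, g_new
        history.append(f(w))        # note: BB is not monotone in f
    return w, history

f, grad, L, d = make_problem()
w0 = np.zeros(d)
_, gd_hist = gradient_descent(f, grad, w0, step=1.0 / L, iters=200)
_, bb_hist = barzilai_borwein(f, grad, w0, step0=1.0 / L, iters=200)
print("GD final objective:", gd_hist[-1])
print("BB best objective: ", min(bb_hist))
```

Because the Barzilai-Borwein iterates can temporarily increase the objective, plotting the whole `history` (not just the last value) gives a fairer picture when comparing against theory.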
Optimizing kernel combinations by gradients: Consider a positive linear
combination of kernels
K(x,z)=a_{1}K_{1}(x,z)+a_{2}K_{2}(x,z)+...+a_{M}K_{M}(x,z).
It would be great if you could automatically tune these a_{m}
values while you were learning, for instance using gradient
steps. Of course, you cannot do this on the training data or you'll
overfit massively! However, you can tune them by gradient
descent on held-out data, simultaneously to doing gradient descent for
the model parameters on the training data. You can actually apply the
same trick to tuning the regularization hyperparameter and even
parameters of the kernels, like gamma in an RBF kernel. Or at least I
think you can: you should try it out, and we can talk about the details.
(Good for teams of 1-2 people for the basic stuff, or 3-4
people if you want to go for the regularization parameter and kernel
parameters as well. Could be an interesting empirical paper if you
can find a good domain on which to apply it.)
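One way to see what "gradient on held-out data" means for the a_{m} is the following sketch. It makes a simplifying assumption that differs from the simultaneous scheme described above: the model is kernel ridge regression, so the training problem is solved exactly for each setting of a, and only the a_{m} are updated by (projected) gradient steps on the held-out squared error. The data, the RBF bandwidths, and the function names are hypothetical.

```python
import numpy as np

def rbf(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def heldout_loss_and_grad(a, Ks_tr, Ks_va, y_tr, y_va, lam=1e-1):
    """Held-out squared error of kernel ridge regression and its gradient in a."""
    K_tr = sum(am * K for am, K in zip(a, Ks_tr))
    K_va = sum(am * K for am, K in zip(a, Ks_va))
    A = K_tr + lam * np.eye(len(y_tr))
    alpha = np.linalg.solve(A, y_tr)           # model parameters given a
    resid = K_va @ alpha - y_va
    loss = 0.5 * np.mean(resid**2)
    r = resid / len(y_va)                      # d loss / d predictions
    v = np.linalg.solve(A, K_va.T @ r)         # A is symmetric
    # d pred / d a_m = K_va_m alpha - K_va A^{-1} K_tr_m alpha
    grad = np.array([r @ (Kv @ alpha) - v @ (Kt @ alpha)
                     for Kt, Kv in zip(Ks_tr, Ks_va)])
    return loss, grad

# Synthetic 1-D regression problem (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(120, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=120)
x_tr, y_tr, x_va, y_va = x[:80], y[:80], x[80:], y[80:]
gammas = [0.1, 1.0, 10.0]
Ks_tr = [rbf(x_tr, x_tr, g) for g in gammas]
Ks_va = [rbf(x_va, x_tr, g) for g in gammas]

a = np.ones(len(gammas)) / len(gammas)
for _ in range(100):
    loss, grad = heldout_loss_and_grad(a, Ks_tr, Ks_va, y_tr, y_va)
    a = np.maximum(a - 0.5 * grad, 0.0)        # project to keep the combination positive
print("kernel weights:", a, " held-out loss:", loss)
```

A held-out set that is tuned on this way is no longer a clean test set, so a third split would be needed to report final numbers.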
Explaining the predictions of linear models/SVMs: Linear models and SVMs
get awesome predictive performance, but it's
very hard to explain the decisions that they make. It would be nice,
when such a model makes an error, to explain why. For instance, it
might not have seen enough examples like the test example: you could
test this by adding the test example to the training data, retraining,
and checking that you don't incur any new training error. Perhaps the
training error goes down: the examples on which it does might be
interesting for a person to look at. Or, perhaps there is
contradictory evidence in the training data: retraining would lead to
new training errors. Can you point to those training examples? Both
of these basically get at the question of: how much would the model
have to move to get this example correct, and why didn't it do that
already?
(Good for teams of 1-2 people who care about some particular
application problem that they could try it on and be able to
interpret the results. Would probably be a cool paper in your
application area.)
Boosting to obtain deep neural networks: Fact 1: boosting decision stumps
leads to linear models. Fact 2:
boosting linear models leads to two-layer neural networks. Apply
induction. This suggests an algorithm: boost some stumps for a while
until you get bored, then stop and call that thing your first hidden
unit. Start boosting again on the residuals (this will require a
multi-level boost!). Once you're done with this, stop and continue.
The precise algorithm needs to be developed a bit, but it could be a
really cool way to train a deep model. There's obviously an
architecture selection issue, but that's there for normal neural
networks, too!
(Good for teams of 1-2 people if you go for just a two layer
thing, or 3-4 people if you go for the deep thing. If you can get
it to work, it could be a cool paper, but would need to be compared
to "standard" deep neural networks which might be a pain.)
PCA versus JL for dimensionality reduction: PCA and random projections
à la Johnson-Lindenstrauss both give ways of
doing linear dimensionality reduction. PCA is data-based and can
capture information in the data, but can be led astray by an
adversary. JL-style projections are independent of the data, but
perhaps need more dimensions in the projected space to do a good job.
This is especially true when your projection, your data, or both are
sparse. (I.e.,
there are algorithms for sparse PCA, essentially PCA with an L1
regularizer; and sparse JL, where the random matrix is sparse. But
also your data vectors can be sparse.) Compare these empirically,
especially in the sparse cases, and see how many dimensions you
actually need and whether we can stop running PCA or not.
(Good for teams of 1-2 people if you stick to basic JL and
basic PCA; good for 3-4 people if you try sparse methods too.)
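A basic version of this comparison can be sketched as follows, assuming dense Gaussian data, a dense Gaussian JL matrix, and relative distortion of pairwise squared distances as the quality measure. All of these are illustrative choices; the sparse variants and a downstream classification metric are left to the project.

```python
import numpy as np

def pca_project(X, k):
    """Project centered data onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def jl_project(X, k, rng):
    """Data-independent Gaussian random projection, scaled by 1/sqrt(k)."""
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

def pairwise_sq_dists(Z):
    sq = (Z**2).sum(1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

def mean_distortion(X, Z):
    """Mean relative error of pairwise squared distances after projection."""
    iu = np.triu_indices(len(X), k=1)
    d_orig = pairwise_sq_dists(X)[iu]
    d_proj = pairwise_sq_dists(Z)[iu]
    return float(np.mean(np.abs(d_proj / d_orig - 1.0)))

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
for k in (5, 20, 50):
    print(k,
          "PCA:", mean_distortion(X, pca_project(X, k)),
          "JL:", mean_distortion(X, jl_project(X, k, rng)))
```

Isotropic Gaussian data like this has no low-dimensional structure for PCA to exploit, so it is close to a worst case for PCA; real datasets with fast-decaying spectra are where the comparison becomes interesting.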
Active learning across learning algorithms: Active learning is great if
your goal is a classifier. However, if
your goal is a data set and you might switch classifiers later, this
could be bad. The badness stems from the fact that your data set will
be constructed so that the classifier used to select it does well. If you switch
classifiers later, this could be trouble. Check whether this holds
empirically. Can you safely switch classifiers? Can you mitigate the
issue by using different types of learning algorithms at active
learning time, or does this make active learning not help anymore?
(Good for teams of 1-2 people for the first question and teams
of 3-4 people to get the second question too. Could make an
interesting empirical paper with some extra work.)
Active learning against big pools: All active learning results in papers
that I know of run active
learning on relatively small pools of data. What if you have
essentially an infinite supply of unlabeled data? You can do this by
considering classification tasks for which data is "free," like trying
to disambiguate between "there," "their" and "they're" in text. I
suspect that active learning algorithms will actually do worse
when they have an infinite amount of data to work with, but I could be
wrong (I suspect there are more outliers). You can check to see if
I'm right or not.
(Good for teams of 1-2 people who have access to good computing
resources. Would make an interesting tech report or workshop
paper.)
Domain adaptation with continuous domains: The feature augmentation
technique for domain adaptation replaces each
example x in the source domain with <x,x,0> and every example x
in the target domain with <x,0,x>. In kernel space this amounts
to doubling the kernel value between points in the same domain. This
assumes domains are discrete. What if they are continuous? I.e.,
what if instead of having a domain, you have a bunch of domain
features with each example? Call these z. So each example is a
pair (x,z) with a label y. You can imagine a bilinear model for which
you learn weights on z and separate weights on x and classify as a
product. This can also be kernelized in a reasonable way and there
are lots of possibilities for how to optimize it (since it's
non-convex).
(Good for teams of 1-2 people. Could definitely be an applied
conference paper somewhere.)
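The discrete case described above is easy to check numerically, and the check suggests one candidate kernel for the continuous case. In this minimal sketch, `augment` implements the <x,x,0> / <x,0,x> mapping with a linear kernel. The function `continuous_kernel`, which multiplies the x-kernel by (1 + z.z'), is an assumption offered only as a possible starting point: it reproduces the doubling behavior when z is a one-hot domain indicator, but it is not claimed to be the right model.

```python
import numpy as np

def augment(x, domain):
    """Feature augmentation: source -> <x, x, 0>, target -> <x, 0, x>."""
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    if domain == "target":
        return np.concatenate([x, zeros, x])
    raise ValueError(f"unknown domain: {domain}")

def continuous_kernel(x1, z1, x2, z2):
    """Hypothetical generalization: (1 + z1.z2) * (x1.x2) with domain features z."""
    return (1.0 + z1 @ z2) * (x1 @ x2)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5), rng.normal(size=5)

same = augment(x1, "source") @ augment(x2, "source")
cross = augment(x1, "source") @ augment(x2, "target")
print(same / (x1 @ x2), cross / (x1 @ x2))   # ratios to the plain linear kernel

# One-hot domain indicators recover the discrete case.
src, tgt = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(continuous_kernel(x1, src, x2, src), continuous_kernel(x1, src, x2, tgt))
```

With real-valued z, the factor (1 + z.z') can go negative, so positive semi-definiteness would need to be checked (or a different kernel on z used) before plugging this into an SVM.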