- Introduction
- PCA and Kernel PCA *[20%]*
- HMMs: Viterbi *[30%]*
- HMMs: Forward Backward *[50%]*
- Gaussian Mixture Models *[30%]*
- Manifold Learning *[50%]*

In this project, we will explore two algorithms for dimensionality reduction (PCA and kernel PCA), hidden Markov models, and Gaussian mixture models. You can download all the files here.

**Files you'll edit:**

| File | Description |
| --- | --- |
| `dr.py` | You will implement PCA and kernel PCA here. |
| `hmm.py` | You will implement code for hidden Markov models here. |
| `clustering.py` | You will implement GMM clustering here (baseline k-means is provided). |

**Files you might want to look at:**

| File | Description |
| --- | --- |
| `datasets.py` | Includes (in Python format) some simple toy data sets. |
| `digits` | Digits data. |
| `kernel.py` | Some basic kernels. |
| `util.py` | Some helpful utilities, including plotting functions. |

**What to submit:** You will hand in all of the Python files listed above under "Files you'll edit", as well as a `partners.txt` file that lists the **names** and **uids** (first four digits) of all members of your team. Finally, you'll hand in a `writeup.pdf` file that answers all the written questions in this assignment (denoted by **WU#:** in this `.html` file).

**Evaluation:** Your code will be autograded for
technical correctness. Please *do not* change the names of any
provided functions or classes within the code, or you will wreak havoc
on the autograder. However, the correctness of your implementation --
not the autograder's output -- will be the final judge of your score.
If necessary, we will review and grade assignments individually to
ensure that you receive due credit for your work.

**Academic Dishonesty:** We will be checking your code
against other submissions in the class for logical redundancy. If you
copy someone else's code and submit it with minor changes, we will
know. These cheat detectors are quite hard to fool, so please don't
try. We trust you all to submit your own work only; *please*
don't let us down. If you do, we will pursue the strongest
consequences available to us.

**Getting Help:** You are not alone! If you find
yourself stuck on something, contact the course staff for help.
Office hours, class time, and the mailing list are there for your
support; please use them. If you can't make our office hours, let us
know and we will schedule more. We want these projects to be
rewarding and instructional, not frustrating and demoralizing. But,
we don't know when or how to help unless you ask. One more piece of
advice: if you don't know what a variable is, print it out.

Our first test of PCA will be on Gaussian data with a known covariance
matrix. First, let's generate some data and see how it looks, and see
what the *sample covariance* is:

```
>>> Si = util.sqrtm(array([[3,2],[2,4]]))
>>> x = dot(randn(1000,2), Si)
>>> plot(x[:,0], x[:,1], 'b.')
>>> dot(x.T,x) / real(x.shape[0])
array([[ 2.88360146,  2.05144774],
       [ 2.05144774,  4.05987148]])
```

(Note: the reason we have to take a matrix square root of the covariance is that Gaussian samples are transformed by standard deviations, not by covariances.)
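To see this concretely, here's a quick sanity check, assuming the same pylab-style imports used throughout this page and that `util.sqrtm` returns a symmetric square root:

```python
# A minimal check that Si really is a square root of the covariance:
# if x = z Si with z standard normal, then cov(x) = Si^T Si.
C = array([[3, 2], [2, 4]])
Si = util.sqrtm(C)
print(dot(Si.T, Si))   # should be approximately [[3, 2], [2, 4]]
```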

Note that the sample covariance of the data is almost exactly the true
covariance of the data. If you run this with 100,000 data points
(instead of 1000), you should get something even closer to
`[[3,2],[2,4]]`.

Now, let's run PCA on this data. We basically know what should happen, but let's make sure it happens anyway.

```
>>> (P,Z,evals) = dr.pca(x, 2)
>>> Z
array([[ 0.57546631, -0.81782549],
       [-0.81782549, -0.57546631]])
>>> evals
array([ 5.2620058 ,  1.25255969])
```

This tells us that the largest eigenvalue corresponds to the
direction `[0.57, -0.82]` and the second largest corresponds to
the direction `[-0.82, -0.57]`. We can project the data onto
the first eigenvector and plot it in red, and onto the second
eigenvector in green. (Unfortunately we have to do some ugly reshaping
to get dimensions to match up.)

```
>>> x0 = dot(dot(x, Z[0,:]).reshape(1000,1), Z[0,:].reshape(1,2))
>>> x1 = dot(dot(x, Z[1,:]).reshape(1000,1), Z[1,:].reshape(1,2))
>>> plot(x[:,0], x[:,1], 'b.', x0[:,0], x0[:,1], 'r.', x1[:,0], x1[:,1], 'g.')
```
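If the reshaping bothers you, an equivalent formulation uses `outer`; this is a sketch that should produce the same plot:

```python
# Rank-1 reconstruction of each point along a single eigenvector, using
# outer products instead of reshapes (equivalent to the lines above):
x0 = outer(dot(x, Z[0,:]), Z[0,:])
x1 = outer(dot(x, Z[1,:]), Z[1,:])
```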

**WU1:** Depending exactly on your random data, one or more of
these lines might not pass exactly through the data as we would like
it to. Why not?

Now, back to digits data. Let's look at some "eigendigits."

```
>>> (X,Y) = datasets.loadDigits()
>>> (P,Z,evals) = dr.pca(X, 784)
>>> evals
array([ 0.05465988,  0.04320249,  0.03914405,  0.03072822,  0.02969435, .....
```

(Warning: this takes about a minute to compute for me.) Eventually the eigenvalues drop to zero.

**WU2:** Plot the normalized eigenvalues (include the plot in your
writeup). How many eigenvectors do you have to include before you've
accounted for 90% of the variance? 95%? (Hint: see
function `cumsum`.)
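As a starting point, here is a minimal sketch of the kind of computation WU2 asks for, assuming `evals` is the vector returned by `dr.pca` above:

```python
# Cumulative fraction of total variance captured by the top-k eigenvalues:
frac = cumsum(evals) / sum(evals)
# Smallest k accounting for at least 90% (resp. 95%) of the variance:
k90 = argmax(frac >= 0.90) + 1
k95 = argmax(frac >= 0.95) + 1
```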

Now, let's plot the top 50 eigenvectors:

```
>>> util.drawDigits(Z[0:50,:], arange(50))
```

Next, you need to implement Kernel PCA. We can first try this on our simple 2d data with known covariance and a linear kernel:

```
>>> Si = util.sqrtm(array([[3,2],[2,4]]))
>>> x = dot(randn(1000,2), Si)
>>> (P, alpha, evals) = dr.kpca(x, 2, kernel.linear)
>>> evals
array([  4.00434172e+08,   5.46598996e+01])
>>> alpha
array([[  3.16227743e-02,   8.31483667e-02,  -2.75562562e-02, ...,
         -2.93759139e-03,  -2.93759139e-03,  -5.91730489e-03],
       [  3.16227802e-02,   5.05064932e-03,   1.38360913e-02, ...,
         -8.02854635e-05,  -8.02854635e-05,  -2.06104070e-04]])
```

Now, let's try with some data that vanilla PCA will find difficult:

```
>>> (a,b) = datasets.makeKPCAdata()
>>> plot(a[:,0], a[:,1], 'b.', b[:,0], b[:,1], 'r.')
>>> x = vstack((a,b))
>>> (P,Z,evals) = dr.pca(x, 2)
>>> Z
array([[ 0.87703838,  0.48042032],
       [-0.48042032,  0.87703838]])
>>> evals
array([ 6.26494952,  5.72135994])
```

**WU4:** Why does vanilla PCA find this data difficult? What is
the significance of the fact that both eigenvalues are relatively
large here?

Now, let's look at the projected data:

```
>>> Pa = P[0:a.shape[0],:]
>>> Pb = P[a.shape[0]:,:]
>>> plot(Pa[:,0], randn(Pa.shape[0]), 'b.', Pb[:,0], randn(Pb.shape[0]), 'r.')
```

Here, we've added a bit of random noise to the Y-axis so that the points don't all lie on top of one another.

**WU5:** Did PCA do what we might want it to? Why or why not?
Include the plot to justify your answer.

Now, let's use some kernels.

```
>>> (P,alpha,evals) = dr.kpca(x, 2, kernel.rbf1)
>>> evals
array([  3.55250103e+07,   7.28020391e+01])
>>> Pa = P[0:a.shape[0],:]
>>> Pb = P[a.shape[0]:,:]
>>> plot(Pa[:,0], Pa[:,1], 'b.', Pb[:,0], Pb[:,1], 'r.')
```

**WU6:** How do the eigenvalues here compare to the linear case?
What does this tell you? How does the plot look? How might this be
useful for supervised learning?

**WU7:** Experiment with different kernels, and perhaps
interpolations of different kernels. Try to find a kernel that gets
as much of the variance on the first two principal components as
possible. Report your kernel and a plot of the data projected into 2d
under that kernel.
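One way to interpolate kernels is a convex combination: a nonnegative weighted sum of kernels is itself a valid kernel. Here is a sketch, assuming the kernels in `kernel.py` take two vectors (`lam` is a hypothetical mixing weight):

```python
# Hypothetical interpolated kernel: lam * linear + (1 - lam) * RBF.
def mixed(lam):
    return lambda u, v: lam * kernel.linear(u, v) + (1.0 - lam) * kernel.rbf1(u, v)

(P, alpha, evals) = dr.kpca(x, 2, mixed(0.5))
```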

Now, on to hidden Markov models. We'll begin by testing Viterbi on the data from HW11:

```
>>> (a,b,pi) = datasets.getHMMData()
>>> a
array([[ 0.66666667,  0.33333333],
       [ 0.5       ,  0.5       ]])
>>> b
array([[ 0.66666667,  0.25      ,  0.08333333],
       [ 0.25      ,  0.25      ,  0.5       ]])
>>> pi
array([ 0.5,  0.5])
>>> hmm.viterbi(array([0,1,1,2]), a, b, pi)
array([0, 0, 0, 1])
>>> hmm.viterbi(array([0,2,1,2]), a, b, pi)
array([0, 1, 1, 1])
```

The observation sequences are Hot Cold Cold Wet and Hot Wet Cold Wet, respectively. The inferred latent state sequences are G G G B and G B B B, respectively.

*If you need help debugging:* I added statements to
print `al` and `ze` at the end of the `viterbi` function (here `al` is
the table of best log probabilities and `ze` the table of backpointers).
Here's the output I get:

```
>>> hmm.viterbi(array([0,1,1,2]), a, b, pi)
[[-0.69314718 -1.5040774  -3.29583687 -5.08759634 -7.97796809]
 [-0.69314718 -2.77258872 -3.98898405 -5.78074352 -6.8793558 ]]
[[-1  0  0  0  1]
 [-1  0  0  0  1]]
array([0, 0, 0, 1])
>>> hmm.viterbi(array([0,2,1,2]), a, b, pi)
[[-0.69314718 -1.5040774  -4.39444915 -5.37527841 -8.26565017]
 [-0.69314718 -2.77258872 -3.29583687 -5.37527841 -6.76157277]]
[[-1  0  1  0  1]
 [-1  0  1  1  1]]
array([0, 1, 1, 1])
```

Note that Viterbi did something magical: when we changed the second
observation from Cold to Wet, both the second and third latent states
changed! **WU8:** Find two *different* observation sequences
of length four that differ in only one place, but whose guessed
outputs differ in more than one place.

One thing that becomes a problem for forward-backward, but isn't a
problem for Viterbi, has to do with log probabilities. In Viterbi, you
had to compute the log probability that you came from state 0 and
the log probability that you came from state 1, and then just take the
max of the two. In forward-backward, you actually have to *add*
the probabilities. But how can you do this when they're stored as log
probabilities?! A magical function, `util.addLog`, will do this
for you. Let's see how it works:

```
>>> 0.5 + 0.25
0.75
>>> log(0.75)
-0.2876820724517809
>>> util.addLog( log(0.5) , log(0.25) )
-0.2876820724517809
```

So, keep in mind: you replace probability multiplication with log
probability addition, you replace one with zero and zero
with `-inf`, and you can now replace probability addition with log
probability `addLog`ition.
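If you're curious what's under the hood, `addLog` is presumably the standard log-sum-exp trick; here's a minimal sketch (the provided implementation may differ):

```python
def addLogSketch(a, b):
    # Compute log(exp(a) + exp(b)) without overflow by factoring out the
    # larger argument (a sketch of what util.addLog presumably does).
    if a == -inf: return b
    if b == -inf: return a
    m = max(a, b)
    return m + log(exp(a - m) + exp(b - m))
```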

Now, armed with this new tool, let's implement the forward algorithm. Once you've got a solution, here's some debugging help based on the same data we had from the Viterbi case:

```
>>> hmm.forward(array([0,1,1,2]), a, b, pi)
array([[-0.69314718, -1.25624123, -2.67140358, -4.06259134, -5.56850996],
       [-0.69314718, -1.75093747, -3.09162132, -4.47051216, -5.70230466]])
>>> hmm.forward(array([0,2,1,2]), a, b, pi)
array([[-0.69314718, -1.25624123, -2.82648449, -4.11756738, -5.58815463],
       [-0.69314718, -1.75093747, -2.96983593, -4.47862365, -5.71699198]])
```

From there, you can move on to the backward algorithm. Here's some similar debugging help:

```
>>> hmm.backward(array([0,1,1,2]), a, b, pi)
array([[-4.56743938, -4.17757522, -2.89037176, -2.48490665,  0.        ],
       [-5.5405585 , -4.13148411, -2.61843804, -0.69314718,  0.        ]])
>>> hmm.backward(array([0,2,1,2]), a, b, pi)
array([[-4.6660574 , -5.27618751, -2.89037176, -2.48490665,  0.        ],
       [-5.37008359, -3.43833693, -2.61843804, -0.69314718,  0.        ]])
```

I've provided a sanity check that combines the two:

```
>>> al = hmm.forward(array([0,1,1,2]), a, b, pi)
>>> be = hmm.backward(array([0,1,1,2]), a, b, pi)
>>> hmm.sanityCheck(al, be)
```

If it prints something, that's bad!
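The check presumably relies on the invariant that, at every time step, combining the forward and backward tables gives the same total log probability of the data. A hypothetical version of that check for our two-state model:

```python
# For each time t, addLog over states of al[i,t] + be[i,t] should be the
# same constant (the total log probability of the observation sequence):
totals = [util.addLog(al[0,t] + be[0,t], al[1,t] + be[1,t])
          for t in range(al.shape[1])]
print(totals)   # all entries should agree up to floating-point noise
```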

Finally, we have to implement parameter re-estimation based on the
forward and backward tables. This goes in `reestimate`.
Again, you'll need to use the magic of `addLog` to make this
work. At the end, `pi` should contain *log probabilities*
for the pi value. The `normalizeLog` function will turn an
unnormalized vector of log probabilities into a normalized vector of
probabilities. Here's an example of how it works:

```
>>> v = log(array([1,2,2,3,2]))
>>> v
array([ 0.        ,  0.69314718,  0.69314718,  1.09861229,  0.69314718])
>>> util.normalizeLog(v)
array([ 0.1,  0.2,  0.2,  0.3,  0.2])
```

What it does is `addLog` all the values in the vector, use the result
as a normalization constant, and turn the result into a bunch of
probabilities.
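Concretely, here is a sketch of the behavior just described (the provided `util.normalizeLog` may be implemented differently):

```python
def normalizeLogSketch(v):
    # Treat v as unnormalized log probabilities: compute the log of the
    # normalization constant stably, then exponentiate the difference.
    m = v.max()
    logZ = m + log(sum(exp(v - m)))
    return exp(v - logZ)
```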

*Big hint:* The notes, the book, and other references make a
big deal out of re-estimating values as a ratio: the expected counts
you care about divided by the total expected counts. It's
not worth computing the denominators. Just compute the numerators and
let `normalizeLog` ensure that these probabilities sum to one.
(This is all the denominator is doing, anyway.)
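For instance, for the initial-state probabilities: in log space, the expected count of starting in state i is `al[i,0] + be[i,0]`, so the numerator-only style the hint suggests looks like this sketch (just the pi piece, not the whole of `reestimate`):

```python
# Numerators only: expected (log) counts of starting in each state;
# normalizeLog supplies the denominator.
pi_new = util.normalizeLog(al[:,0] + be[:,0])
```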

Here's re-estimation in practice:

```
>>> al = hmm.forward(array([0,1,1,2]), a, b, pi)
>>> be = hmm.backward(array([0,1,1,2]), a, b, pi)
>>> (a_new, b_new, pi_new) = hmm.reestimate(array([0,1,1,2]), al, be, a, b, pi)
>>> a_new
array([[ 0.53662942,  0.46337058],
       [ 0.39886289,  0.60113711]])
>>> b_new
array([[ 0.35001693,  0.55333559,  0.09664748],
       [ 0.14235731,  0.44259786,  0.41504483]])
>>> pi_new
array([ 0.72574077,  0.27425923])
```

And here's another one:

```
>>> al = hmm.forward(array([0,2,1,2]), a, b, pi)
>>> be = hmm.backward(array([0,2,1,2]), a, b, pi)
>>> (a_new, b_new, pi_new) = hmm.reestimate(array([0,2,1,2]), al, be, a, b, pi)
>>> a_new
array([[ 0.34624522,  0.65375478],
       [ 0.35236106,  0.64763894]])
>>> b_new
array([[ 0.43532693,  0.30443136,  0.26024171],
       [ 0.13435433,  0.21603434,  0.64961132]])
>>> pi_new
array([ 0.66907982,  0.33092018])
```

Once you've got this all set up, we can run EM on this single example:

```
>>> (a_em,b_em,pi_em,logProbs) = hmm.runEM(array([0,1,1,2]), 2, 3)
iteration 1... log probability -5.25485
iteration 2... log probability -4.15148
iteration 3... log probability -4.14962
iteration 4... log probability -4.14593
iteration 5... log probability -4.13452
iteration 6... log probability -4.09672
iteration 7... log probability -3.99093
iteration 8... log probability -3.80666
iteration 9... log probability -3.64349
iteration 10... log probability -3.49871
iteration 11... log probability -3.29489
iteration 12... log probability -3.0354
iteration 13... log probability -2.83776
iteration 14... log probability -2.78245
iteration 15... log probability -2.77644
iteration 16... log probability -2.77452
iteration 17... log probability -2.77356
iteration 18... log probability -2.77308
iteration 19... log probability -2.77283
iteration 20... log probability -2.77271
iteration 21... log probability -2.77265
iteration 22... log probability -2.77262
iteration 23... log probability -2.7726
iteration 24... log probability -2.7726
iteration 25... log probability -2.77259
...
iteration 50... log probability -2.77259
>>> a_em
array([[  2.54354295e-293,   1.00000000e+000],
       [  1.00000000e+000,   5.67486661e-014]])
>>> b_em
array([[  5.00000000e-001,   5.00000000e-001,   1.49055507e-282],
       [  0.00000000e+000,   5.00000000e-001,   5.00000000e-001]])
>>> pi_em
array([ 1.,  0.])
```

**WU9:** Why does EM settle in this configuration? What is it
saying? What happens when you use three states instead of two?
What's the resulting probability of the data? What about four states?
What happens and why?

Now, let's move on to some fun data. Take a look at `words.txt`. This
contains a small news article. We're going to run EM on this data
where each *character* is an observation. We can load this data
as:

```
>>> (words, wordsDict) = datasets.readCharacterFile("words.txt")
>>> words
array([ 8,  1, 22, ..., 12,  5, 19])
>>> wordsDict[words]
array(['h', 'a', 'v', ..., 'l', 'e', 's'],
      dtype='|S1')
```

A space character has gotten translated to observation 0, and then "a"
is 1, "b" is 2, ..., and "z" is 26. There are no other characters
allowed.

**WU10:** Run EM on this data using two states. Run it a few times
until you get a final log probability of at least -9250. (This should
happen within two or three trials; each run will take a few minutes.)
When you do this, it will print out the resulting initial state
probabilities, transition probabilities, and emission probabilities.
What has it learned? Include all of this output in your write-up.

Next, Gaussian mixture models: we'll quickly run through the same experiments we did with k-means:

```
>>> mu0 = clustering.initialize_clusters(datasets.X2d, 2, 'determ')
>>> (mu,Si,pk,ll) = clustering.gmm(datasets.X2d, mu0)
Iteration 0... ll -136.062
Iteration 1... ll -134.546
Iteration 2... ll -134.603
...
Iteration 98... ll -133.888
Iteration 99... ll -133.888
>>> pk
array([ 0.49705476,  0.50294524])
>>> mu
array([[ 2.21490879,  1.42426306],
       [-2.23276225, -2.22951005]])
>>> Si
array([[[ 0.80769547,  0.11842686],
        [ 0.11842686,  1.41535853]],

       [[ 0.63844626, -0.06697666],
        [-0.06697666,  0.98891016]]])
```

**Hint:** While running, this will plot the results of EM. If you want
to turn that off, comment out the obvious line in the `gmm`
function. Plus, when it says "Press enter to continue", if you type
"q" and press enter, it will stop bugging you.

You can also play with the example I did in class:

```
>>> mu0 = clustering.initialize_clusters(datasets.X2d2, 4, 'determ')
>>> (mu,Si,pk,ll) = clustering.gmm(datasets.X2d2[0:50,:], mu0)
Iteration 0... ll -221.05
Iteration 1... ll -214.35
...
Iteration 99... ll -174.277
>>> pk
array([ 0.37999285,  0.18      ,  0.17999931,  0.26000783])
>>> mu
array([[ 3.27887088, -1.11009874],
       [-4.25488151,  2.14473591],
       [ 0.58294548,  4.76656061],
       [-3.22310253, -4.17603284]])
>>> Si
array([[[ 1.47527597,  0.18752747],
        [ 0.18752747,  1.86006074]],

       [[ 0.1634365 ,  0.0515103 ],
        [ 0.0515103 ,  0.27863332]],

       [[ 0.13719571, -0.13635173],
        [-0.13635173,  0.79989168]],

       [[ 0.84347685,  0.16791742],
        [ 0.16791742,  0.94558951]]])
```

This runs only on the first 50 points. If you run on all the data, it takes quite a bit longer, but I get:

```
>>> (mu,Si,pk,ll) = clustering.gmm(datasets.X2d2, mu0)
Iteration 0... ll -1406.57
Iteration 1... ll -1376.17
Iteration 2... ll -1349.41
Iteration 3... ll -1321.55
Iteration 4... ll -1299.73
Iteration 5... ll -1282.37
Iteration 6... ll -1267.76
...
Iteration 98... ll -1218.32
Iteration 99... ll -1218.32
>>> pk
array([ 0.33112729,  0.16666649,  0.16823876,  0.33396746])
>>> mu
array([[ 2.96732434, -0.96002069],
       [-3.92433308,  1.99052864],
       [ 0.81200349,  4.87488728],
       [-3.19631327, -4.1116481 ]])
>>> Si
array([[[ 1.99362992, -0.0459743 ],
        [-0.0459743 ,  2.35381307]],

       [[ 0.29393661, -0.04375116],
        [-0.04375116,  0.23631325]],

       [[ 0.72873692,  0.04663613],
        [ 0.04663613,  0.81434992]],

       [[ 1.02420113,  0.11266846],
        [ 0.11266846,  0.80237127]]])
```

You can do an additional test on the digits data (the `X` from `datasets.loadDigits()`):

```
>>> mu0 = clustering.initialize_clusters(X, 10, 'ldh')
>>> (mu,Si,pk,ll) = clustering.gmm(X, mu0)
>>> plot(ll)
>>> util.drawDigits(mu, arange(10))
```

(This takes a while to run for me: about 30 seconds total.)

**WU11:** Run the above 5 times. How many iterations does it seem
to take for GMM to converge using `ldh` initialization? Do the
resulting cluster means look like digits for most of these runs? Pick
the "best" run (i.e., the one with the lowest final objective) and
plot the digits (include the plot in the writeup). How many of the
digits 0-9 are represented? Which ones are missing? Try both
with `ldh` and with random initialization: how many iterations does it
take for GMM to converge (on average) for each setting?

**WU12:** Repeat WU11, but for k in 10, 15, 20, 25, 30. For each k,
pick the best of 5 runs, and plot the digits. How big does k have to
get before you start seeing representatives for each digit?