## Assignment 6: Minding the Hive

Due: Thursday 4/28 (11:59pm)

### Background

This assignment builds on the data that we discussed in class. Make sure you have a careful look at the lecture slides; it will make your life much easier.

These data were collected by the Movie Lens project at the University of Minnesota and represent users' reviews of movies. Each movie has a number of reviews. The data are at the MovieLens site.

These data aren't particularly large, so we strongly encourage you to use the Cloudera VM you set up for the first assignment (and probably have been using since for development). However, hopefully this will give you intuitions about how to design algorithms and how you'd cope with databases that are actually too large to fit on a single machine.

### Setting Up

First, you need to set up the data on your computer.

1. create table 'u_item' based on the schema of file 'u.item'
2. import file 'u.item' to table 'u_item'
3. create table 'u_data' based on the schema of the file 'u.data'
4. import file 'u.data' to table 'u_data'
5. Do a simple query or two to check if everything is OK (for example Body Snatchers is movie 573 and it has 33 reviews)

### Questions

1. What is the variance (not sample-corrected variance) of the reviews for the movie Body Snatchers?
2. We are interested in the distribution of the first letter of movie titles, i.e., we want to count the number of movies sharing the same first letter in the title.
1. create a table that stores the itemid and the first letter
2. create a Python script 'firstletter.py' that returns (itemid, title, first_letter) for a given pair (itemid, title)
3. fill the table u_item_first
4. List the movie titles starting with 'V'.
3. Let's dig in a little to what Hive does behind the scenes. Data can be sorted using 'SORT BY' or 'ORDER BY'. We consider the following 'query' that sorts the letters based on the number of movies.
```SET mapred.reduce.tasks=[NUMBER_OF_TASKS];
SELECT firstl, count(itemid) AS n
FROM u_item_first
GROUP BY firstl
[SORT_OR_ORDER] BY n DESC;
```

Execute the query using the following three configurations: