Assignment 6: Minding the Hive
Due: Thursday 4/28 (11:59pm)
Background
This assignment builds on the data that we discussed in class. Make sure you have a careful look at the lecture slides; it will make your life much easier.
These data were collected by the Movie Lens project at the University of Minnesota and represent users' reviews of movies. Each movie has a number of reviews. The data are at the MovieLens site.
These data aren't particularly large, so we strongly encourage you to use the Cloudera VM you set up for the first assignment (and probably have been using since for development). However, hopefully this will give you intuitions about how to design algorithms and how you'd cope with databases that are actually too large to fit on a single machine.
Setting Up
First, you need to set up the data on your computer.
- create table 'u_item' based on the schema of file 'u.item'
- import file 'u.item' to table 'u_item'
- create table 'u_data' based on the schema of the file 'u.data'
- import file 'u.data' to table 'u_data'
- Do a simple query or two to check if everything is OK (for example Body Snatchers is movie 573 and it has 33 reviews)
Questions
- What is the variance (not sample-corrected variance) of the reviews for the movie Body Snatchers?
- We are interested in the distribution of the first letter of movie titles, i.e., we want to count the number of movies sharing the same first letter in the title.
- create a table that stores the itemid and the first letter
- create a Python script 'firstletter.py' that returns (itemid, title, first_letter) for a given pair (itemid, title)
- fill the table u_item_first
- List the movie titles starting with 'V'.
- Let's dig in a little to what Hive does behind the scenes. Data can be
sorted
using 'SORT BY' or 'ORDER BY'. We consider the following 'query' that sorts the letters based on the number of movies.
SET mapred.reduce.tasks=[NUMBER_OF_TASKS]; SELECT firstl, count(itemid) AS n FROM u_item_first GROUP BY firstl [SORT_OR_ORDER] BY n DESC;
Execute the query using the following three configurations:
- [NUMBER_OF_TASKS]=1; [SORT_OR_ORDER]=SORT
- [NUMBER_OF_TASKS]=2; [SORT_OR_ORDER]=SORT
- [NUMBER_OF_TASKS]=2; [SORT_OR_ORDER]=ORDER
Submission Instructions
This assignment is due by 11:59pm, Thursday 4/28. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 6]" as the subject. In the body of the email put answers to the questions above, Hive commands to get the answers you provided, and any source code you created or modified.
Hints / Tips
- Investigate the anagram example from the lecture for a starting place for the first letter table.
- You can probably do the first-letter table in native Hive, but please use Python.
- In Python, you can access strings using the [] operator.