Data-Intensive Information Processing Applications: Assignment 1

Assignment 1a: Getting started on Hadoop

Due: Thursday 2/10 (11:59pm)

The primary purpose of this assignment is to familiarize you with running Hadoop in two different ways: in standalone mode and in the Cloudera VM. You will be asked to work through a few tutorials. The assignment does not involve actual coding, but requires a lot of activity on the command line (running Hadoop jobs, copying files around, etc.). This is the first part of a two-part assignment: in the second part, you'll run Hadoop on the Google/IBM cluster.

The secondary purpose of this assignment is to make sure that you have sufficient background to take this course. This assignment is written in such a way that you should be able to figure out details that we have omitted (for example, on downloading, configuring, and installing software). Also note: machines are configured in slightly different ways, and as a result you may run into issues that require troubleshooting (e.g., differences in install paths, environment settings, etc.). We expect that you have sufficient familiarity with general operating system concepts to be able to solve most issues yourself. If you are having a lot of trouble completing this assignment, you might not be ready for the course.

For this class, we'll be using Cloud⁹, a Hadoop library developed at Maryland both for this course and for research in text processing. The first goal of this assignment is to get the word count demo in Cloud⁹ running in standalone mode. In standalone mode, Hadoop runs in a single thread on your local machine. Do the following:

First, download Hadoop version 0.20.1. Unpack it somewhere you can get to it, and make sure that you can run "hadoop" from the command line.
Next, download and set up Cloud⁹. Note that you can either download a copy of the repository or clone it from GitHub (cloning with git will make future assignments easier).
Then, work through the tutorial on getting started in standalone mode.
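For orientation, the computation that a word count job performs can be sketched in a few lines of plain Python. This is not the Cloud⁹ code and does not use Hadoop; the function names are made up for illustration, and splitting on whitespace is an assumption about tokenization.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (term, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def reduce_phase(term, counts):
    """Sum all the partial counts collected for one term."""
    return term, sum(counts)


def word_count(lines):
    """Run map, group by key, then reduce, returning pairs sorted by term."""
    grouped = defaultdict(list)
    for line in lines:
        for term, one in map_phase(line):
            grouped[term].append(one)
    # sorted() stands in for the sort-by-key that happens between map and reduce.
    return [reduce_phase(term, grouped[term]) for term in sorted(grouped)]


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    for term, count in word_count(sample):
        print(f"{term}\t{count}")
```

The sketch keeps everything in memory, which is only reasonable for toy input; the point is the shape of the map and reduce steps, not performance.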

Now, answer the following questions:

Question 1. Have you successfully completed the above tutorials and run the word count demo in standalone mode? (yes or no)

Look at the output in demo/part-00000.

Question 2. What's the next term found in the collection after ''my? How many times does it appear?

Question 3. Scan down the output a few more lines: how many times does 'and appear in the collection?
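If you would rather not scan the output by eye, a small script can look up a term and the entry that follows it. The sketch below assumes each line of the output file has the form term, tab, count, and that lines are sorted by term; check your own output before relying on it. The function names and the sample data in the usage note are made up for illustration.

```python
def parse_output(lines):
    """Turn 'term<TAB>count' lines into a list of (term, count) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the count is always the final field.
        term, count = line.rsplit("\t", 1)
        pairs.append((term, int(count)))
    return pairs


def lookup_with_next(pairs, term):
    """Return ((term, count), next_pair); next_pair is None at the end."""
    for i, (current, count) in enumerate(pairs):
        if current == term:
            following = pairs[i + 1] if i + 1 < len(pairs) else None
            return (current, count), following
    return None, None
```

For example, passing an open file handle for demo/part-00000 to parse_output and then calling lookup_with_next(pairs, "banana") on made-up data such as apple/3, banana/1, cherry/7 would return the banana entry together with the cherry entry.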

Note: It is very important to understand that in standalone mode, there is no HDFS.

The second goal of this assignment is to get the word count demo running inside the Cloudera VM. Follow the instructions on the page to download the image and also VMware Player (for Windows and Linux) or VMware Fusion (for Mac). Start up the VM. Important: while 0.3.3 will work for this assignment, later assignments will use features available only in 0.3.4, so save yourself trouble and install 0.3.4 now. (0.3.3 also runs Ubuntu Intrepid Ibex, which is no longer supported, so apt-get will not work.)

Inside the VM, Hadoop is running in what's called "pseudo-distributed mode", which means that all the daemon processes (the JobTracker, NameNode, and TaskTracker, abbreviated JT, NN, and TT) are running on the same machine and communicating via loopback. Inside the VM, open up a browser, and you should see the Hadoop webapps.

Your task is now to run the Cloud⁹ word count demo inside the VM. This requires that you copy the data (bible+shakes.nopunc) over to the VM. Once the data is inside the VM, you'll need to put it into HDFS. You'll also need to copy the Cloud⁹ jar onto the VM. Once you've done all of this, you can submit a Hadoop job. Run the word count example on the bible+shakes.nopunc data with 5 reducers. Answer the following questions:

Question 4. Have you successfully run the word count demo inside the Cloudera VM on the sample dataset? (yes or no)

Question 5. What is the first term in part-00000 and how many times does it appear?

Question 6. What is the third to last term in part-00004 and how many times does it appear?

Question 7. How long did it take you to complete this assignment?
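Questions 5 and 6 refer to part-00000 and part-00004 because a job with 5 reducers writes one output file per reducer. Each file is sorted by term, but the terms are spread across the files by a partitioner, so the five files together are not in one global order. The sketch below mimics this in Python under two assumptions that you should verify for yourself: that the demo uses Hadoop's default hash partitioner, and that its keys are Text objects whose hash is computed over the UTF-8 bytes. The function names are made up for illustration.

```python
def java_text_hash(term):
    """Mimic a Java-style 31*h + byte hash over UTF-8 bytes (assumed behavior)."""
    h = 1
    for b in term.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF  # wrap around like a 32-bit int
    return h


def partition(term, num_reducers=5):
    """Clear the sign bit, then take the remainder to pick a reducer."""
    return (java_text_hash(term) & 0x7FFFFFFF) % num_reducers


def part_file_name(reducer_index):
    """Output file name for a reducer, e.g. index 4 gives part-00004."""
    return f"part-{reducer_index:05d}"


if __name__ == "__main__":
    for term in ["a", "ab", "hadoop"]:
        print(term, "->", part_file_name(partition(term)))
```

Whatever hash is actually used, the consequence is the same: the same term always goes to the same reducer, and neighboring terms in sorted order usually end up in different part files.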

Hints:

Copying the data onto the Cloudera VM does not mean that the data is placed into HDFS. That requires a second step.
Here is a user guide for HDFS commands.
It is your job to figure out how to get data from your local machine onto the Cloudera VM. Think about using scp and ifconfig.
If you have trouble getting network connectivity in your VM, try deleting /etc/udev/rules.d/70-persistent-net.rules and restarting the system.
If you're using VirtualBox, try installing an extension pack. It makes the resolution tolerable. (You have to mount the virtual CD in the VM.)
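The two-step nature of the first hint (copy to the VM, then put into HDFS) can be made concrete with a small Python sketch that only builds and prints the commands rather than running them. The IP address and user name below are placeholders, not values from this assignment: find the VM's real address with ifconfig and use whatever account your VM provides. The function name is made up for illustration.

```python
import shlex


def transfer_commands(vm_ip, vm_user, local_file):
    """Return argv lists for: copy to the VM, put into HDFS, then list it."""
    file_name = local_file.rsplit("/", 1)[-1]
    return [
        # Step 1 (run on your local machine): copy into the VM user's home directory.
        ["scp", local_file, f"{vm_user}@{vm_ip}:{file_name}"],
        # Step 2 (run inside the VM): copy from the VM's local disk into HDFS.
        ["hadoop", "fs", "-put", file_name, file_name],
        # Step 3 (run inside the VM): confirm the file is now visible in HDFS.
        ["hadoop", "fs", "-ls", file_name],
    ]


if __name__ == "__main__":
    # 192.0.2.10 and "vmuser" are hypothetical placeholders.
    for argv in transfer_commands("192.0.2.10", "vmuser", "data/bible+shakes.nopunc"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; you would still type the first command on your own machine and the other two inside the VM.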

Submission Instructions

This assignment is due by 11:59pm, Thursday 2/10. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 1a]" as the subject. In the body of the email put answers to the questions above.

Important: Follow these instructions exactly as specified. This means exactly the subject line indicated above and answers in the email message itself (not as an attachment).