Assignment 1b: Getting started on Hadoop at scale
Due: Tuesday 2/22 (11:59pm)
The purpose of this assignment is to familiarize you with running Hadoop on the Google/IBM cluster. It is the sequel to Assignment 1-1 and builds directly on it.
First, let's start with the Google/IBM cluster. You should have received separate instructions on account creation. Remember, when selecting a username, please prefix it with "ccc_", so that, for example, I would be "ccc_jbg". This allows us to distinguish students in the class from other cluster users.
On the cluster, we've prepped a raw text dump of Wikipedia for you to play with:
hadoop fs -ls /tmp/wiki
You can check out the contents with something like this:
hadoop fs -cat /tmp/wiki/part-00000 | head
Now, run the word count demo on this dataset, with 100 reducers:
hadoop jar cloud9.jar edu.umd.cloud9.example.simple.DemoWordCount /tmp/wiki /tmp/jbg-course/cnt1-USERNAME 100
Substitute USERNAME with your actual username without the "ccc_" prefix. Therefore, I would put the output in /tmp/jbg-course/cnt1-jbg. It is important that you follow these instructions exactly, because this is where we are going to look for your output.
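One way to avoid typos in the output path is to derive it from your login name. The sketch below assumes your cluster login starts with "ccc_" and is passed in (or available in $USER); the helper name output_dir is hypothetical and not part of the course materials.

```shell
# Hypothetical helper: build the output directory for a given run label.
# Assumes the login name carries the "ccc_" prefix described above.
output_dir() {
  label="$1"               # e.g. cnt1 or cnt2
  login="${2:-$USER}"      # defaults to the current login name
  short="${login#ccc_}"    # strip a leading "ccc_" if present
  printf '/tmp/jbg-course/%s-%s\n' "$label" "$short"
}

output_dir cnt1 ccc_jbg    # prints /tmp/jbg-course/cnt1-jbg
```

The printed path can then be pasted in place of /tmp/jbg-course/cnt1-USERNAME in the command above.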
Question 1. What is your job id? If you ran the code more than once, any job id of a successful run will do.
Question 2. How large is the input data? (Hint: look in the jobtracker webapp.)
Question 3. How many map tasks does your job contain?
Question 4. What is the 6th word in part-00042 and how many times does it appear?
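For Question 4, a small filter that prints a single line of a stream may be handy. This is only a sketch: it assumes "the 6th word" means the 6th line of the part file and that each line holds one word and its count; nth_line is a hypothetical helper name, and the sample data is made up.

```shell
# Hypothetical helper: print only line N of standard input, then stop reading.
nth_line() {
  sed -n "${1}{p;q;}"
}

# Local demonstration on made-up data (prints the second line).
printf 'alpha\t1\nbeta\t2\ngamma\t3\n' | nth_line 2

# On the cluster (not run here; substitute your own USERNAME):
#   hadoop fs -cat /tmp/jbg-course/cnt1-USERNAME/part-00042 | nth_line 6
```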
You'll notice that there is a lot of "junk" in the output. Let's try to clean this up by throwing away terms that don't appear often. Modify the word count demo so that it retains only words that occur more than 100 times (i.e., cnt > 100).
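Note that the threshold is strict: a word with exactly 100 occurrences is dropped. As a sanity check of that boundary (not a substitute for changing the Java code), the same filter can be sketched over a few made-up lines. This assumes the demo writes one tab-separated word and count per line; keep_frequent is a hypothetical helper name.

```shell
# Hypothetical helper: keep only lines whose count (second tab-separated
# field) is strictly greater than 100.
keep_frequent() {
  awk -F'\t' '$2 > 100'
}

# Made-up sample in the assumed "word<TAB>count" layout.
# "rare" (exactly 100) and "zzz" (3) are filtered out.
printf 'the\t5000\nlife\t101\nrare\t100\nzzz\t3\n' | keep_frequent
```

Running your cnt1 output through a filter like this gives a reference against which to compare the output of your modified job.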
Once you've modified the program (remember to remake the jar file!), run it again:
hadoop jar cloud9.jar edu.umd.cloud9.example.simple.DemoWordCount /tmp/wiki /tmp/jbg-course/cnt2-USERNAME 10
This time, use only 10 reducers. Note the slightly different path in which to put your results.
Question 5. How many terms appear more than 100 times in the collection?
Question 6. How many times does "life" appear in the collection?
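With 10 reducers the results are spread over 10 part files, so Questions 5 and 6 need to look across all of them. A minimal sketch, again assuming one tab-separated word and count per line; count_terms and lookup are hypothetical helper names and the sample data is made up.

```shell
# Hypothetical helper: count terms, assuming one term per line.
count_terms() {
  wc -l | tr -d ' '
}

# Hypothetical helper: print the count for an exact match on the word column,
# so that "lifetime" is not mistaken for "life".
lookup() {
  awk -F'\t' -v w="$1" '$1 == w { print $2 }'
}

# Local demonstration on made-up data.
printf 'life\t7\nlifetime\t2\n' | count_terms
printf 'life\t7\nlifetime\t2\n' | lookup life

# On the cluster (not run here; substitute your own USERNAME):
#   hadoop fs -cat '/tmp/jbg-course/cnt2-USERNAME/part-*' | count_terms
#   hadoop fs -cat '/tmp/jbg-course/cnt2-USERNAME/part-*' | lookup life
```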
This assignment is due by 11:59pm, Tuesday 2/22. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 1b]" as the subject. In the body of the email put answers to the questions above.
Note: The Google/IBM cluster is a shared resource accessible by many. Any impropriety on the cluster will be taken very seriously. This includes tampering or attempting to tamper with another student's results, attempting to pass another student's result as one's own, etc. See the Code of Academic Integrity or the Student Honor Council for more information.
- The support page for the cluster has examples of configuration files you can download if you don't have luck with these instructions.