Data-Intensive Information Processing Applications (Spring 2011)

What's the course about?

This course is about scalable approaches to processing large amounts of information (terabytes and even petabytes). We focus mostly on MapReduce, which is presently the most accessible and practical means of computing at this scale, but will discuss other approaches as well.

MapReduce is a programming model for expressing distributed computations on massive amounts of information and an execution framework for large-scale information processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo.
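To make the programming model concrete, here is a minimal single-process sketch of word counting in the MapReduce style. It is plain Python, not the Hadoop API, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, chosen only for illustration. A real execution framework would distribute the map, shuffle, and reduce phases across a cluster.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Mapper: emit a (word, 1) pair for every token in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer: sum all partial counts that share the same key."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory stand-in for the execution framework (an assumption,
    not how Hadoop is implemented)."""
    # Map phase plus shuffle: group every intermediate value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: process keys in sorted order, one reducer call per key.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


docs = [("d1", "a rose is a rose"), ("d2", "is a rose")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The programmer writes only the mapper and the reducer. Grouping by key, scheduling, and fault handling belong to the framework, which is what lets the same two functions run on one machine or on many.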

This is "version 4.0" of a course that was previously offered in Spring 2010, Spring 2008, and Fall 2008. The core content is similar, but the organization and scope have been overhauled for the fourth time, with the assignments in particular being revamped.

Course readings will be drawn from research papers available online, as well as the following two textbooks:

Tom White, Hadoop: The Definitive Guide, Second Edition, O'Reilly, 2010. (The full text of this book is available to UMD students via Safari Books Online, but you may choose to purchase a paper copy for handy reference.)
Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool, 2010.

In addition to the standard distributions of Hadoop, this course uses Cloud⁹, a MapReduce library for Hadoop developed at the University of Maryland.

Pre-requisites

No previous experience with MapReduce or parallel and distributed programming is necessary. However, students taking this course should be competent Java programmers, since concepts taught in class will be reinforced through extensive programming exercises in the Hadoop implementation of MapReduce (which is in Java). Note that this is a course on algorithms and "thinking at scale"—not about Hadoop programming. Therefore, we expect you to "pick up" the details of the Hadoop API without explicit instruction from us. Of course, we will assist you by providing resources and a reasonable amount of guidance.

In addition, students are assumed to have knowledge of basic probability and statistics (e.g., axioms of probability, Bayes' Theorem, relative frequency estimation, etc.) and also a solid understanding of basic computer architecture (e.g., microprocessor architectures, memory hierarchies, cache coherence protocols, etc.).

Course Grade

Exams (30%): There will be a midterm exam (15%) and a final exam (15%).
Final project (30%): There will be a group final project (~3 students) on a topic of your own choosing. More details will be discussed later in the semester.
Class assignments (30%): There will be six assignments, but we will drop your lowest grade and take the average of the remaining five assignment grades. Each of those five assignment grades is worth 6% of your course grade.
Class participation (10%): Showing up for class, demonstrating preparedness (i.e., doing the readings), and contributing to class discussions (both in class and on the mailing list).

See the course schedule for assignment due dates and exam dates.
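The weighting above can be sketched as a short calculation. This is only an illustration under stated assumptions: exam, project, and participation scores are expressed as fractions between 0 and 1, assignment scores are points out of 50, bonus points are not capped, and the function name `course_grade` is hypothetical. The actual grade computation is up to the instructors.

```python
def course_grade(midterm, final_exam, project, assignments, participation):
    """Combine component scores into a course grade in the range 0..1.

    midterm, final_exam, project, participation: fractions in [0, 1].
    assignments: list of six scores, each in points out of 50.
    """
    if len(assignments) != 6:
        raise ValueError("expected six assignment scores")

    # Drop the lowest assignment and average the remaining five.
    best_five = sorted(assignments)[1:]
    assignment_avg = sum(best_five) / len(best_five) / 50.0

    return (
        0.15 * midterm            # midterm exam
        + 0.15 * final_exam       # final exam
        + 0.30 * project          # group final project
        + 0.30 * assignment_avg   # class assignments
        + 0.10 * participation    # class participation
    )


grade = course_grade(
    midterm=0.80,
    final_exam=0.90,
    project=0.85,
    assignments=[45, 50, 40, 0, 48, 42],  # the 0 is dropped
    participation=1.0,
)
print(f"{grade:.2%}")
```

In this example the missed assignment (0 points) is the one dropped, so the assignment average is 45 out of 50.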

Homework Assignments

The homework assignments are designed to help you learn the material, and will be graded out of 50 points each. You can earn more than 50 points by going above and beyond what the assignment requires.

Policy for Late and Incomplete Work

Assignment deadlines: Assignments are due at the beginning of class unless otherwise noted.
Late assignments: You have four late days that can be used for any assignment in any combination, so long as the total number of late days does not exceed four (e.g., one late day for assignment 2 and three late days for assignment 3). You do not need prior permission to use these default late days. Exceptions can be discussed in cases of medical excuses, family emergencies, etc., but being busy is not a valid excuse. Exceptions will only be considered prior to the deadline, and the sooner you talk to us about a problem, the more likely we are to be sympathetic. There are several common problems that we are unlikely to consider valid reasons for failing to get work in on time. These include (a) failure to manage your time properly, (b) discovering that an assignment is harder or takes longer than you expected (see item a), (c) having competing priorities such as other classes, job interviews, etc., and (d) losing code or data that should have been backed up, unless it's someone else's fault. Homework submitted beyond the default late policy without a prior excuse will be graded at 50% credit up to one week late. Beyond that, homework will not be graded (you will get a zero).
'Incomplete' as a grade: We will not issue an 'incomplete' as a grade except for serious, valid reasons, generally in the category of emergencies. See above for some reasons unlikely to be considered valid. If you are having problems of any kind, please talk to us as soon as possible.

Academic Integrity

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. Please visit the Code of Academic Integrity or the Student Honor Council for more information.

Students with Disabilities

Students with disabilities needing academic accommodation should: (1) register with and provide documentation to the Disability Support Services office, and (2) discuss any necessary academic accommodation with me. This should be done at the beginning of the semester, within the first three class sessions.

Emergency Preparedness

Information about the status of the campus is available at the campus emergency preparedness website. If the campus is closed, please make sure to stay safe. Information about possible rescheduling of course activities will be provided via e-mail once the campus has reopened.