Data-Intensive Information Processing Applications (Spring 2011)

What's the course about?

This course is about scalable approaches to processing large amounts of information (terabytes and even petabytes). We focus mostly on MapReduce, which is presently the most accessible and practical means of computing at this scale, but will discuss other approaches as well.

MapReduce is a programming model for expressing distributed computations on massive amounts of information and an execution framework for large-scale information processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo.
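To make the programming model concrete, here is a minimal single-machine sketch of the classic word count example, written in plain Python rather than against the Hadoop API. The names map_fn, reduce_fn, and run_mapreduce are hypothetical and chosen for illustration only. A real execution framework distributes the map and reduce calls across a cluster and handles the grouping step itself.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Mapper (hypothetical name): emit (word, 1) for each word in one record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer (hypothetical name): sum all partial counts for one key."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for the execution framework (illustration only)."""
    # Map phase: apply the mapper to every input record and group the
    # intermediate values by key (the "shuffle" step).
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)

    # Reduce phase: apply the reducer once per distinct key, in sorted order.
    output = {}
    for key in sorted(intermediate):
        for out_key, out_value in reducer(key, intermediate[key]):
            output[out_key] = out_value
    return output


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The programmer writes only the two small functions. Partitioning the input, grouping intermediate pairs by key, and scheduling the work are left to the framework, which is what lets the same code run on one machine or on many.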

This is "version 4.0" of a course that was previously offered in Spring 2010, Spring 2008, and Fall 2008. The core content is similar, but the organization and scope have been overhauled for this fourth iteration, and the assignments in particular have been revamped.

Course readings will be drawn from research papers available online, as well as the following two textbooks:

In addition to the standard distributions of Hadoop, this course uses Cloud9, a MapReduce library for Hadoop developed at the University of Maryland.

Prerequisites

No previous experience with MapReduce or parallel and distributed programming is necessary. However, students taking this course should be competent Java programmers, since concepts taught in class will be reinforced through extensive programming exercises in the Hadoop implementation of MapReduce (which is in Java). Note that this is a course on algorithms and "thinking at scale"—not about Hadoop programming. Therefore, we expect you to "pick up" the details of the Hadoop API without explicit instruction from us. Of course, we will assist you by providing resources and a reasonable amount of guidance.

In addition, students are assumed to have knowledge of basic probability and statistics (e.g., axioms of probability, Bayes' Theorem, relative frequency estimation) and a solid understanding of basic computer architecture (e.g., microprocessor architectures, memory hierarchies, cache coherence protocols).

Course Grade

See the course schedule for assignment due dates and exam dates.

Homework Assignments

The homework assignments are designed to help you learn the material, and each will be graded out of 50 points. You can earn more than 50 points by going above and beyond what the assignment requires.

Policy for Late and Incomplete Work

Academic Integrity

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. Please visit the Code of Academic Integrity or the Student Honor Council for more information.

Students with Disabilities

Students with disabilities needing academic accommodation should: (1) register with and provide documentation to the Disability Support Services office, and (2) discuss any necessary academic accommodation with me. This should be done at the beginning of the semester, within the first three class sessions.

Emergency Preparedness

Information about the status of the campus is available at the campus emergency preparedness website. If the campus is closed, please make sure to stay safe. Information about possible rescheduling of course activities will be provided via e-mail once the campus has reopened.