Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland, College Park
{jimmylin,redpony}@umd.edu

Overview

This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, and practical advice on writing Hadoop programs and running Hadoop clusters.

Amazon has generously agreed to provide each participant with $100 in Amazon Web Services (AWS) credits that can be used toward its Elastic Compute Cloud (EC2) "utility computing" service (sufficient for 1000 instance-hours). EC2 allows anyone to rapidly provision Hadoop clusters "on the fly" without upfront hardware investments, and provides a low-cost vehicle for exploring Hadoop.

Intended Audience

The tutorial is targeted at any NLP researcher interested in data-intensive processing and scalability issues in general. No background in parallel or distributed computing is necessary, but prior knowledge of HLT is assumed.

Course Objectives

· Acquire understanding of the MapReduce programming model and how it relates to alternative approaches to concurrent programming.
· Acquire understanding of how data-intensive HLT problems (e.g., text retrieval, iterative optimization problems) can be solved using MapReduce.
· Acquire understanding of the tradeoffs involved in designing MapReduce algorithms and awareness of associated engineering issues.
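As a rough illustration of the programming model named in the objectives above, the sketch below runs a word count through map, group-by-key, and reduce phases inside a single Python process. It is not Hadoop code and is not taken from the tutorial materials: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce`, and the two sample documents, are hypothetical and chosen only for this example. A real framework would distribute the same three phases across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Mapper (hypothetical name): emit (word, 1) for each whitespace token."""
    for token in text.lower().split():
        yield token, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reducer (hypothetical name): sum the partial counts for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Grouping values by key plays the role of the shuffle phase.
            grouped[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(grouped.items()))


# Made-up input: (document id, document text) pairs.
docs = [("d1", "a rose is a rose"), ("d2", "a daisy is not a rose")]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)
```

The mapper and reducer never share state; they communicate only through key-value pairs. That restriction is what lets a framework run many copies of each in parallel, and it is also the source of the design tradeoffs the tutorial sets out to discuss.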

Tutorial Topics

The following topics will be covered:

· MapReduce algorithm design
· Distributed counting applications (e.g., relative frequency estimation)
· Applications to text retrieval
· Applications to graph algorithms
· Applications to iterative optimization algorithms (e.g., EM)
· Practical Hadoop issues
· Limitations of MapReduce

Instructor Bios

Jimmy Lin is an assistant professor in the iSchool at the University of Maryland, College Park. He joined the faculty in 2004 after completing his Ph.D. in Electrical Engineering and Computer Science at MIT. Dr. Lin's research interests lie at the intersection of natural language processing and information retrieval. He leads the University of Maryland's effort in the Google/IBM Academic Cloud Computing Initiative. Dr. Lin has taught two semester-long Hadoop courses [2] and has given numerous talks about MapReduce to a wide audience.

Chris Dyer is a Ph.D. student at the University of Maryland, College Park, in the Department of Linguistics. His current research interests include statistical machine translation, machine learning, and the relationship between artificial language processing systems and the human linguistic processing system. He has served on program committees for AMTA, ACL, COLING, EACL, EMNLP, NAACL, IWSLT, and the ACL Workshops on Machine Translation, and is one of the developers of the Moses open-source machine translation toolkit. He has practical experience solving NLP problems with both the Hadoop MapReduce framework and Google's MapReduce implementation, which was made possible by an internship with Google Research in 2008.

Proceedings of NAACL HLT 2009: Tutorials, pages 1-2, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Acknowledgments

This work is supported by NSF under awards IIS-0705832 and IIS-0836560; the Intramural Research Program of the NIH, National Library of Medicine; and DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program.
Any opinions, findings, conclusions, or recommendations expressed here are the instructors' and do not necessarily reflect those of the sponsors. We are grateful to Amazon for its support of tutorial participants.

References

[1] Dean, Jeffrey and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137-150, 2004, San Francisco, California.
[2] Lin, Jimmy. Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008, pp. 54-61, 2008, Columbus, Ohio.