METU Graduate School of Informatics
May 23, Wednesday, 14:00-16:30
Institute of Applied Mathematics, S209
This talk will be an introduction to the MapReduce [1] paradigm for distributed computing, originally developed by Google for processing extremely large amounts of data. MapReduce scales the functional programming operators 'map' and 'fold' up to large, heterogeneous, and loosely coupled computing clusters in order to perform arbitrarily complex processing in a parallel and distributed manner. The talk will elaborate on the details of computation and data flow in MapReduce using examples from well-known algorithms in information retrieval and data mining.
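To make the computation and data flow concrete ahead of the talk, below is a minimal, framework-free word-count sketch in Java (all names are illustrative, not tied to any library): the map step emits a (word, 1) pair for every word in a document, and the reduce step folds those pairs into a per-word total. A MapReduce runtime would run map() on input records in parallel across the cluster, group the emitted pairs by key, and then run reduce() once per distinct key.

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Framework-free word-count sketch: the runtime would call map() on each
    // input record in parallel, group the emitted pairs by key, then call
    // reduce() once per distinct key.
    public class WordCountConcept {

        // map: (docId, text) -> list of (word, 1) pairs
        static List<Map.Entry<String, Integer>> map(String docId, String text) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            return pairs;
        }

        // reduce: (word, [1, 1, ...]) -> (word, total); in effect a fold
        static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
            int total = 0;
            for (int c : counts) total += c;
            return new AbstractMap.SimpleEntry<>(word, total);
        }
    }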
Hadoop [2] is a software framework developed by the Apache Software Foundation that provides open-source, readily accessible implementations of MapReduce and of the closely related Google (distributed) File System [3]. Hadoop allows researchers and developers to utilize MapReduce for their own projects and has thus played a large role in the popularity of MapReduce in both enterprise and academic settings. The talk will give a broad overview of the Hadoop framework in parallel with the discussion of MapReduce, and hopefully give interested listeners enough information to start utilizing Hadoop/MapReduce for their own research.
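For a sense of what this looks like in practice, here is the same word count expressed against Hadoop's Java API, closely following the standard Hadoop tutorial example; exact class and package names vary across Hadoop versions, so treat this as a sketch rather than a drop-in program.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for each input line, emit (word, 1) for every token.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }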
The latter part of the talk will include a demo illustrating the application of MapReduce to data clustering (and market basket analysis, if time permits) using several Hadoop instances running on Amazon EC2 [4].
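As a preview of the clustering demo, here is one plausible way a single k-means iteration could be phrased in MapReduce terms (a hypothetical sketch, not the demo's actual code): mappers assign each point to its nearest centroid, reducers average the points in each cluster to produce updated centroids, and the job is rerun until the centroids stop moving.

    import java.util.AbstractMap;
    import java.util.List;
    import java.util.Map;

    // One MapReduce round of k-means (hypothetical sketch). Centroids from
    // the previous round are broadcast to every mapper; the job is repeated
    // until the centroids converge.
    public class KMeansRound {

        // map: emit (nearestCentroidIndex, point) for each input point.
        static Map.Entry<Integer, double[]> map(double[] point, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.length; i++) {
                double d = 0;
                for (int j = 0; j < point.length; j++) {
                    double diff = point[j] - centroids[i][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = i; }
            }
            return new AbstractMap.SimpleEntry<>(best, point);
        }

        // reduce: average all points assigned to one centroid to get its
        // new position for the next round.
        static double[] reduce(int centroidIndex, List<double[]> points) {
            double[] mean = new double[points.get(0).length];
            for (double[] p : points)
                for (int j = 0; j < mean.length; j++) mean[j] += p[j];
            for (int j = 0; j < mean.length; j++) mean[j] /= points.size();
            return mean;
        }
    }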
[1] Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Proc. of OSDI '04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.
[2] The Apache Hadoop Project, http://hadoop.apache.org/
[3] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. Proc. of SOSP '03: 19th ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003.
[4] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/