
Introduction to R and GGobi

Özlem İlk

Middle East Technical University, Department of Statistics

May 17, Thursday, 14:00-17:30

Department of Mathematics, Computer Lab

Introduction to GGobi: Free data visualization software

Duration: 1 hour

GGobi is free software for visualizing high-dimensional data. In this short course on GGobi, we will start by downloading it from the internet. Next, basic features such as brushing, identifying, and jittering will be illustrated. We will also demonstrate the following tools of the software: variable manipulation, handling of missing data, case subsetting, and sampling. Interactive graphics will be illustrated through rotation and projection of high-dimensional data. The methods will be demonstrated on some of the demo datasets available in GGobi.

Introduction to R: A Free Computer Language and Computing Environment for Everyone

R is one of the most popular software environments for statistical computing and graphics, and its users are no longer limited to statisticians. In this short course on R, we will start by demonstrating how to download this free software from the internet. Next, accessing packages, libraries, and help menus will be illustrated. One of the biggest challenges for new R users is reading data into the environment; several solutions to this problem will be presented. Saving your results to an external file will also be covered. Basic applications, such as matrix operations, random number generation, creating graphics, and writing your own small functions, will be covered as well.


MapReduce and Hadoop: Mining Big Data in the Cloud

METU Graduate School of Informatics

May 23, Wednesday, 14:00-16:30
Institute of Applied Mathematics, S209

This talk will be an introduction to the MapReduce [1] paradigm for distributed computing, originally developed by Google for processing extremely large amounts of data. MapReduce scales the functional programming operators 'map' and 'fold' up to large, heterogeneous, and loosely coupled computing clusters in order to perform arbitrarily complex processing in a parallel and distributed manner. The talk will elaborate on the details of computation and data flow in MapReduce using examples from well-known algorithms in information retrieval and data mining.
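To make the 'map' and 'fold' idea concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and nothing here uses Hadoop or runs on a cluster. A real framework would run the map and reduce calls on different machines and perform the grouping step itself.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single count."""
    return key, reduce(lambda acc, v: acc + v, values, 0)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical input, chosen only for illustration.
docs = ["big data in the cloud", "mining big data"]
counts = word_count(docs)
print(counts)
```

Because each call to `map_phase` looks at one document only and each call to `reduce_phase` looks at one key only, the calls are independent of one another, which is what allows the work to be spread over many machines.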

Hadoop [2] is a software framework developed by the Apache Software Foundation that provides open-source and accessible derivative implementations of MapReduce and the closely related Google (distributed) File System [3]. Hadoop allows researchers and developers to utilize MapReduce for their own projects and has thus played a big role in the popularity of MapReduce in both enterprise and academic settings. The talk will give a broad overview of the Hadoop framework alongside the discussion of MapReduce, and will hopefully give interested listeners enough information to start using Hadoop/MapReduce in their own research.

The latter part of the talk will include a demo illustrating the application of MapReduce to data clustering (and market basket analysis, if time permits) using several Hadoop instances running on Amazon EC2 [4].
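The abstract does not say which clustering algorithm the demo uses. As an illustration only, assume k-means: one iteration can be phrased as a map step that assigns each point to its nearest centroid and a reduce step that averages the points of each cluster. The sketch below runs in a single process, and all names (`nearest`, `kmeans_map`, `kmeans_reduce`, `kmeans_step`) and the sample points are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map: emit (cluster index, (point, 1)) for one data point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reduce: sum the points and counts of one cluster and return their mean."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_step(points, centroids):
    """One k-means iteration expressed as map, group by key, reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A cluster that received no points keeps its old centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical 2-D points and starting centroids, for illustration only.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
print(centroids)
```

In a distributed setting, `kmeans_step` would be submitted as one job per iteration, with the current centroids shipped to every mapper, and repeated until the centroids stop moving.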

[1] Dean J., Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters; Proc. of OSDI’04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004

[2] The Apache Hadoop Project, http://hadoop.apache.org/

[3] Ghemawat S., Gobioff H., Leung S-T. The Google File System; Proc. of SOSP’03: 19th ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003

[4] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/
