Tutor: Prof. Dr. Artur Andrzejak, Institute of Computer Science of Heidelberg University, Germany
The emergence of very large data sets over the last decade has created considerable interest in their scalable mining and processing. Thanks to the availability of open-source tools such as Hadoop, such tasks can nowadays be performed with moderate programming effort and without deep knowledge of distributed systems. In this tutorial we will discuss all the necessary ingredients for mining "Big Data".
In the first part we will introduce the Map-Reduce programming paradigm and its "dataflow" extensions represented by the frameworks Spark, Apache Pig and DryadLINQ.
The second part will use simple machine learning algorithms (e.g. linear regression, k-means clustering) to illustrate how sequential algorithms can be adapted to these programming models.
The third and last part will be devoted to a survey of existing large-scale data mining implementations (especially Apache Mahout) and to the practical aspects of using these tools.
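To give a flavor of the second part, the following is a minimal sketch of how one iteration of k-means clustering might be phrased as a map step and a reduce step. It is plain Python that only simulates the paradigm in a single process; it does not use Hadoop, Spark or Mahout, and the function names (map_point, reduce_cluster, kmeans_iteration) and the sample data are assumptions made for illustration, not material taken from the tutorial.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    nearest = distances.index(min(distances))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that share one centroid index."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for i in range(dim):
            total[i] += point[i]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """Run one map / shuffle / reduce round and return the new centroids."""
    groups = defaultdict(list)  # stands in for the framework's shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points is kept unchanged (an assumption).
    return [
        reduce_cluster(groups[k]) if k in groups else tuple(centroids[k])
        for k in range(len(centroids))
    ]


# Hypothetical sample data: two obvious clusters in the plane.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

In a real distributed setting the map calls would run in parallel over partitions of the data, and the grouping by key would be performed by the framework rather than by a dictionary.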
Prerequisites: basic knowledge of statistics; familiarity with Java or a comparable programming language.
Time and place of the tutorial: April 1, 2013, 14:00, SUSU, main building, room 1007.
Duration: 3 hours (3 blocks, each a 50 min lecture + 10 min break/informal discussion).
The tutorial will be conducted in English.
To participate in this tutorial, please send a request by February 28 to gleb.radchenko@gmail.com with the following content:
Subject: PCT-2013: Request for tutorial
============================
Application ID: <<Personal application ID for participation in the conference>>
Name: <<Full name>>
Tutorial: Mining of Big Data: Programming Paradigms, Algorithms and Tools
============================