Carnegie Mellon University

Advanced Big Data 

Instructor: Ravi Starzl
CEU Units: 4.8
# of Lectures: 12
Hours per Week: 8-10
Tuition: $2,700

Course Objectives

This course will teach you the advanced skills required to create and deploy custom Big Data Analytics solutions using sophisticated machine-learning-based analysis. You will deploy MapReduce learning algorithms written in your own code, as well as pre-packaged learning algorithms on the Spark platform. Throughout the course, you will work with a single large dataset and progressively engineer better solutions to the given analytic task. Most of the work in this course will be team-based, task-oriented competitive engineering.

Upon course completion, students will:

  • Know how to write and deploy ensemble methods for MapReduce.
  • Know how to configure and use Spark for analytics.
  • Understand the principles that enable machine learning.
  • Know how to implement a variety of machine learning algorithms.
  • Understand where and when supervised or unsupervised learning methods are most applicable and which will lead to the best results.
  • Properly utilize validation methods to make accurate estimates of model performance in the real world.
  • Develop practical skills and intuition for reasoning with results of analysis.
  • Develop better skills for communicating results.

Prerequisites

Students considering this course should have completed the Introduction to Big Data Systems and Analytics course or, alternatively, have significant prior big data systems experience and instructor permission. Students should be familiar with Java, Pig, and Hive, and with the installation and configuration of Hadoop in a single-node distribution configuration. They should also be comfortable using command-line interfaces, know basic distributed computing concepts, be comfortable with undergraduate-level probability and statistics, and have good debugging skills.

Required Textbook

None

Topics

Lecture 1:      Machine Learning Fundamentals [Final project assigned]
Lecture 2:      Model Validation and Performance Metrics
Lecture 3:      Configuring and Using Mahout
Lecture 4:      Feature Selection and Engineering
Lecture 5:      Linear and Non-Linear Learning Algorithms
Lecture 6:      Configuring and Using RHIPE
Lecture 7:      Advanced MapReduce App Development
Lecture 8:      Configuring and Optimizing Hadoop for Your Cluster
Lecture 9:      Avro and ZooKeeper
Lecture 10:    Sqoop and Flume
Lecture 11:    Deductive Investigation of Confirmatory and Exploratory Analyses
Lecture 12:    Apache Storm