Carnegie Mellon University

Introduction to Big Data Analytics

Instructor:        Ravi Starzl
CEU Units:         4.8
# of Lectures:     12
Hours per Week:    8-10
Tuition:           $2,700

Course Objectives

This course will introduce you to the fundamental technologies, platforms, and methods that enable Big Data analysis. You will learn how to set up and operate the Amazon Web Services (AWS) platform to complete real-world Big Data analysis tasks, and you will become comfortable summarizing and communicating your results. By the end of this course you will:

  • Understand the basic principles of high-performance computing, parallelization, distributed systems, and MapReduce.
  • Have a methodology for systematically enumerating information needs and structuring your analysis to meet those information needs.
  • Know how to set up and operate an AWS Hadoop cluster (Elastic MapReduce).
  • Be able to write your own MapReduce programs.
  • Understand the various technologies associated with Hadoop, including Pig, Hive, and HBase, and know where and when to deploy them.
  • Know how to write programs in Pig and Hive.
  • Be able to conduct basic statistical analysis on data.
  • Have a basic understanding of how to engineer features from data.
  • Have the skills to do real-world Big Data analysis.
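
As a rough illustration of the MapReduce model named in the objectives above, the following sketch counts words in plain Python. It is not course material and does not use Hadoop or Elastic MapReduce; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data analysis"]))
```

In a real cluster the map and reduce functions run in parallel on many machines and the framework performs the grouping; here all three steps run in one process so the data flow is easy to follow.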

Prerequisites

Computing resources: Students must have a 64-bit machine with a multi-core CPU that supports virtualization. The machine must have at least 10 GB of RAM.

Unix command-line familiarity:

  • Ability to list, copy, and delete files and to navigate the Unix file system from the command line
  • Familiarity with environment variables, including what they are used for and how to set them
  • Understanding of the concept of the path in the file system, and how the path is used by programs to locate required resources (e.g. libraries)
  • Understanding of what URIs and URLs are, and how they function

A basic understanding of Java or another object-oriented programming language is required to succeed in this class. An understanding of Python is desirable as well. Required Java concepts include:
  • Inheritance, encapsulation, and polymorphism
  • How to create classes in Java
  • How to import libraries from java archive files (.jar) and use their methods in your own class
  • Basic file operations, such as opening, reading, and closing files
  • Flow control constructs such as if statements and for loops
  • Data structures such as arrays and hash maps

An understanding of basic statistical concepts is also expected, such as:
  • What a distribution is
  • What the mean, median, and mode are
  • What a confidence interval is

Prior experience or knowledge of machine learning will be helpful, but is not required.
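
For readers checking themselves against the statistical prerequisites above, here is a minimal sketch of the mean, median, mode, and a confidence interval for the mean, using only Python's standard library. The function names and the sample values are hypothetical. The interval assumes a normal approximation with z = 1.96 (roughly 95%); for small samples a t-based interval would usually be more appropriate.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def summarize(sample: Sequence[float]) -> Dict[str, float]:
    """Return the three common measures of central tendency."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
    }


def mean_confidence_interval(sample: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for the mean.

    Assumption: normal approximation, with z = 1.96 for about 95% coverage.
    """
    center = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return center - z * standard_error, center + z * standard_error


if __name__ == "__main__":
    data = [2, 3, 3, 5, 7]  # made-up values for illustration only
    print(summarize(data))
    print(mean_confidence_interval(data))
```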

Required Textbook

None

Topics

Lecture 1:      What is Big Data - Overview of High-Performance Computing and Big Data Platforms
Lecture 2:      Overview of Hadoop Technology Stack, Cloud Computing, and Amazon Web Services
Lecture 3:      MapReduce & HDFS Fundamentals
Lecture 4:      EMR jobs and Public Datasets
Lecture 5:      Writing Your Own Java MapReduce Applications 
Lecture 6:      Analytics: Looking for Patterns in the Data
Lecture 7:      Introduction to Pig
Lecture 8:      Introduction to Hive
Lecture 9:      Introduction to NoSQL
Lecture 10:    Implementations and Algorithms That Scale
Lecture 11:    Communicating Your Findings Effectively
Lecture 12:    Preparing for On-Site Deployment