Muhammad Ali Gulzar | CS 5614 - Big Data Engineering

Course Information

Instructor: Muhammad Ali Gulzar
Office: 4106 Gilbert Place
Lecture: Tue/Thu 8:00 AM - 9:15 AM, in person in TORG 1040
Office Hours: Tuesday 9:30 AM - 10:30 AM
Optional Textbook: Database System Concepts. Avi Silberschatz, Henry F. Korth, and S. Sudarshan. 7th Edition

Course Description

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used only by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in big data application development, debugging, and testing practices, which have fallen behind the design and implementation of the DISC frameworks themselves. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a thorough understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components will include, but are not limited to:

Database fundamentals: languages, operations, and performance
Data-intensive scalable computing (DISC), e.g., Apache Spark, Hive, and MapReduce
DISC application development: programming, refactoring, optimization, and testing
Interactive and automated debugging for big data analytics and their performance
Configuration management and runtime optimizations in DISC
Data stream processing and incremental computation

Course Schedule

Week	Lecture	Topic	Description	Reading	Milestones	Optional Reading
1	Jan 21st	Database Fundamentals	Introduction, logistics, goals, & expectations
1	Jan 23rd	Database Fundamentals	Relational and dataflow operators	Chapter 2 & 3
2	Jan 28th	Database Fundamentals	SQL, Schema, Views	Chapter 4 & 5	Homework 1 Released
2	Jan 30th	Database Fundamentals	Constraints and SQL Operators	Chapter 4 & 5	Pick Demos
3	Feb 4th	Database Fundamentals	Indexing and Sorting	Chapter 12, 13, & 14
3	Feb 6th	Database Fundamentals	Transactions, procedures, and query optimization	Chapter 15,16, & 17
4	Feb 11th	Database Fundamentals	Transactions, procedures, and query optimization	Chapter 15,16, & 17	Homework 1 Due, Homework 2 Released
4	Feb 13th	Big Data Processing Systems - I	Disk based big data systems	Google MapReduce
5	Feb 18th	Big Data Processing Systems - II	Expressive programming models for big data	FlumeJava
5	Feb 20th	Big Data Processing Systems - III	Big data programming models	Apache Spark
6	Feb 25th	Development - I	Big data programming model and interfaces	DryadLinq	Homework 2 Due, Homework 3 Released
6	Feb 27th	Development - II	Big data workload generators and code transformations	Casper	Project Teams Setup	Pipegen
7	Mar 4th	Development - III	Runtime optimizations at the big data application layer	PeriScope		Niijima
7	Mar 6th	Runtime Optimization - I	Big data query optimization	Catalyst	Finalize Project Teams
	Mar 11th	Spring Break
	Mar 13th	Spring Break
9	Mar 18th	Runtime Optimization - II	Optimizing big data iterative workloads	Vega	Homework 3 Due, Homework 4 Released	Haloop
9	Mar 20th	Performance Debugging - I	Sources of performance issues—data or CPU	SkewTune
10	Mar 25th	Performance Debugging - II	Performance estimation of big data applications	Ernest
10	Mar 27th	Data Stream Processing - I	Stream processing systems built on top of batch models	MapReduce Online
11	Apr 1st	Data Stream Processing - II	Stream processing systems built on top of batch models	Spark Streaming	Homework 4 Due, Homework 5 Released
11	Apr 3rd	Data Stream Processing - III	Advanced data stream processing	Dataflow
12	Apr 8th	Testing - I	Random and symbolic testing in SQL	CSmith
12	Apr 10th	Testing - II	Testing and verification of Databases	SQLancer
13	Apr 15th	Debugging - I	Large scale data provenance	Titian		RAMP
13	Apr 17th	Debugging - II	Automated debugging and explanation	BigSift	Homework 5 Due	Data Xray
14	Apr 22nd	Configuration Management - I	Big data configuration debugging and tuning	CherryPick
14	Apr 24th	Output Visualization	Output inspection and verification	Wrangler
	Apr 29th	Conference Travel
	May 1st	Conference Travel
16	May 6th	Project Presentations

Grading Policy

50%: Homework/Programming Assignments (5 x 10%). Must be done individually.
30%: Course Project, carried out in the second half of the course: either a research prototype or an end-to-end data pipeline. Must be done in a team of 4 students.
15%: Tool Demonstrations and Discussions. Must be done individually.
5%: Pop Quizzes (5 x 1%).

This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to review a research paper and analyze it in depth will go a long way, but it can be learned during the course. The homework assignments will use the Scala programming language, so familiarity with a functional programming language, or a willingness to learn one before the first homework is released, is essential.