Muhammad Ali Gulzar | CS 5614 - Big Data Engineering

Course Information

Instructor: Muhammad Ali Gulzar
Office: 2224 Knowledgeworks II
Lecture: MW 2:30 PM - 3:45 PM (Virtual Synchronous at Zoom Room)
Office Hours: Tuesday 1 PM - 2 PM at Zoom Room
Optional Textbook: Database System Concepts. Avi Silberschatz, Henry F. Korth, and S. Sudarshan. 7th Edition

Course Description

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used only by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in big data application development, debugging, and testing practices, which have fallen behind DISC framework design and implementation. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a solid understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components include, but are not limited to:

Database fundamentals: languages, operations, and performance
Data-intensive scalable computing (DISC) frameworks, e.g., Apache Spark, Hive, and MapReduce
DISC application development: programming, refactoring, optimization, and testing
Interactive and automated debugging for big data analytics and their performance
Configuration management and runtime optimizations in DISC
Data stream processing and incremental computation

Course Schedule

Week	Lecture	Topic	Description	Reading	Milestones	Optional Reading
1	Jan 18th	Database Fundamentals	Introduction, logistics, goals, & expectations
2	Jan 23rd	Database Fundamentals	Relational and dataflow operators, schema, views	Textbook Chapter 2 & 3	Project Teams Setup
2	Jan 25th	Database Fundamentals	Constraints, indexing, and sorting	Textbook Chapter 4 & 5	Homework 1 Released
3	Jan 30th	Database Fundamentals	Transactions, procedures, and query optimization	Textbook Chapter 4 & 5	Finalize Project Teams, Pick Papers
3	Feb 1st	Big Data Processing Systems - I	Disk based big data systems	Google MapReduce
4	Feb 6th	Big Data Processing Systems - II	Expressive programming models for big data	FlumeJava		Dryad
4	Feb 8th	Big Data Processing Systems - III	Big Data programming models	Apache Spark	Homework 1 Due, Homework 2 Released
5	Feb 13th	Development - I	Big data programming model and interfaces	DryadLinq		Boom Analytics
5	Feb 15th	Development - II	Big data workload generators and code transformations	Casper		PigLatin, Pipegen
6	Feb 20th	Development - III	Runtime optimizations at the big data application layer	PeriScope		Symbolic Aggregations, Niijima
6	Feb 22nd	Data Stream Processing - I	Stream processing systems built on top of batch models	MapReduce Online	Homework 2 Due, Homework 3 Released	Spark Streaming
7	Feb 27th	Data Stream Processing - II	Advanced data stream processing	Dataflow		MillWheel
7	Mar 1st	Testing - I	Random and symbolic testing in SQL	Database Test Generation		Java PathFinder
	Mar 6th	Spring Break
	Mar 8th	Spring Break
9	Mar 13th	Testing - II	Testing and verification of big data applications	BigTest	Finalize Projects	Sedge, Olston et al.
9	Mar 15th	Debugging - I	Data-oriented software debugging	Delta Debugging	Homework 3 Due, Homework 4 Released	WhyLine, Debugging Study
10	Mar 20th	Debugging - II	Large scale data provenance	Titian		NEWT, RAMP
10	Mar 22nd	Debugging - III	Interactive debugging for big data applications	BigDebug		Inspector Gadget, BugDoc
11	Mar 27th	Debugging - IV	Automated debugging and explanation	BigSift		QFix, Data Xray
11	Mar 29th	Performance Debugging - I	Sources of performance issues: data or CPU	SkewTune	Homework 4 Due	Ousterhout et al.
12	Apr 3rd	Performance Debugging - II	Performance explanation of big data application	PerfXplain		PerfDebug
12	Apr 5th	Performance Debugging - III	Performance estimation of big data application	Ernest		PerfEnforce
13	Apr 10th	Configuration Management - I	Big data configuration debugging	PCheck		Tortoise, Dai et al
13	Apr 12th	Configuration Management - II	Big data configuration tuning	CherryPick		Starfish, Aria
14	Apr 17th	Runtime Optimization - I	Big data query optimization	Catalyst
14	Apr 19th	Runtime Optimization - II	Optimizing big data iterative workloads	Vega		Giannikis et al., HaLoop
15	Apr 24th	Output Visualization	Output inspection and verification	Wrangler		Predictive Interaction
15	Apr 26th	Incremental Computations	Differential execution	Naiad
16	May 1st	Project Presentations
16	May 3rd	Project Presentations

Grading Policy

40% — Homeworks/Programming Assignments (4x 10%). Must be done individually.
30% — Course Project: Second half, either a research prototype or end-to-end data pipeline. Must be done in a team of 4 students.
15% — Paper Presentation and Discussions. Once by a team of 2 students.
10% — Questions/Discussion/Insights (1 per reading): Submitted via Canvas Discussion feature.
05% — Pop Quizzes (5x 1%)

This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to review and analyze a research paper in depth will go a long way, but it can be learned along the way. The homework assignments will use the Scala programming language, so familiarity with a functional programming language, or a willingness to learn one before the first homework is released, is essential.