Muhammad Ali Gulzar | CS 5614 - Big Data Engineering

Course Information

Instructor: Muhammad Ali Gulzar
Office: 2224 Knowledgeworks II
Lecture: MW 5:30 PM - 6:45 PM (Virtual Synchronous at Zoom Room)
Office Hours: TR 11 AM - Noon at Zoom Room
Optional Textbook: Database System Concepts, 7th Edition. Avi Silberschatz, Henry F. Korth, and S. Sudarshan.

Course Description

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used only by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which have fallen behind the design and implementation of the DISC frameworks themselves. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a solid understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components include, but are not limited to:

Database fundamentals: languages, operations, and performance
Data-intensive scalable computing (DISC) frameworks, e.g., Apache Spark, Hive, and MapReduce
DISC application development: programming, refactoring, optimization, and testing
Interactive and automated debugging for big data analytics and their performance
Configuration management and runtime optimizations in DISC
Data stream processing and incremental computation

Course Schedule

Week	Lecture	Topic	Description	Reading	Optional Reading
1	Jan 20th	Database Fundamentals	Introduction, logistics, goals, & expectations
2	Jan 25th	Database Fundamentals	Relational and dataflow operators, schema, and views	Chapter 2 and 3
2	Jan 27th	Database Fundamentals	Constraints, indexing, and sorting	Chapter 4 and 5
3	Feb 1st	Database Fundamentals	Transactions, procedures, and query optimization	Chapter 4 and 5
3	Feb 3rd	Big Data Processing Systems I	Disk-based big data systems	Google MapReduce
4	Feb 8th	Big Data Processing Systems II	Expressive programming models for big data	FlumeJava	Dryad
4	Feb 10th	Big Data Processing Systems III	In-memory data processing systems	Apache Spark
5	Feb 15th	Development I	Big data programming model and interfaces	DryadLINQ	PigLatin, Boom Analytics
5	Feb 17th	Development II	Big data workload generators and code transformations	Casper	Pipegen
6	Feb 22nd	Development III	Runtime optimizations at application layer	PeriScope	Symbolic Aggregations, Niijima
6	Feb 24th	Testing I	Random and symbolic testing in SQL	Database Test Generation	Java PathFinder
7	Mar 1st	Testing II	Testing and verification of big data applications	BigTest	Sedge, Olston et al
7	Mar 3rd	Debugging I	Data-oriented software debugging	Delta Debugging	WhyLine, Debugging Study
8	Mar 8th	Debugging II	Large scale data provenance	Titian	NEWT, RAMP
8	Mar 10th	Debugging III	Interactive debugging for big data applications	BigDebug	Inspector Gadget, BugDoc
9	Mar 15th	Debugging IV	Automated debugging and explanation	BigSift	QFix, Data Xray
	Mar 17th	Spring Break Day
10	Mar 22nd	Performance Debugging I	Sources of performance issues: data or CPU	SkewTune	Ousterhout et al
10	Mar 24th	Performance Debugging II	Performance explanation of big data application	PerfXplain	PerfDebug
11	Mar 29th	Performance Debugging III	Performance estimation of big data application	Ernest	PerfEnforce
11	Mar 31st	Configuration Management I	Big data configuration debugging	PCheck	Tortoise, Dai et al
12	Apr 5th	Configuration Management II	Big data configuration tuning	CherryPick	Starfish, Aria
12	Apr 7th	Runtime Optimization I	Big data query optimization	Catalyst
13	Apr 12th	Runtime Optimization II	Optimizing big data iterative workloads	Vega	Giannikis et al, HaLoop
13	Apr 14th	Output Visualization	Output inspection and verification	Wrangler	Predictive Interaction
14	Apr 19th	Incremental Computations	Differential execution	Naiad
14	Apr 21st	Data Stream Processing I	Stream processing systems on batch processing models	MapReduce Online	Spark Streaming
	Apr 26th	Spring Break Day
15	Apr 28th	Data Stream Processing II	Advanced data stream processing	Dataflow	MillWheel
16	May 3rd	Project Presentations
16	May 5th	Project Presentations

Grading Policy

40% - Course Project: second half of the course; either a research prototype or an end-to-end data pipeline.
30% - Short Homeworks/Programming Assignments (3 x 10%): first half of the course; focused on DISC systems.
15% - Paper Presentation and Discussions: frequency depends on class enrollment.
10% - Questions/Discussion/Insights (1 per reading): submitted via the Canvas Discussion feature.
5% - Pop Quizzes (5 x 1%)

This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to review a research paper and analyze it in depth will go a long way, but it is a skill that can be learned during the course. The homework assignments will use the Scala programming language, so familiarity with a functional programming language, or a willingness to learn one before the first homework is released, is essential.