CS 5614 - Big Data Engineering

Engineering Large-scale Data Processing Applications


Course Information

Instructor: Muhammad Ali Gulzar
Office: 4106 Gilbert Place
Lecture: Tue/Thu 8:00 AM - 9:15 AM, in person in TORG 1040
Office Hours: Tuesday 9:30 AM - 10:30 AM
Optional Textbook: Database System Concepts. Avi Silberschatz, Henry F. Korth, and S. Sudarshan. 7th Edition


Course Description

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in big data application development, debugging, and testing practices, which have fallen behind DISC framework design and implementation. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a solid understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components will include, but are not limited to:

  • Database fundamentals: languages, operations, and performance
  • Data-intensive scalable computing (DISC), e.g., Apache Spark, Hive, and MapReduce
  • DISC application development: programming, refactoring, optimization, and testing
  • Interactive and automated debugging for big data analytics and their performance
  • Configuration management and runtime optimizations in DISC
  • Data stream processing and incremental computation


Course Schedule


| Week | Lecture | Topic | Description | Reading | Milestones | Optional Reading |
|------|---------|-------|-------------|---------|------------|------------------|
| 1 | Jan 21st | Database Fundamentals | Introduction, logistics, goals, & expectations | | | |
| | Jan 23rd | Database Fundamentals | Relational and dataflow operators | Chapter 2 & 3 | | |
| 2 | Jan 28th | Database Fundamentals | SQL, Schema, Views | Chapter 4 & 5 | Homework 1 Released | |
| | Jan 30th | Database Fundamentals | Constraints and SQL Operators | Chapter 4 & 5 | Pick Demos | |
| 3 | Feb 4th | Database Fundamentals | Indexing and Sorting | Chapter 12, 13, & 14 | | |
| | Feb 6th | Database Fundamentals | Transactions, procedures, and query optimization | Chapter 15, 16, & 17 | | |
| 4 | Feb 11th | Database Fundamentals | Transactions, procedures, and query optimization | Chapter 15, 16, & 17 | Homework 1 Due, Homework 2 Released | |
| | Feb 13th | Big Data Processing Systems - I | Disk based big data systems | Google MapReduce | | |
| 5 | Feb 18th | Big Data Processing Systems - II | Expressive programming models for big data | FlumeJava | | |
| | Feb 20th | Big Data Processing Systems - III | Big data programming models | Apache Spark | | |
| 6 | Feb 25th | Development - I | Big data programming model and interfaces | DryadLinq | Homework 2 Due, Homework 3 Released | |
| | Feb 27th | Development - II | Big data workload generators and code transformations | Casper | Project Teams Setup | Pipegen |
| 7 | Mar 4th | Development - III | Runtime Optimizations Big data application layer | PeriScope | | Niijima |
| | Mar 6th | Runtime Optimization - I | Big data query optimization | Catalyst | Finalize Project Teams | |
| | Mar 11th | Spring Break | | | | |
| | Mar 13th | Spring Break | | | | |
| 9 | Mar 18th | Runtime Optimization - II | Optimizing big data iterative workloads | Vega | Homework 3 Due, Homework 4 Released | Haloop |
| | Mar 20th | Performance Debugging - I | Sources of performance issues: data or CPU | SkewTune | | |
| 10 | Mar 25th | Performance Debugging - II | Performance estimation of big data applications | Ernest | | |
| | Mar 27th | Data Stream Processing - I | Stream processing systems built on top of batch models | MapReduce Online | | |
| 11 | Apr 1st | Data Stream Processing - II | Stream processing systems built on top of batch models | Spark Streaming | Homework 4 Due, Homework 5 Released | |
| | Apr 3rd | Data Stream Processing - III | Advanced data stream processing | Dataflow | | |
| 12 | Apr 8th | Testing - I | Random and symbolic testing in SQL | CSmith | | |
| | Apr 10th | Testing - II | Testing and verification of databases | SQLancer | | |
| 13 | Apr 15th | Debugging - I | Large scale data provenance | Titian | | RAMP |
| | Apr 17th | Debugging - II | Automated debugging and explanation | BigSift | Homework 5 Due | Data Xray |
| 14 | Apr 22nd | Configuration Management - I | Big data configuration debugging and tuning | CherryPick | | |
| | Apr 24th | Output Visualization | Output inspection and verification | Wrangler | | |
| | Apr 29th | Conference Travel | | | | |
| | May 1st | Conference Travel | | | | |
| 16 | May 6th | Project Presentations | | | | |

Grading Policy



  • 50%: Homework/Programming Assignments (5 x 10%). Must be done individually.
  • 30%: Course Project, in the second half of the semester: either a research prototype or an end-to-end data pipeline. Must be done in a team of 4 students.
  • 15%: Tool Demonstrations and Discussions. Must be done individually.
  • 5%: Pop Quizzes (5 x 1%).

This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to read and critically analyze a research paper will go a long way, but it can be learned along the way. The homework assignments will use the Scala programming language, so familiarity with a functional programming language, or willingness to learn one before the first homework is released, is essential.