CS 5614 - Big Data Engineering

Engineering Large-scale Data Processing Software


Course Information

Instructor: Muhammad Ali Gulzar
Office: 2224 Knowledgeworks II
Lecture: MW 2:30 PM - 3:45 PM (Virtual Synchronous at Zoom Room)
Office Hours: Tuesday 1 PM - 2 PM at Zoom Room
Optional Textbook: Database System Concepts, 7th Edition, by Avi Silberschatz, Henry F. Korth, and S. Sudarshan


Course Description

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in big data application development, debugging, and testing practices, which have fallen behind DISC framework design and implementation. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a solid understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components will include, but are not limited to:

  • Database fundamentals: languages, operations, and performance
  • Data-intensive scalable computing (DISC), e.g., Apache Spark, Hive, and MapReduce
  • DISC application development: programming, refactoring, optimization, and testing
  • Interactive and automated debugging for big data analytics and their performance
  • Configuration management and runtime optimizations in DISC
  • Data stream processing and incremental computation


Course Schedule


| Week | Lecture | Topic | Description | Reading | Milestones | Optional Reading |
| 1 | Jan 18th | Database Fundamentals | Introduction, logistics, goals, & expectations | | | |
| 2 | Jan 23rd | Database Fundamentals | Relational and dataflow operators, schema, views | Textbook Chapters 2 & 3 | Project Teams Setup | |
| | Jan 25th | Database Fundamentals | Constraints, indexing, and sorting | Textbook Chapters 4 & 5 | Homework 1 Released | |
| 3 | Jan 30th | Database Fundamentals | Transactions, procedures, and query optimization | Textbook Chapters 4 & 5 | Finalize Project Teams, Pick Papers | |
| | Feb 1st | Big Data Processing Systems - I | Disk-based big data systems | Google MapReduce | | |
| 4 | Feb 6th | Big Data Processing Systems - II | Expressive programming models for big data | FlumeJava | | Dryad |
| | Feb 8th | Big Data Processing Systems - III | Big data programming models | Apache Spark | Homework 1 Due, Homework 2 Released | |
| 5 | Feb 13th | Development - I | Big data programming model and interfaces | DryadLINQ | | Boom Analytics |
| | Feb 15th | Development - II | Big data workload generators and code transformations | Casper | | PigLatin, PipeGen |
| 6 | Feb 20th | Development - III | Runtime optimizations, big data application layer | PeriScope | | Symbolic Aggregations, Niijima |
| | Feb 22nd | Data Stream Processing - I | Stream processing systems built on top of batch models | MapReduce Online | Homework 2 Due, Homework 3 Released | Spark Streaming |
| 7 | Feb 27th | Data Stream Processing - II | Advanced data stream processing | Dataflow | | MillWheel |
| | Mar 1st | Testing - I | Random and symbolic testing in SQL | Database Test Generation | | Java PathFinder |
| 8 | Mar 6th | Spring Break | | | | |
| | Mar 8th | Spring Break | | | | |
| 9 | Mar 13th | Testing - II | Testing and verification of big data applications | BigTest | Finalize Projects | Sedge, Olston et al. |
| | Mar 15th | Debugging - I | Data-oriented software debugging | Delta Debugging | Homework 3 Due, Homework 4 Released | WhyLine, Debugging Study |
| 10 | Mar 20th | Debugging - II | Large-scale data provenance | Titian | | NEWT, RAMP |
| | Mar 22nd | Debugging - III | Interactive debugging for big data applications | BigDebug | | Inspector Gadget, BugDoc |
| 11 | Mar 27th | Debugging - IV | Automated debugging and explanation | BigSift | | QFix, Data X-Ray |
| | Mar 29th | Performance Debugging - I | Sources of performance issues: data or CPU | SkewTune | Homework 4 Due | Ousterhout et al. |
| 12 | Apr 3rd | Performance Debugging - II | Performance explanation of big data applications | PerfXplain | | PerfDebug |
| | Apr 5th | Performance Debugging - III | Performance estimation of big data applications | Ernest | | PerfEnforce |
| 13 | Apr 10th | Configuration Management - I | Big data configuration debugging | PCheck | | Tortoise, Dai et al. |
| | Apr 13th | Configuration Management - II | Big data configuration tuning | CherryPick | | Starfish, Aria |
| 14 | Apr 17th | Runtime Optimization - I | Big data query optimization | Catalyst | | |
| | Apr 19th | Runtime Optimization - II | Optimizing big data iterative workloads | Vega | | Giannikis et al., HaLoop |
| 15 | Apr 24th | Output Visualization | Output inspection and verification | Wrangler | | Predictive Interaction |
| | Apr 26th | Incremental Computations | Differential execution | Naiad | | |
| 16 | May 1st | Project Presentations | | | | |
| | May 3rd | Project Presentations | | | | |


Grading Policy



  • 40%: Homework/Programming Assignments (4 × 10%). Must be done individually.
  • 30%: Course Project (second half of the semester). Either a research prototype or an end-to-end data pipeline. Must be done in a team of 4 students.
  • 15%: Paper Presentation and Discussions. Each team of 2 students presents once.
  • 10%: Questions/Discussion/Insights (1 per reading). Submitted via the Canvas Discussion feature.
  • 5%: Pop Quizzes (5 × 1%).

This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to review a paper and analyze it in depth will go a long way, but it can always be learned. The homework assignments will use the Scala programming language. Familiarity with a functional programming language, or willingness to learn one before the release date of the first homework, is essential.