CS 5614 - Big Data Engineering
Software Engineering Practices for Large-scale Data Processing Applications.
Course Information
Instructor: Muhammad Ali Gulzar
Office: 2224 Knowledgeworks II
Lecture: MW 5:30 PM - 6:45 PM (Virtual Synchronous at Zoom Room)
Office Hours: TR 11 AM - Noon at Zoom Room
Optional Textbook: Database System Concepts. Avi Silberschatz, Henry F. Korth, and S. Sudarshan. 7th Edition
Course Description
The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks, such as Google MapReduce and Apache Spark, that can fully harness the power of existing data centers. However, frameworks once used only by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices for big data applications, which have fallen behind the design and implementation of the DISC frameworks themselves. In this class, we will discuss several aspects of the development cycle of a big data analytics application running in a cloud computing environment. This course aims to provide a solid understanding of current research in big data systems and hands-on experience with state-of-the-art data engineering tools. The course components will include, but are not limited to:
- Database fundamentals: languages, operations, and performance
- Data-intensive scalable computing (DISC), e.g., Apache Spark, Hive, MapReduce, etc.
- DISC application development: programming, refactoring, optimization, and testing
- Interactive and automated debugging of big data analytics, including their performance
- Configuration management and runtime optimizations in DISC
- Data stream processing and incremental computation
Course Schedule
Week | Lecture | Topic | Description | Reading | Optional Reading |
---|---|---|---|---|---|
1 | Jan 20th | Database Fundamentals | Introduction, logistics, goals, & expectations | | |
2 | Jan 25th | Database Fundamentals | Relational and dataflow operators, schema, and views | Chapters 2 and 3 | |
2 | Jan 27th | Database Fundamentals | Constraints, indexing, and sorting | Chapters 4 and 5 | |
3 | Feb 1st | Database Fundamentals | Transactions, procedures, and query optimization | Chapters 4 and 5 | |
3 | Feb 3rd | Big Data Processing Systems I | Disk-based big data systems | Google MapReduce | |
4 | Feb 8th | Big Data Processing Systems II | Expressive programming models for big data | FlumeJava | Dryad |
4 | Feb 10th | Big Data Processing Systems III | In-memory data processing systems | Apache Spark | |
5 | Feb 15th | Development I | Big data programming model and interfaces | DryadLinq | PigLatin, Boom Analytics |
5 | Feb 17th | Development II | Big data workload generators and code transformations | Casper | Pipegen |
6 | Feb 22nd | Development III | Runtime optimizations at application layer | PeriScope | Symbolic Aggregations, Niijima |
6 | Feb 24th | Testing I | Random and symbolic testing in SQL | Database Test Generation | Java PathFinder |
7 | Mar 1st | Testing II | Testing and verification of big data applications | BigTest | Sedge, Olston et al. |
7 | Mar 3rd | Debugging I | Data-oriented software debugging | Delta Debugging | WhyLine, Debugging Study |
8 | Mar 8th | Debugging II | Large scale data provenance | Titian | NEWT, RAMP |
8 | Mar 10th | Debugging III | Interactive debugging for big data applications | BigDebug | Inspector Gadget, BugDoc |
9 | Mar 15th | Debugging IV | Automated debugging and explanation | BigSift | QFix, Data Xray |
9 | Mar 17th | Spring Break Day | | | |
10 | Mar 22nd | Performance Debugging I | Sources of performance issues: data or CPU | SkewTune | Ousterhout et al. |
10 | Mar 24th | Performance Debugging II | Performance explanation of big data applications | PerfXplain | PerfDebug |
11 | Mar 29th | Performance Debugging III | Performance estimation of big data applications | Ernest | PerfEnforce |
11 | Mar 31st | Configuration Management I | Big data configuration debugging | PCheck | Tortoise, Dai et al. |
12 | Apr 5th | Configuration Management II | Big data configuration tuning | CherryPick | StarFish, Aria |
12 | Apr 7th | Runtime Optimization I | Big data query optimization | Catalyst | |
13 | Apr 12th | Runtime Optimization II | Optimizing big data iterative workloads | Vega | Giannikis et al., Haloop |
13 | Apr 14th | Output Visualization | Output inspection and verification | Wrangler | Predictive Interaction |
14 | Apr 19th | Incremental Computations | Differential execution | Naiad | |
14 | Apr 21st | Data Stream Processing I | Stream processing systems on batch processing models | MapReduce Online | Spark Streaming |
15 | Apr 26th | Spring Break Day | | | |
15 | Apr 28th | Data Stream Processing II | Advanced data stream processing | Dataflow | Millwheel |
16 | May 3rd | Project Presentations | | | |
16 | May 5th | Project Presentations | | | |
Grading Policy
- 40%: Course Project. Takes place in the second half of the course; either a research prototype or an end-to-end data pipeline.
- 30%: Short Homeworks/Programming Assignments (3 x 10%). Take place in the first half of the course and focus on DISC systems.
- 15%: Paper Presentation and Discussions. Frequency depends on class enrollment.
- 10%: Questions/Discussion/Insights (1 per reading). Submitted via the Canvas Discussion feature.
- 5%: Pop Quizzes (5 x 1%).
This course requires familiarity with basic databases, data structures, algorithms, operating systems, and software engineering. The ability to review a research paper and analyze it in depth will go a long way, but it can always be learned. The homework assignments will use the Scala programming language, so familiarity with a functional programming language, or willingness to learn one before the release date of the first homework, is essential.