Muhammad Ali Gulzar

I am an assistant professor in the Computer Science Department at Virginia Tech. I am also an Amazon Scholar at Amazon Web Services. I received my Ph.D. in Computer Science from the University of California, Los Angeles, where I was a Google Ph.D. Fellow from 2017 to 2020.

My research vision is to build systems that improve developer productivity through automated debugging and testing for applications in emerging domains, including data-intensive software such as dataflow programs, ML/AI applications, and scientific analysis software such as computational notebooks. Under these broader goals, I redesign existing software productivity tools for emerging applications in three areas: (1) automated tracking-code localization techniques in web applications, (2) re-engineering testing and debugging for data-intensive applications, and (3) advancing current testing and debugging practices in Federated Learning applications.

My past work has focused on interactive and automated debugging for Apache Spark, symbolic execution based test generation for dataflow programs, and performance debugging in Apache Spark.

gulzar cs.vt.edu | Google Scholar | Github | LinkedIn

News

	Our work on token provenance in Federated Language Models has been accepted to MLSys 2026. Congratulations, Waris!
	My student, Waris Gill, has successfully defended his Ph.D. thesis and has joined Microsoft as a Senior Applied Scientist.
	In my role as an Amazon Scholar, I am excited to share our scientific contributions to the launch of Troubleshooting Agent for Amazon EMR and AWS Glue.
	I am thrilled to receive the 2025-26 Amazon - Virginia Tech Initiative for Efficient and Robust Machine Learning Award for investigating the code comprehensibility of large language models.
	In collaboration with University of Minnesota and UMass Amherst, we received a $1.1 million NSF award for raising the robustness and privacy of Federated Learning systems.
	Our findings on the accessibility of web advertisements are featured in VT News.
	Our work on pathological non-executable notebooks is accepted to MSR 2025—congrats, Tien!
	Our work on semantic cache for LLMs is accepted to IPDPS. Congrats, Waris!
	My student, Haddi, co-authored the 2024 Web Almanac’s Privacy Chapter.
	Our research on rare-path coverage and evidence-based tech hiring has been accepted to SANER 2025.
	Older news

Publications

2026

[MLSys 2026] ProToken: Token-level Attribution for Federated Large Language Models

Waris Gill, Ahmad Humayun, Ali Anwar, and Muhammad Ali Gulzar

Proceedings of Machine Learning and Systems (MLSys) 2026

Abstract

Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel provenance methodology for token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.
[ICPC 2026] Path-aware LLM-based Test Generation with Comprehension

Yaoxuan Wu, Xiaojie Zhou, Ahmad Humayun, Muhammad Ali Gulzar, and Miryung Kim

The 34th IEEE/ACM International Conference on Program Comprehension. 2026

Abstract

Symbolic execution is a widely used technique for test generation, offering systematic exploration of program paths through constraint solving. However, it is fundamentally constrained by the capability to model the target code, including library functions, in terms of symbolic constraints and by the capability of the underlying constraint solvers. As a result, many paths involving complex features remain unanalyzed or insufficiently modeled. Recent advances in large language models (LLMs) have shown promise in generating diverse and valid test inputs. Yet, LLMs lack mechanisms for systematically enumerating program paths and often fail to cover subtle corner cases. We observe that directly prompting an LLM with the full program leads to missed coverage of interesting paths. In this paper, we present PALM, a test generation system that combines symbolic path enumeration with LLM-assisted test generation. PALM statically enumerates possible paths through AST-level analysis and transforms each into an executable variant with embedded assertions that specify the target path. This avoids the need to translate path constraints into SMT formulae, by instead constructing program variants that an LLM can interpret. Importantly, PALM is the first to provide an interactive frontend that visualizes path coverage alongside generated tests, assembling tests based on the specific paths they exercise. A user study with 12 participants demonstrates that PALM’s frontend helps users better understand path coverage and identify which paths are actually exercised by PALM-generated tests, through verification and visualization of their path profiles.
[Fuzzing 2026] TenSure: Fuzzing Sparse Tensor Compilers

Kabilan Mahathevan, Yining Zhang, Muhammad Ali Gulzar, and Kirshanthan Sundararajah

The 5th International Fuzzing Workshop (FUZZING), co-located with the Network and Distributed System Security Symposium (NDSS). 2026

Abstract

Sparse Tensor Compilers (STCs) have emerged as critical infrastructure for optimizing high-dimensional data analytics and machine learning workloads. The STCs must synthesize complex, irregular control flow for various compressed storage formats directly from high-level declarative specifications, thereby making them highly susceptible to subtle correctness defects. Existing testing frameworks, which rely on mutating computation graphs restricted to a standard vocabulary of operators, fail to exercise the arbitrary loop synthesis capabilities of these compilers. Furthermore, generic grammar-based fuzzers struggle to generate valid inputs due to the strict rules governing how indices are reused across multiple tensors. In this paper, we present TenSure, the first extensible black-box fuzzing framework specifically designed for the testing of STCs. TenSure leverages Einstein Summation (Einsum) notation as a general input abstraction, enabling the generation of complex, unconventional tensor contractions that expose corner cases in the code-generation phases of STCs. We propose a novel constraint-based generation algorithm that guarantees 100% semantic validity of synthesized kernels, significantly outperforming the ~3.3% validity rate of baseline grammar fuzzers. To enable differential testing without a trusted reference, we introduce a set of semantic-preserving mutation operators that exploit algebraic commutativity and heterogeneity in storage formats. Our evaluation on two state-of-the-art systems, TACO and Finch, reveals widespread fragility, particularly in TACO, where TenSure exposed crashes or silent miscompilations in a majority of generated test cases. These findings underscore the critical need for specialized testing tools in the sparse compilation ecosystem.
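
The idea of a semantic-preserving mutation based on algebraic commutativity can be illustrated with a small sketch. This is not TenSure's implementation: the helper names `einsum2` and `swap_operands` are hypothetical, the evaluator is a dense, two-operand reference written only for illustration, and a real campaign would run both variants through the compiler under test rather than through this evaluator.

```python
from itertools import product


def einsum2(spec, a, b):
    """Evaluate a two-operand einsum over dense nested lists (illustrative only)."""
    inputs, out = spec.split("->")
    sub_a, sub_b = inputs.split(",")
    # Infer the extent of every index from the operand shapes.
    sizes = {}
    for subs, tensor in ((sub_a, a), (sub_b, b)):
        level = tensor
        for idx in subs:
            sizes[idx] = len(level)
            level = level[0]
    indices = sorted(sizes)
    result = {}
    # Visit every assignment of index values and accumulate products
    # into the output entry selected by the output subscripts.
    for values in product(*(range(sizes[i]) for i in indices)):
        env = dict(zip(indices, values))
        x, y = a, b
        for idx in sub_a:
            x = x[env[idx]]
        for idx in sub_b:
            y = y[env[idx]]
        key = tuple(env[i] for i in out)
        result[key] = result.get(key, 0) + x * y
    return result


def swap_operands(spec):
    """Mutation: swap the two input subscripts (the operands must be swapped too)."""
    inputs, out = spec.split("->")
    sub_a, sub_b = inputs.split(",")
    return f"{sub_b},{sub_a}->{out}"


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
spec = "ij,jk->ik"

original = einsum2(spec, A, B)
mutated = einsum2(swap_operands(spec), B, A)
# Multiplication commutes, so a disagreement here would signal a defect.
assert original == mutated
```

Integer inputs keep the comparison exact; with floating-point data the two results would need to be compared within a tolerance.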

2025

[ICSE 2025] Accessibility Issues in Ad-Driven Web Applications

Abdul Haddi Amjad, Muhammad Danish, Bless Jah, and Muhammad Ali Gulzar

The 47th IEEE/ACM International Conference on Software Engineering. 2025

Abstract

Website accessibility is essential for inclusiveness and regulatory compliance. Although third-party advertisements (ads) are a vital revenue source for free web services, they introduce significant accessibility challenges. Leasing a website’s space to ad-serving technologies like DoubleClick results in developers losing control over ad content accessibility. Even on highly accessible websites, third-party ads can undermine adherence to Web Content Accessibility Guidelines (WCAG). We conduct the first large-scale investigation of 430K website elements, including nearly 100K ad elements, to understand the accessibility of ads on websites. We seek to understand the prevalence of inaccessible ads and their overall impact on the accessibility of websites. Our findings show that 67% of websites experience increased accessibility violations due to ads, with common violations including Focus Visible and On Input. Popular ad-serving technologies like Taboola, DoubleClick, and RevContent often serve ads that fail to comply with WCAG standards. Even when ads are WCAG compliant, 27% of them have alternative text in ad images that misrepresents information, potentially deceiving users. Manual inspection of a sample of these misleading ads revealed that user-identifiable data is collected on 94% of websites through interactions, such as hovering or pressing enter. Since users with disabilities often rely on tools like screen readers that require hover events to access website content, they have no choice but to compromise their privacy in order to navigate website ads. Based on our findings, we further dissect the root cause of these violations and provide design guidelines to both website developers and ad-serving technologies to achieve WCAG-compliant ad integration.
[ICSE 2025] TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron Provenance

Waris Gill, Ali Anwar, and Muhammad Ali Gulzar

The 47th IEEE/ACM International Conference on Software Engineering. 2025

Abstract

In Federated Learning, clients train models on local data and send updates to a central server, which aggregates them into a global model using a fusion algorithm. This collaborative yet privacy-preserving training comes at a cost—FL developers face significant challenges in attributing global model predictions to specific clients. Localizing responsible clients is a crucial step towards (a) excluding clients primarily responsible for incorrect predictions and (b) encouraging clients who contributed high quality models to continue participating in the future. Existing ML explainability approaches are inherently inapplicable as they are designed for single-model, centralized training. We introduce TraceFL, a fine-grained neuron provenance capturing mechanism that identifies clients responsible for the global model’s prediction by tracking the flow of information from individual clients to the global model. Since inference on different inputs activates a different set of neurons of the global model, TraceFL dynamically quantifies the significance of the global model’s neurons in a given prediction. It then selectively picks a slice of the most crucial neurons in the global model and maps them to the corresponding neurons in every participating client to determine each client’s contribution, ultimately localizing the responsible client. We evaluate TraceFL on six datasets, including two real-world medical imaging datasets and four neural networks, including advanced models such as GPT. TraceFL achieves 99% accuracy in localizing the responsible client in FL tasks spanning both image and text classification tasks. At a time when state-of-the-art ML debugging approaches are mostly domain-specific (e.g., image classification only), TraceFL is the first technique to enable highly accurate automated reasoning across a wide range of FL applications.
[MSR 2025] Are the Majority of Public Computational Notebooks Pathologically Non-Executable? (EMSE Special Issue Invitee)

Tien Nguyen, Waris Gill, and Muhammad Ali Gulzar

The 22nd IEEE/ACM International Conference on Mining Software Repositories. 2025

Abstract

Computational notebooks are the de facto platforms for exploratory data science, offering an interactive programming environment where users can create, modify, and execute code cells in any sequence. However, this flexibility often introduces code quality issues, with prior studies showing that approximately 76% of public notebooks are non-executable, raising significant concerns about reusability. We argue that the traditional notion of executability, which requires a notebook to run fully and without error, is overly rigid, misclassifying many notebooks and overestimating their non-executability. This paper investigates pathological executability issues in public notebooks under varying notions and degrees of executability. Notebooks, by construction, are incrementally and interactively executed, where each cell execution advances logic toward the notebook’s goal. Even partially improving executability can improve code comprehension and offer a pathway for dynamic analyses. With this insight, we first categorize notebooks into potentially restorable and pathological non-executable notebooks and then measure how removing misconfiguration and superficial execution issues in notebooks can improve their executability (i.e., additional cells executed without error). For instance, we use a Large Language Model (LLM) to generate synthetic input data to restore non-executable notebooks with “FileNotFound” errors. In a dataset of 42,546 popular public notebooks, containing 34,659 non-executable notebooks, only 21.3% are truly pathologically non-executable. For restorable notebooks, LLM-based methods fully restore 5.4% of previously non-executable notebooks. Among the partially restored notebooks, executability improves by 42.7% and 28% through installing the correct modules and generating synthetic data, respectively. These findings challenge prior assumptions, suggesting that notebooks have higher executability than previously reported, that many of them offer valuable partial execution, and that their executability should be evaluated within the interactive notebook paradigm rather than through traditional software executability standards.
[IPDPS 2025] MeanCache: User-Centric Semantic Caching for LLM Web Services

Waris Gill, Mohamed Elidrisi, Pallavi Kalapatapu, Ammar Ahmed, Ali Anwar, and Muhammad Ali Gulzar

The 39th IEEE International Parallel & Distributed Processing Symposium. 2025

Abstract

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, where inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries, which constitute about 31% of the total queries. However, existing caching methods can neither find semantic similarities among LLM queries nor operate on contextual queries, leading to unacceptable false hit-and-miss rates. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user’s semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model without violating user privacy. By placing a local cache in each user’s device and using FL, MeanCache reduces the latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache also encodes context chains for every cached query, offering a simple yet highly effective mechanism to discern contextual query responses from standalone ones. Our experiments, benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions while performing even better on contextual queries. It also reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%.
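
The general shape of a semantic cache lookup can be sketched as follows. This is a minimal illustration, not MeanCache itself: the `SemanticCache` class, the bag-of-words `embed` function, and the 0.8 threshold are all assumptions made for this sketch, whereas the paper trains a query similarity model with Federated Learning and additionally tracks context chains, which are not modeled here.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a learned query encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Local cache that answers a query if a similar one was seen before."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        """Return the cached response of the most similar query, or None on a miss."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("what is the capital of france ?")   # near-duplicate query
miss = cache.get("how do i bake bread")              # unrelated query
```

On a miss, a caller would query the LLM and store the new response with `put`. The threshold trades false hits against false misses, which is the decision the abstract's F-score and precision figures measure.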
[SANER 2025] A Metric for Measuring the Impact of Rare Paths on Program Coverage

Leo St. Amour, Eli Tilevich, and Muhammad Ali Gulzar

The IEEE International Conference on Software Analysis, Evolution and Reengineering. 2025

Abstract

Fuzzing has become a popular technique for discovering bugs and vulnerabilities. To increase the probability of finding bugs, developers should apply fuzzers that maximize program coverage. Program coverage typically measures the percentage of program lines or branches a fuzzer executes. However, these metrics fail to communicate the value of hitting an individual line, branch, or path. Many bugs manifest only within non-trivial control flows. To improve software quality, fuzzing non-trivial program paths should be more important than fuzzing trivial ones. This paper introduces rare-path coverage (RP-Coverage), a novel program coverage metric to convey the value of discovering an unlikely control flow path. We have developed a new technique for estimating the probability of taking an execution path. Our technique relies on probabilistic logic programming to declaratively express the logic for constructing and analyzing a probabilistic control flow graph. We empirically evaluate the fitness of RP-Coverage as a metric for measuring fuzzing efficacy. Our experiments confirm that defects along rare paths—intuitively—substantially impact the effectiveness of fuzzers, while existing fuzzing metrics fail to convey that significance. Our evaluation demonstrates that the value of uncovering an unlikely path is better reflected by increases in RP-Coverage than existing metrics. Specifically, we observe an average increase of up to 49.5%, 11.1%, and 15.4% for RP-Coverage, line coverage, and branch coverage, respectively. This finding indicates that RP-Coverage is more elastic to path probabilities and thus more effectively quantifies a fuzzer’s ability to discover unlikely program paths. As such, RP-Coverage demonstrates promise as a program coverage metric that enhances fuzzer fitness measures when supplementing standard criteria.
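
To make the notion of an unlikely path concrete, the sketch below computes path probabilities on a small probabilistic control flow graph. It is an illustration under stated assumptions, not the RP-Coverage definition or the paper's probabilistic logic programming technique: the graph, the edge probabilities, the 0.05 threshold, and the function names are hypothetical, and only acyclic paths are enumerated.

```python
def path_probabilities(cfg, entry, exit_node):
    """Enumerate acyclic entry-to-exit paths with the product of their edge probabilities."""
    results = []

    def walk(node, path, prob):
        if node == exit_node:
            results.append((tuple(path), prob))
            return
        for succ, p in cfg.get(node, []):
            if succ not in path:  # skip back edges to keep paths acyclic
                walk(succ, path + [succ], prob * p)

    walk(entry, [entry], 1.0)
    return results


def rare_paths(paths, threshold):
    """Select paths whose probability falls below the (assumed) rarity threshold."""
    return [(path, prob) for path, prob in paths if prob < threshold]


# Hypothetical CFG: node -> list of (successor, branch probability).
cfg = {
    "entry": [("check", 1.0)],
    "check": [("common", 0.95), ("rare", 0.05)],
    "common": [("exit", 1.0)],
    "rare": [("deep", 0.1), ("exit", 0.9)],
    "deep": [("exit", 1.0)],
}

paths = path_probabilities(cfg, "entry", "exit")
rare = rare_paths(paths, threshold=0.05)
```

Here the path through `deep` has probability 0.05 × 0.1, far below the common path's 0.95, yet line or branch coverage would count its blocks the same as any others.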
[SANER 2025] Improving Evidence-Based Tech Hiring with GitHub-Supported Resume Matching

Swanand Vaishampayan, Muhammad Ali Gulzar, and Chris Brown

The IEEE International Conference on Software Analysis, Evolution and Reengineering. 2025

Abstract

Current hiring practices use keyword-based resume matching on proxies for technical and soft skills via Automated Resume Parsers (ARPs). However, this process fails to extract the actual abilities of candidates, such as the quality of their written code, raising concerns regarding the effectiveness of current approaches. Thus, novel evidence accurately depicting candidates’ skills is necessary to inform hiring decisions. We posit GitHub-supported resume matching as a solution, mining data from candidates’ open source projects to provide evidence for their technical skills. We conducted a preliminary survey (n = 48) to gain insights from candidates and recruiters on proxies from GitHub projects indicative of technical abilities. We found both groups preferred metrics regarding code quality and used technologies, and there was overwhelming willingness to incorporate this analysis in resume matching tasks. Based on these insights, we designed ‘GitMeter’, a tool to capture technical abilities (i.e., code quality) and soft skills of candidates by mining public GitHub repositories. GitMeter uses a novel heuristic-based approach to find the most accurate code quality approximation for candidate-written code (core code), minimizing the time and computational overhead. Finally, we evaluate the effectiveness and potential impact of ‘GitMeter’ through a user study (n = 20) with developers and recruiters. Our findings provide implications for future tools and methods aiming to promote evidence-based hiring in software engineering (SE) contexts.

2024

[CCS 2024] Blocking Tracking JavaScript at the Function Granularity (Distinguished Artifact Award)

Abdul Haddi Amjad, Shaoor Munir, Zubair Shafiq, and Muhammad Ali Gulzar

The 31st ACM Conference on Computer and Communications Security. 2024

Abstract

Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy-enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose NoT.JS, a fine-grained JavaScript blocking tool that operates at the function-level granularity. NoT.JS’s strengths lie in analyzing the dynamic execution context, including the call stack and calling context of each JavaScript function, and then encoding this context to build a rich graph representation. NoT.JS trains a supervised machine learning classifier on a webpage’s graph representation to first detect tracking at the JavaScript function-level and then automatically generate surrogate scripts that preserve functionality while removing tracking. Our evaluation of NoT.JS on the top-10K websites demonstrates that it achieves high precision (94%) and recall (98%) in detecting tracking JavaScript functions, outperforming the state-of-the-art while being robust against off-the-shelf JavaScript obfuscation. Fine-grained detection of tracking functions allows NoT.JS to automatically generate surrogate scripts that remove tracking JavaScript functions without causing major breakage. Our deployment of NoT.JS shows that mixed scripts are present on 62.3% of the top-10K websites, with 70.6% of the mixed scripts being third-party that engage in tracking activities such as cookie ghostwriting. We share a sample of the tracking functions detected by NoT.JS within mixed scripts (not currently on filter lists) with filter list authors, who confirm that these scripts are not blocked due to potential functionality breakage, despite being known to implement tracking.

[FSE 2024] DeSQL: Interactive Debugging of SQL in DISC

Sabaat Haroon, Chris Brown, and Muhammad Ali Gulzar

The ACM International Conference on the Foundations of Software Engineering. 2024

Abstract

Data-intensive scalable computing (DISC) frameworks, such as Apache Spark, support runtimes in many popular languages. Yet, SQL is still the most commonly used front-end language for DISC applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial, manual effort to comprehend every stage of the query execution and find the root cause of bugs among the input data and complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. When a SQL query is executed on a DISC system, DeSQL automatically decomposes it into subqueries and closely monitors the execution to identify the precise intermediate data corresponding to every constituent subquery. This enables a complete interactive debugging experience with full access to the intermediate query states. We evaluate DeSQL’s scalability, overhead, and efficiency against two baselines. The experiment results show that DeSQL can provide a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark’s scale-out and scale-up properties. Through a user study comprising 10 participants engaged in two debugging tasks, we find that participants utilizing DeSQL identify the root cause behind a wrong query output in 75% less time than the de-facto, manual debugging.
[FSE 2024] Natural Symbolic Execution-based Testing for Big Data Analytics

Yaoxuan Wu, Ahmad Humayun, Muhammad Ali Gulzar, and Miryung Kim

The ACM International Conference on the Foundations of Software Engineering. 2024

Abstract

Symbolic execution is an automated test input generation technique that models individual program paths as logical constraints. However, the realism of concrete test inputs generated by SMT solvers often comes into question. Existing symbolic execution tools only seek arbitrary solutions for given path constraints. These constraints do not incorporate the naturalness of inputs that observe statistical distributions, range constraints, or preferred string constants. This results in unnatural-looking inputs that fail to emulate real-world data. In this paper, we extend symbolic execution with consideration for incorporating naturalness. Our key insight is that users typically understand the semantics of program inputs, such as the distribution of height or possible values of zipcode, that can be leveraged to advance the ability of symbolic execution to produce natural test inputs. We instantiate this idea in NaturalSym, a symbolic execution-based test generation tool for data-intensive scalable computing (DISC) applications. NaturalSym generates natural-looking data that mimics real-world distributions by utilizing user-provided input semantics to drastically enhance the naturalness of inputs while preserving strong bug-finding potential. On custom DISC applications and commercial big data test benchmarks, NaturalSym achieves a higher degree of realism, as evidenced by a perplexity score 994 points lower, and detects 1.29× as many logical bugs as the state-of-the-art symbolic executor for DISC, BigTest. This is because BigTest draws inputs purely based on the satisfiability of path constraints constructed from branch predicates, while NaturalSym is able to draw natural concrete values based on user-specified semantics and prioritize using these values in input generation. NaturalSym is the first symbolic executor that combines the notion of input naturalness in symbolic path constraints during SMT-based input generation.
[NAACL 2024] Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking

Hong Jin Kang, Fabrice Harel-Canada, Muhammad Ali Gulzar, Nanyun Peng, and Miryung Kim

Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2024
[CCS 2024] How Do Visually Impaired Users Navigate Accessibility Challenges in an Ad-Driven Web

Abdul Haddi Amjad and Muhammad Ali Gulzar

Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. Poster Track 2024

Abstract

Website accessibility is crucial for inclusiveness and regulatory compliance. While third-party advertisements (ads) are essential for funding free web services, they pose significant accessibility challenges. When developers lease space to ad-serving technologies like DoubleClick, they lose control over the accessibility of ad content. Even highly accessible websites can have their adherence to Web Content Accessibility Guidelines (WCAG) undermined by third-party ads. We conduct an investigation into the accessibility of ads across 430K website elements, including nearly 100K ad elements. Our study aims to evaluate the prevalence of inaccessible ads and their impact on overall website accessibility. Our findings reveal that 67% of websites experience increased accessibility violations due to ads, with common issues including Focus Visible (WCAG 2.4.7) and On Input (WCAG 3.2.2). Ad-serving technologies such as Taboola, DoubleClick, and RevContent frequently serve ads that do not meet WCAG standards. Inaccessible ads can significantly increase privacy risks for users with disabilities, as these ads may force them to engage with potentially unsafe or misleading content without proper accessibility features to protect their information.

2023

[ASE 2023] NaturalFuzz: Natural Input Generation for Big Data Analytics

Ahmad Humayun, Yaoxuan Wu, Miryung Kim, and Muhammad Ali Gulzar

The 38th IEEE/ACM International Conference on Automated Software Engineering. 2023

Abstract

Fuzzing applies input mutations iteratively with the sole goal of finding more bugs, resulting in synthetic tests that tend to lack realism. Big data analytics are expected to ingest real-world data as input. Therefore, when synthetic test data are not easily comprehensible, they are less likely to facilitate the downstream task of fixing errors. Our position is that fuzzing in this domain must achieve both high naturalness and high code coverage. We propose a new natural synthetic test generation tool for big data analytics, called NaturalFuzz. It generates unstructured, semi-structured, and structured data with corresponding semantics such as ‘zipcode’ and ‘age’. The key insights behind NaturalFuzz are two-fold. First, though existing test data may be small and lack coverage, we can grow this data to increase code coverage. Second, we can strategically mix constituent parts across different rows and columns to construct new realistic synthetic data by leveraging fine-grained data provenance. On commercial big data application benchmarks, NaturalFuzz achieves an additional 19.9% coverage and detects 1.9× more faults than an ML-based synthetic data generator SDV when generating comparably sized inputs. This is because an ML-based synthetic data generator does not consider which code branches are exercised by which input rows from which tables, while NaturalFuzz is able to select input rows that have a high potential to increase code coverage and mutate the selected data towards unseen, new program behavior. NaturalFuzz’s test data is more realistic than the test data generated by two baseline fuzzers (BigFuzz and Jazz), while increasing code coverage and fault detection potential. NaturalFuzz is the first fuzzing methodology with three benefits: (1) exclusively generate natural inputs, (2) fuzz multiple input sources simultaneously, and (3) find deeper semantic faults.
[ESEC/FSE 2023] Co-Dependence Aware Fuzzing for Dataflow-based Big Data Analytics

Ahmad Humayun, Miryung Kim, and Muhammad Ali Gulzar

ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2023

Abstract

Data-intensive scalable computing has become popular due to the increasing demands of analyzing big data. For example, Apache Spark and Hadoop allow developers to write dataflow-based applications with user-defined functions to process data with custom logic. Testing such applications is difficult. (1) These applications often take multiple datasets as input. (2) Unlike in SQL, there is no explicit schema for these datasets and each unstructured (or semi-structured) dataset is segmented and parsed at runtime. (3) Dataflow operators (e.g., join) create implicit co-dependence constraints between the fields of multiple datasets. An efficient and effective testing technique must analyze co-dependence among different regions of multiple datasets at the level of rows and columns and orchestrate input mutations jointly on co-dependent regions. We propose DepFuzz to increase the effectiveness and efficiency of fuzz testing dataflow-based big data applications. The key insight behind DepFuzz is twofold. It keeps track of which code segments operate on which datasets, which rows, and which columns. By analyzing the use of dataflow operators (e.g., join and groupBy) in tandem with the semantics of UDFs, DepFuzz generates test data that subsequently reach hard-to-reach regions of the application code. In real-world big data applications, DepFuzz finds 3.4× more faults, achieving 29% more statement coverage in half the time of Jazzer, a state-of-the-art commercial fuzzer for Java bytecode. It outperforms prior DISC testing by exposing deeper semantic faults beyond simpler input formatting errors, especially when multiple datasets have complex interactions through dataflow operators.
[ICSE 2023] FedDebug: Systematic Debugging for Federated Learning Applications

Waris Gill, Ali Anwar, and Muhammad Ali Gulzar

The ACM/IEEE 45th International Conference on Software Engineering 2023

Abstract

In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. Impermissibility to access clients’ data and collaborative training make FL appealing for applications with data-privacy concerns, such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model’s performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model’s accuracy or let future FL rounds retune the model, which are time-consuming and costly. We design a systematic fault localization framework, FEDDEBUG, that advances the FL debugging on two novel fronts. First, FEDDEBUG enables interactive debugging of realtime collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FEDDEBUG’s breakpoint can help inspect an FL state (round, client, and global model) and move between rounds and clients’ models seamlessly, enabling a fine-grained step-by-step inspection. Second, FEDDEBUG automatically identifies the client(s) responsible for lowering the global model’s performance without any testing data and labels, both of which are essential for existing debugging techniques. FEDDEBUG’s strengths come from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FEDDEBUG achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FEDDEBUG’s interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round’s training time. With FEDDEBUG, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.
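The idea of comparing clients by their neuron activations on a shared input can be illustrated with a small sketch. This is a simplified illustration and not FedDebug's actual algorithm: the function name `most_deviating_client`, the activation threshold, and the majority-vote comparison are all assumptions made for this example.

```python
def most_deviating_client(activations, threshold=0.0):
    """Flag the client whose activated-neuron pattern differs most from the majority.

    activations: dict mapping a (hypothetical) client id to a list of neuron
    activation values, all recorded on the same input.
    """
    # Binarize each client's activations: True means the neuron fired.
    patterns = {c: [v > threshold for v in vals] for c, vals in activations.items()}
    num_clients = len(patterns)
    num_neurons = len(next(iter(patterns.values())))
    # A neuron is "normally active" if a strict majority of clients activate it.
    majority = [
        sum(p[i] for p in patterns.values()) * 2 > num_clients
        for i in range(num_neurons)
    ]
    # Hamming distance between each client's pattern and the majority pattern.
    distance = {
        c: sum(a != b for a, b in zip(p, majority)) for c, p in patterns.items()
    }
    suspect = max(distance, key=distance.get)
    return suspect, distance


clients = {
    "A": [0.9, 0.0, 0.4, 0.0],
    "B": [0.8, 0.0, 0.3, 0.0],
    "C": [0.7, 0.0, 0.5, 0.0],
    "D": [0.0, 0.6, 0.0, 0.9],  # behaves differently from the rest
}
suspect, distance = most_deviating_client(clients)
```

Note that the comparison needs no labels: only the disagreement between clients on the same input is used.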
[PETS 2023] Blocking JavaScript without Breaking the Web: An Empirical Investigation

Abdul Haddi Amjad, Zubair Shafiq, and Muhammad Ali Gulzar

Proceedings on Privacy Enhancing Technologies Symposium 2023

Abstract

Modern websites heavily rely on JavaScript (JS) to implement legitimate functionality as well as privacy-invasive advertising and tracking. Browser extensions such as NoScript block any script not loaded by a trusted list of endpoints, thus hoping to block privacy-invasive scripts while avoiding breaking legitimate website functionality. In this paper, we investigate whether blocking JS on the web is feasible without breaking legitimate functionality. To this end, we conduct a large-scale measurement study of JS blocking on 100K websites. We evaluate the effectiveness of different JS blocking strategies in tracking prevention and functionality breakage. Our evaluation relies on quantitative analysis of network requests and resource loads as well as manual qualitative analysis of visual breakage. First, we show that while blocking all scripts is quite effective at reducing tracking, it significantly degrades functionality on approximately two-thirds of the tested websites. Second, we show that selective blocking of a subset of scripts based on a curated list achieves a better tradeoff. However, there remain approximately 15% “mixed” scripts, which essentially merge tracking and legitimate functionality and thus cannot be blocked without causing website breakage. Finally, we show that finer-grained blocking of a subset of JS methods, instead of scripts, reduces major breakage by 3.7× while providing the same level of tracking prevention. Our work highlights the promise and open challenges in finer-grained JS blocking for tracking prevention without breaking the web.
[SE4SafeML 2023] FedDefender: Backdoor Attack Defense in Federated Learning

Waris Gill, Ali Anwar, and Muhammad Ali Gulzar

Proceedings of the 1st International Workshop on Dependability and Trustworthiness of Safety-Critical Systems with Machine Learned Components 2023

Abstract

Federated Learning (FL) is a privacy-preserving distributed machine learning technique that enables individual clients (e.g., user participants, edge devices, or organizations) to train a model on their local data in a secure environment and then share the trained model with an aggregator to build a global model collaboratively. In this work, we propose FedDefender, a defense mechanism against targeted poisoning attacks in FL by leveraging differential testing. FedDefender first applies differential testing on clients’ models using a synthetic input. Instead of comparing the output (predicted label), which is unavailable for synthetic input, FedDefender fingerprints the neuron activations of clients’ models to identify a potentially malicious client containing a backdoor. We evaluate FedDefender using MNIST and FashionMNIST datasets with 20 and 30 clients, and our results demonstrate that FedDefender effectively mitigates such attacks, reducing the attack success rate (ASR) to 10% without deteriorating the global model performance.

2022

[ASE 2022] Detecting Build Conflicts in Software Merge for Java Programs via Static Analysis

Sheikh Towqir, Bowen Shen, Muhammad Ali Gulzar, and Na Meng

The 37th IEEE/ACM International Conference on Automated Software Engineering 2022

Abstract

In software merge, the edits from different branches can textually overlap (i.e., textual conflicts) or cause build and test errors (i.e., build and test conflicts), jeopardizing programmer productivity and software quality. Existing tools primarily focus on textual conflicts; few tools detect higher-order conflicts (i.e., build and test conflicts). However, existing detectors of build conflicts are limited. Due to their heavy usage of automatic build, current detectors (e.g., Crystal) only report build errors instead of identifying the root causes; developers have to manually locate conflicting edits. These detectors only help when the branches-to-merge have no textual conflict. We present a new static analysis-based approach Bucond (“build conflict detector”). Given three code versions in a merging scenario: base b, left l, and right r, Bucond models each version as a graph, and compares graphs to extract entity-related edits (e.g., class renaming) in l and r. We believe that build conflicts occur when certain edits are co-applied to related entities between branches. Bucond realizes this insight via pattern matching to identify any cross-branch edit combination that can trigger build conflicts (e.g., one branch adds a reference to field F while the other branch removes F). We systematically explored and devised 57 patterns, covering 97% of the build conflicts in our experiments. Our evaluation shows Bucond to complement build-based detectors, as it (1) detects conflicts with 100% precision and 88%–100% recall, (2) locates conflicting edits, and (3) works well when those detectors do not.
[TOSEM 2022] A Characterization Study of Merge Conflicts in Java Projects

Bowen Shen, Muhammad Ali Gulzar, Fei He, and Na Meng

2022

Abstract

In collaborative software development, programmers create software branches to add features and fix bugs tentatively, and then merge branches to integrate edits. When edits from different branches textually overlap (i.e., textual conflicts) or lead to compilation and runtime errors (i.e., build and test conflicts), it is challenging for developers to remove such conflicts. Prior work proposed tools to detect and solve conflicts. They investigate how conflicts relate to code smells and the software development process. However, many questions are still not fully investigated, such as what types of conflicts exist in real-world applications and how developers or tools handle them. For this paper, we used automated textual merge, compilation, and testing to reveal 3 types of conflicts in 208 open-source repositories: textual conflicts, build conflicts (i.e., conflicts causing build errors), and test conflicts (i.e., conflicts triggering test failures). We manually inspected 538 conflicts and their resolutions to characterize merge conflicts from different angles. Our analysis revealed three interesting phenomena. First, higher-order conflicts (i.e., build and test conflicts) are harder to detect and resolve, while existing tools mainly focus on textual conflicts. Second, developers manually resolved most higher-order conflicts by applying similar edits to multiple program locations; their conflict resolutions share common editing patterns implying great opportunities for future tool design. Third, developers resolved 64% of true textual conflicts by keeping complete edits from either a left or right branch. Unlike prior studies, our research for the first time thoroughly characterizes three types of conflicts, with a special focus on higher-order conflicts and limitations of existing tool design. Our work will shed light on future research of software merge.
[ACL 2022] Sibylvariant Transformations for Robust Text Classification

Fabrice Harel-Canada, Muhammad Ali Gulzar, Nanyun Peng, and Miryung Kim

In 60th Annual Meeting of the Association for Computational Linguistics 2022

16 Pages.

Abstract

The vast majority of text transformation techniques in NLP are inherently limited in their ability to expand input space coverage due to an implicit constraint to preserve the original class label. In this work, we propose the notion of sibylvariance (SIB) to describe the broader set of transforms that relax the label-preserving constraint, knowably vary the expected class, and lead to significantly more diverse input distributions. We offer a unified framework to organize all data transformations, including two types of SIB: (1) Transmutations convert one discrete kind into another, (2) Mixture Mutations blend two or more classes together. To explore the role of sibylvariance within NLP, we implemented 41 text transformations, including several novel techniques like Concept2Sentence and SentMix. Sibylvariance also enables a unique form of adaptive training that generates new input mixtures for the most confused class pairs, challenging the learner to differentiate with greater nuance. Our experiments on six benchmark datasets strongly support the efficacy of sibylvariance for generalization performance, defect detection, and adversarial robustness.

2021

[SOCC 2021] OptDebug: Fault-Inducing Operation Isolation for Dataflow Applications

Muhammad Ali Gulzar and Miryung Kim

In The 12th ACM Symposium on Cloud Computing 2021

13 Pages. 30% Acceptance Rate

Abstract

Fault-isolation is extremely challenging in large scale data processing in cloud environments. Data provenance (DP) is a dominant existing approach to isolate data records responsible for a given output. However, as the name suggests, data provenance concerns fault isolation only in the data-space, as opposed to fault isolation in the code-space–how can we precisely localize operations or APIs responsible for a given suspicious or incorrect result? We present OptDebug that identifies fault-inducing operations in a dataflow application using three insights. First, debugging is easier with a small-scale input than a large-scale input. So it uses data provenance to simplify the original set of input records data to a smaller set leading to test failures and test successes. Second, keeping track of operation provenance is crucial for debugging. Thus, it leverages automated taint analysis to propagate the lineage of operations downstream with individual records. Lastly, each operation may contribute to test failures to a different degree. Thus OptDebug ranks each operation’s spectra–the relative frequency in failing vs. passing tests. In our experiments, OptDebug achieves 100% recall and 86% precision in terms of detecting faulty operations and reduces the debugging time by 17X compared to a naïve approach. Overall, OptDebug shows great promise in improving developer productivity in today’s complex data processing pipelines by obviating the need to re-execute the program repetitively with different inputs and manually examine program traces to isolate buggy code.
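Ranking operations by how often they appear in failing versus passing tests can be sketched as follows. This is a minimal illustration that uses the well-known Tarantula suspiciousness score as a stand-in; the exact ranking formula used by OptDebug is not given in the abstract, and the function name and operation names here are hypothetical.

```python
from collections import Counter


def rank_operations(failing_runs, passing_runs):
    """Rank operations by a Tarantula-style suspiciousness score.

    Each run is the collection of operation names observed in one test.
    """
    fail_counts = Counter(op for run in failing_runs for op in set(run))
    pass_counts = Counter(op for run in passing_runs for op in set(run))
    total_fail = max(len(failing_runs), 1)
    total_pass = max(len(passing_runs), 1)
    scores = {}
    for op in set(fail_counts) | set(pass_counts):
        fail_ratio = fail_counts[op] / total_fail
        pass_ratio = pass_counts[op] / total_pass
        # 1.0 means the operation only appears in failing tests.
        scores[op] = fail_ratio / (fail_ratio + pass_ratio)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rank_operations(
    failing_runs=[{"map", "filter", "divide"}, {"map", "divide"}],
    passing_runs=[{"map", "filter"}, {"map"}],
)
```

In this toy input, `divide` appears only in failing tests and is therefore ranked first.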
[IMC 2021] TrackerSift: Untangling Mixed Tracking and Functional Web Resources

Abdul Hadi Amjad, Muhammad Saleem, Muhammad Ali Gulzar*, Zubair Shafiq*, and Fareed Zaffar*

In Proceedings of the 2021 ACM Internet Measurement Conference 2021

8 Pages. 27.9% Acceptance Rate

Abstract

Trackers typically circumvent filter lists used by privacy-enhancing content blocking tools by changing the domains or URLs of their resources. Filter list maintainers painstakingly attempt to keep up in the ensuing arms race by frequently updating the filter lists. Trackers have recently started to mix tracking and functional resources, putting content blockers in a bind: risk breaking legitimate functionality if they act and risk missing privacy-invasive advertising and tracking if they do not. In this paper, we conduct a large-scale measurement study of such mixed (i.e., both tracking and functional) resources on 100K websites. We propose TrackerSift, an approach that progressively classifies and untangles mixed web resources at multiple granularities of analysis (domain, hostname, script, and method). Using TrackerSift, we find that 83% of the domains can be separated as tracking or functional, and the remaining 17% (11.8K) domains are classified as mixed. For the mixed domains, 52% of the hostnames can be separated, and the remaining 48% (12.3K) hostnames are classified as mixed. For the mixed hostnames, 94% of the JavaScript snippets can be separated, and the remaining 6% (21.1K) scripts are classified as mixed. For the mixed scripts, 91% of the JavaScript methods can be separated, and the remaining 9% (5.5K) methods are classified as mixed. Overall, TrackerSift is able to attribute 98% of all requests to tracking or functional resources at the finest level of granularity. Our analysis shows that mixed resources at different granularities are typically served from CDN and other general-purpose content hosting domains and hostnames or as inlined and bundled scripts. Our results highlight the opportunities for fine-grained content blocking to remove mixed resources without breaking legitimate functionality.
[HiPS 2021] Towards a Serverless Bioinformatics Cyberinfrastructure Pipeline

Shunyu David Yao, Muhammad Ali Gulzar, Liqing Zhang, and Ali R. Butt

In Proceedings of the 1st Workshop on High Performance Serverless Computing 2021

8 Pages. Workshop Paper.

Abstract

Function-as-a-Service (FaaS) and the serverless computing model offer a powerful abstraction for supporting large-scale applications in the cloud. A major hurdle in this context is that it is non-trivial to transform an application, even an already containerized one, to a FaaS implementation. In this paper, we take the first step towards supporting easier and efficient application transformation to FaaS. We present a systematic scheme to transform applications written in Python into a set of functions that can then be automatically deployed atop platforms such as AWS Lambda. We target a Bioinformatics cyberinfrastructure pipeline, CIWARS, that provides waste-water analysis for the identification of antibiotic-resistant bacteria and viruses such as SARS-CoV-2. Based on our experience with enabling FaaS-based CIWARS, we develop a methodology that would help the conversion of other similar applications to the FaaS model. Our evaluation shows that our approach can correctly transform CIWARS to FaaS, and the new FaaS-based CIWARS incurs only negligible (≤2%) overhead for representative workloads.

2020

[SOCC 2020] Influence-Based Provenance for Dataflow Applications with Taint Propagation

Jason Teoh, Muhammad Ali Gulzar, and Miryung Kim

In The 11th ACM Symposium on Cloud Computing 2020

12 Pages. Full Paper. 24.4% Acceptance Rate

Abstract

Debugging big data analytics often requires a root cause analysis to pinpoint the precise culprit records in an input dataset responsible for incorrect or anomalous output. Existing debugging or data provenance approaches do not track fine-grained control and data flows in user-defined application code; thus, the returned culprit data is often too large for manual inspection and expensive post-mortem analysis is required. We design FlowDebug to identify a highly precise set of input records based on two key insights. First, FlowDebug precisely tracks control and data flow within user-defined functions to propagate taints at a fine-grained level by inserting custom data abstractions through automated source to source transformation. Second, it introduces a novel notion of influence-based provenance for many-to-one dependencies to prioritize which input records are more responsible than others by analyzing the semantics of a user-defined function used for aggregation. By design, our approach does not require any modification to the framework’s runtime and can be applied to existing applications easily. FlowDebug significantly improves the precision of debugging results by up to 99.9 percentage points and avoids repetitive re-runs required for post-mortem analysis by a factor of 33 while incurring an instrumentation overhead of 0.4X - 6.1X on vanilla Spark.
[ASE 2020] BigFuzz: Efficient Fuzz Testing for Data Analytics using Framework Abstraction

Qian Zhang, Jiyuan Wang, Muhammad Ali Gulzar, Rohan Padhye, and Miryung Kim

In The 35th IEEE/ACM International Conference on Automated Software Engineering 2020

12 Pages. Full Paper. 22.5% Acceptance Rate

Abstract

As big data analytics become increasingly popular, data-intensive scalable computing (DISC) systems help address the scalability issue of handling large data. However, automated testing for such data-centric applications is challenging, because data is often incomplete, continuously evolving, and hard to know a priori. Fuzz testing has been proven to be highly effective in other domains such as security; however, it is nontrivial to apply such traditional fuzzing to big data analytics directly for three reasons: (1) the long latency of DISC systems prohibits the applicability of fuzzing: naïve fuzzing would spend 98% of the time in setting up a test environment; (2) conventional branch coverage is unlikely to scale to DISC applications because most binary code comes from the framework implementation such as Apache Spark; and (3) random bit or byte level mutations can hardly generate meaningful data, which fails to reveal real-world application bugs. We propose a novel coverage-guided fuzz testing tool for big data analytics, called BigFuzz. The key essence of our approach is that: (a) we focus on exercising application logic as opposed to increasing framework code coverage by abstracting the DISC framework using specifications. BigFuzz performs automated source to source transformations to construct an equivalent DISC application suitable for fast test generation, and (b) we design schema-aware data mutation operators based on our in-depth study of DISC application error types. BigFuzz speeds up the fuzzing time by 78 to 1477X compared to random fuzzing, improves application code coverage by 20% to 271%, and achieves 33% to 157% improvement in detecting application errors. When compared to the state of the art that uses symbolic execution to test big data analytics, BigFuzz is applicable to twice as many programs and can find 81% more bugs.
[ESEC/FSE 2020] Is Neuron Coverage a Meaningful Measure for Testing Deep Neural Networks?

Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim

In The 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2020

12 Pages. Full Paper. 28.0% Acceptance Rate

Abstract

Recent effort to test deep learning systems has produced an intuitive and compelling test criterion called neuron coverage (NC), which resembles the notion of traditional code coverage. NC measures the proportion of neurons activated in a neural network and it is implicitly assumed that increasing NC improves the quality of a test suite. In an attempt to automatically generate a test suite that increases NC, we design a novel diversity promoting regularizer that can be plugged into existing adversarial attack algorithms. We then assess whether such attempts to increase NC could generate a test suite that (1) detects adversarial attacks successfully, (2) produces natural inputs, and (3) is unbiased to particular class predictions. Contrary to expectation, our extensive evaluation finds that increasing NC actually makes it harder to generate an effective test suite: higher neuron coverage leads to fewer defects detected, less natural inputs, and more biased prediction preferences. Our results invoke skepticism that increasing neuron coverage may not be a meaningful objective for generating tests for deep neural networks and call for a new test generation technique that considers defect detection, naturalness, and output impartiality in tandem.
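As a concrete reading of the definition above, neuron coverage can be computed as the fraction of neurons whose activation exceeds a threshold on at least one test input. The sketch below assumes activations have already been recorded as plain lists; the function name and the default threshold are choices made for this example, not taken from the paper.

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one input.

    activations: list with one entry per test input; each entry is a list of
    activation values, one per neuron (same length for every input).
    """
    if not activations or not activations[0]:
        return 0.0
    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for row in activations:
        for i, value in enumerate(row):
            if value > threshold:
                covered[i] = True
    return sum(covered) / num_neurons


suite = [
    [0.2, 0.0, 0.0, 0.9],  # activations for test input 1
    [0.0, 0.0, 0.5, 0.1],  # activations for test input 2
]
nc_default = neuron_coverage(suite)        # 3 of 4 neurons fire
nc_strict = neuron_coverage(suite, 0.4)    # only 2 of 4 exceed 0.4
```

The example also shows how sensitive the metric is to the chosen threshold, since the same suite yields different coverage values.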
[ICSE 2020] HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA

Jason Lau*, Aishwarya Sivaraman*, Qian Zhang*, Muhammad Ali Gulzar, Jason Cong, and Miryung Kim

In 2020 IEEE/ACM 42nd International Conference on Software Engineering 2020

13 Pages. Full Paper. 20.9% Acceptance Rate

Abstract

Heterogeneous computing with field-programmable gate-arrays (FPGAs) has demonstrated orders of magnitude improvement in computing efficiency for many applications. However, the use of such platforms so far is limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools made significant progress in raising the level of programming abstraction from hardware programming languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and require manual refactoring to make them HLS-compatible. Besides, experts also need to provide heavily handcrafted optimizations to improve resource efficiency, which affects the maximum operating frequency, parallelization, and power efficiency. We propose a new dynamic invariant analysis and automated refactoring technique, called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to optimize the accelerator’s resource usage and frequency further. Third, to guarantee correctness, it selectively offloads the computation from CPU to FPGA, only if an input falls within the dynamic invariant. On average, for a recursive program of size 175 LOC, an expert FPGA programmer would need to write 185 more LOC to implement an HLS compatible version, while HeteroRefactor automates such transformation. Our results on Xilinx FPGA show that HeteroRefactor minimizes BRAM by 83% and increases frequency by 42% for recursive programs; reduces BRAM by 41% through integer bitwidth reduction; and reduces DSP by 50% through floating-point precision tuning.
[ICSE Demo 2020] BigTest: Symbolic Execution Based Systematic Test Generation Tool for Apache Spark

Muhammad Ali Gulzar, Madan Musuvathi, and Miryung Kim

In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings 2020

4 Pages. Demonstration Paper. 33.3% Acceptance Rate

Abstract

Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are prevalent in many production services. Despite their popularity, the quality of DISC applications suffers due to a lack of exhaustive and automated testing. Current practices of testing DISC applications are limited to using a small random sample of the entire input dataset which rarely exposes any program faults. Unlike SQL queries, testing DISC applications has new challenges due to a composition of both dataflow and relational operators, and user-defined functions (UDF) that could be arbitrarily long and complex. To address this problem, we demonstrate a new white-box testing framework called BigTest that takes an Apache Spark program as input and automatically generates synthetic, concrete data for effective and efficient testing. BigTest combines the symbolic execution of UDFs with the logical specifications of dataflow and relational operators to explore all paths in a DISC application. Our experiments show that BigTest is capable of generating test data that can reveal up to 2X more faults than the entire data set with 194X less testing time. We implement BigTest in a Java-based command line tool with a pre-compiled binary jar. It exposes a configuration file in which a user can edit preferences, including the path of a target program, the upper bound of loop exploration, and a choice of theorem solver. The demonstration video of BigTest is available at https://youtu.be/OeHhoKiDYso and BigTest is available at https://github.com/maligulzar/BigTest.

2019

[ESEC/FSE 2019] White-box Testing of Big Data Analytics with Complex User-defined Functions

Muhammad Ali Gulzar, Shaghayegh Mardani, Madanlal Musuvathi, and Miryung Kim

In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2019

12 Pages. Full Paper. 24.4% Acceptance Rate

Abstract

Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges in exhaustive, automatic testing because they consist of dataflow operators, and complex user-defined functions (UDF) are prevalent unlike SQL queries. We design a new white-box testing approach, called BigTest, to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large scale input data size, real world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by a factor of 10^5 to 10^8 while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only a few data records (on the order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data.
[SoCC 2019] PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems

Jason Teoh, Muhammad Ali Gulzar, Harry Xu, and Miryung Kim

In Proceedings of the 2019 Symposium on Cloud Computing 2019

12 Pages. Full Paper. 24.8% Acceptance Rate

Abstract

Performance is a key factor for big data applications, and much research has been devoted to optimizing these applications. While prior work can diagnose and correct data skew, the problem of computation skew (abnormally high computation costs for a small subset of input data) has been largely overlooked. Computation skew commonly occurs in real-world applications and yet no tool is available for developers to pinpoint underlying causes. To enable a user to debug applications that exhibit computation skew, we develop a post-mortem performance debugging tool. PerfDebug automatically finds input records responsible for such abnormalities in a big data application by reasoning about deviations in performance metrics such as job execution time, garbage collection time, and serialization time. The key to PerfDebug’s success is a data provenance-based technique that computes and propagates record-level computation latency to keep track of abnormally expensive records throughout the pipeline. Finally, the input records that have the largest latency contributions are presented to the user for bug fixing. We evaluate PerfDebug via in-depth case studies and observe that remediation such as removing the single most expensive record or simple code rewrite can achieve up to 16X performance improvement.
[ICSE SEIP 2019] Perception and Practices of Differential Testing

Muhammad Ali Gulzar, Yongkang Zhu, and Xiaofeng Han

In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice 2019

10 Pages. Full Paper. 22.2% Acceptance Rate

Abstract

Tens of thousands of engineers are contributing to Google’s codebase that spans billions of lines of code. To ensure high code quality, a tremendous amount of effort has been made with new testing techniques and frameworks. However, with increasingly complex data structures and software systems, traditional test case based testing strategies cannot scale well to achieve the desired level of test adequacy. Differential (Diff) testing is one of the new testing techniques adapted to fill this gap. It uses the same input to run two versions of a software system, namely base and test, where base is the verified/tested version of the system while test is the modified version. The outputs of the two runs are then thoroughly compared to find abnormalities that may lead to possible bugs. Over the past few years, differential testing has been quickly adopted by hundreds of teams across all major product areas at Google. Meanwhile, many new differential testing frameworks were developed to simplify the creation, maintenance, and analysis of diff tests. Curious about this emerging popularity, we conducted the first empirical study on differential testing in practice at large scale. In this study, we investigated common practices and usage of diff tests. We further explore the features of diff tests that users value the most and the pain points of using diff tests. Through this user study, we discovered that differential testing does not replace fine-grained testing techniques such as unit tests. Instead it supplements existing testing suites. It helps users verify the impact on unmodified and unfamiliar components in the absence of a test oracle. In terms of limitations, diff tests often take a long time to run and appear to generate noisy and flaky outcomes. Finally, we highlight problems (including smart data differencing, sampling, and traceability) to guide future research in differential testing.

2018

[ICDCS 2018] LogLens: A Real-Time Log Analysis System

Biplob Debnath, Mohiuddin Solaimani, Muhammad Ali Gulzar, Nipon Arora, Cristian Lumezanu, Jianwu Xu, Bo Zong, Hui Zhang, Guofei Jiang, and Latifur Khan

In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS) 2018

11 Pages. Full Paper. 20.6% Acceptance Rate
[VLDB Journal 2018] Adding Data Provenance Support to Apache Spark

Matteo Interlandi, Ari Ekmekji, Kshitij Shah, Muhammad Ali Gulzar, Sai Deep Tetali, Miryung Kim, Todd Millstein, and Tyson Condie

The VLDB Journal 2018

21 Pages. VLDB Journal Paper.

Abstract

Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance–tracking data through transformations–in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds–orders of magnitude faster than alternative solutions–while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
[ESEC/FSE Demo 2018] BigSift: Automated Debugging of Big Data Analytics in Data-intensive Scalable Computing

Muhammad Ali Gulzar, Siman Wang, and Miryung Kim

In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2018

4 Pages. Demonstration Paper. 38.8% Acceptance Rate

Abstract

Developing Big Data Analytics often involves trial and error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g. program crash, outlier results, etc.) arise, developers are often interested in pinpointing the root cause of errors. To address this problem, BigSift takes an Apache Spark program, a user-defined test oracle function, and a dataset as input and outputs a minimum set of input records that reproduces the same test failure by combining the insights from delta debugging with data provenance. The technical contribution of BigSift is the design of systems optimizations that bring automated debugging closer to a reality for data intensive scalable computing. BigSift exposes an interactive web interface where a user can monitor a big data analytics job running remotely on the cloud, write a user-defined test oracle function, and then trigger the automated debugging process. BigSift also provides a set of predefined test oracle functions, which can be used for explaining common types of anomalies in big data analytics–for example, finding the origin of the output value that is more than k standard deviations away from the median. The demonstration video is available at https://youtu.be/jdBsCd61a1Q.
[ICSE ACM Student Research Competition 2018] Interactive and Automated Debugging for Big Data Analytics (ACM Student Research Competition Gold Medal Winner)

Muhammad Ali Gulzar

In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings 2018

3 Pages. Short Paper.

Abstract

An abundance of data in many disciplines of science, engineering, national security, health care, and business has led to the emerging field of Big Data Analytics that run in a cloud computing environment. To process massive quantities of data in the cloud, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Google’s MapReduce, Hadoop, and Spark. Currently, developers do not have easy means to debug DISC applications. The use of cloud computing makes application development feel more like batch jobs, and the nature of debugging is therefore post-mortem. Developers of big data applications write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or they get a suspicious result, data scientists spend hours guessing at the source of the error, digging through post-mortem logs. In such cases, the data scientists may want to pinpoint the root cause of errors by investigating a subset of corresponding input records. The vision of my work is to provide interactive, real-time, and automated debugging services for big data processing programs in modern DISC systems with minimum performance impact. My work investigates the following research questions in the context of big data analytics: (1) What are the necessary debugging primitives for interactive big data processing? (2) What scalable fault localization algorithms are needed to help the user localize and characterize the root causes of errors? (3) How can we improve testing efficiency during iterative development of DISC applications by reasoning about the semantics of dataflow operators and the user-defined functions used inside dataflow operators in tandem? To answer these questions, we synthesize and innovate ideas from software engineering, big data systems, and program analysis, and coordinate innovations across the software stack from the user-facing API all the way down to the systems infrastructure.

2017

[SoCC 2017] Automated Debugging in Data-intensive Scalable Computing

Muhammad Ali Gulzar, Matteo Interlandi, Xueyuan Han, Mingda Li, Tyson Condie, and Miryung Kim

In Proceedings of the 2017 Symposium on Cloud Computing 2017

15 Pages. Full Paper. 23.6% Acceptance Rate

Abstract

Developing Big Data Analytics workloads often involves trial and error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crash, outlier results, etc.) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localizability by several orders-of-magnitude (∼10³ to 10⁷×) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time.
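
For readers unfamiliar with the Delta Debugging baseline mentioned in this abstract, the following Python sketch shows a simplified version of the classic ddmin idea: repeatedly split the input records and keep any smaller subset that still fails a test oracle. It is an illustration only, not BigSift's implementation (it omits data provenance and all of BigSift's optimizations), and the names ddmin and fails are hypothetical.

```python
def ddmin(records, fails):
    """Shrink `records` to a smaller list for which the oracle `fails` still returns True."""
    assert fails(records), "the full input must reproduce the failure"
    n = 2  # current number of partitions
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Case 1: one partition alone reproduces the failure.
            if fails(subset):
                records, n, reduced = subset, 2, True
                break
            # Case 2: the failure persists after removing one partition.
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if fails(complement):
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity; nothing more to remove
            n = min(len(records), n * 2)  # refine the partitioning
    return records


# Hypothetical oracle: the job fails only when records 7 and 13 are both present.
culprits = ddmin(list(range(20)), lambda rs: 7 in rs and 13 in rs)
print(culprits)
```

Each call to the oracle corresponds to re-running the job on a subset of the input, which is why a plain search of this kind becomes expensive at scale.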
[SIGMOD Demo 2017] Debugging Big Data Analytics in Spark with BigDebug

Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, and Miryung Kim

In Proceedings of the 2017 ACM International Conference on Management of Data 2017

4 Pages. Demonstration Paper. 34% Acceptance Rate

Abstract

To process massive quantities of data, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Apache Spark. In terms of debugging, DISC systems support only post-mortem log analysis and do not provide any interactive debugging functionality. This demonstration paper showcases BigDebug: a tool enhancing Apache Spark with a set of interactive debugging features that help users debug their Big Data applications.

2016

[ICSE 2016] BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark

Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Tetali, Tyson Condie, Todd Millstein, and Miryung Kim

In 2016 IEEE/ACM 38th International Conference on Software Engineering 2016

12 Pages. Full Paper. 19.1% Acceptance Rate

Abstract

Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today’s datacenters is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires rethinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user. First, BigDebug’s simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BigDebug scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BigDebug supports debugging at interactive speeds with minimal performance impact.
[SoCC 2016] Optimizing Interactive Development of Data-Intensive Applications

Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Tyson Condie, Miryung Kim, and Todd Millstein

In Proceedings of the Seventh ACM Symposium on Cloud Computing 2016

13 Pages. Full Paper. 25.1% Acceptance Rate

Abstract

Modern Data-Intensive Scalable Computing (DISC) systems are designed to process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. These programs are often developed interactively by posing ad-hoc queries over the base data until a desired result is generated. We observe that there can be significant overlap in the structure of these queries used to derive the final program. Yet, each successive execution of a slightly modified query is performed anew, which can significantly increase the development cycle. Vega is an Apache Spark framework that we have implemented for optimizing a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Spark developers (e.g., data scientists) can leverage Vega to significantly reduce the amount of time it takes to re-execute a modified Spark program, reducing the overall time to market for their Big Data applications.
[VLDB 2016] Titian: Data Provenance Support in Spark (The "Best of VLDB" Paper)

Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie

Proc. VLDB Endow. 2016

12 Pages. Full Paper. 21.2% Acceptance Rate

Abstract

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders-of-magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
[HotCloud 2016] Interactive Debugging for Big Data Analytics

Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Todd Millstein, and Miryung Kim

In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16) 2016

7 Pages. Workshop Paper. 30.8% Acceptance Rate

Abstract

An abundance of data in many disciplines has accelerated the adoption of distributed technologies such as Hadoop and Spark, which provide simple programming semantics and an active ecosystem. However, the current cloud computing model lacks the kinds of expressive and interactive debugging features found in traditional desktop computing. We seek to address these challenges with the development of BIGDEBUG, a framework providing interactive debugging primitives and tool-assisted fault localization services for big data analytics. We showcase the data provenance and optimized incremental computation features to effectively and efficiently support interactive debugging, and investigate new research directions on how to automatically pinpoint and repair the root cause of errors in large-scale distributed data processing.
[ESEC/FSE Demo 2016] BigDebug: Interactive Debugger for Big Data Analytics in Apache Spark

Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, and Miryung Kim

In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering 2016

5 Pages. Demonstration Paper. 40.1% Acceptance Rate

Abstract

To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google’s MapReduce, Apache Hadoop, and Apache Spark. In terms of debugging, DISC systems support post-mortem log analysis but do not provide interactive debugging features in real time. This tool demonstration paper showcases a set of concrete use cases on how BigDebug can help debug Big Data applications by providing interactive, real-time debug primitives. To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints to enable a user to inspect a program without actually pausing the entire computation. To minimize unnecessary communication and data transfer, BigDebug provides on-demand watchpoints that enable a user to retrieve intermediate data using a guard and transfer the selected data on demand. To support systematic and efficient trial-and-error debugging, BigDebug also enables users to change program logic in response to an error at runtime and replay the execution from that step. BigDebug is available for download at http://web.cs.ucla.edu/~miryung/software.html

2015

[PACIS 2015] A Classification Based Framework to Predict Viral Threads

Hashim Sharif, Saad Ismail, Shehroze Farooqi, Mohammad Taha Khan, Muhammad Ali Gulzar, Hasnain Lakhani, Fareed Zaffar, and Ahmed Abbasi

In The Pacific Asia Conference on Information Systems (PACIS) 2015

13 Pages. Full Paper.

* Student authors contributed equally.

* Senior authors are alphabetically arranged.