Data and Information Ph.D. Qualifier Exam
Spring 2023
Examining Faculty
Dr. Lifu Huang (Chair)
Dr. Bimal Viswanath
Dr. Wuchun Feng
Dr. Jin-Hee Cho
Dr. Dawei Zhou
Dr. Peng Gao
Registered Students
Shravya Kanchi (Machine Learning and Security)
Amun Kharel (Reinforcement Learning)
Anish Narkar (Machine Learning and Security)
Qi Zhang (Reinforcement Learning)
Sha Li (Data Mining and Information Retrieval)
Ali Haisam Muhammad Rafid (Data Mining and Information Retrieval)
Chase Vickery (Reinforcement Learning)
Saikat Dey (Machine Learning and Software)
Kiet Nguyen (Natural Language Processing)
Indrajeet Kumar Mishra (Natural Language Processing)
Tianjiao Yu (Natural Language Processing)
Tentative Instructions
- Send an email to lifuh@vt.edu to register. Participating in the survey did not register you. We need to hear from you!
- First, a paper reading list will be released. At the beginning of the examination period, all students will receive a document that contains questions.
- By the end of the examination period, each student must turn in a written solution to those questions. Solutions must be no longer than 8 pages (excluding references), in 11-point font or larger, using a format TBA.
- Written solutions should take the form of a scientific paper. Each solution should include at least the following:
- a motivation section making clear the context of the problem/situation;
- a clear statement of the problem in terms of concepts and terminology in the information/data area, that addresses the situation/context;
- a review of related literature, drawn partially from multiple relevant works in the reading list but also including additional references found by the student during a thorough literature search;
- descriptions of approaches to solve the problem; and
- an evaluation plan for how such approaches would be validated.
- Students will then give an oral presentation detailing their solution. Each presentation must be completed within a 15-minute period: 10 minutes for the presentation and 5 minutes for answering questions posed by faculty examiners.
- Each solution will be graded by at least 2 faculty members. The area committee will then assign each student a combined grade, based on all faculty input, on a scale of 0-3, as called for by GPC policies.
Early Withdrawal Policy
A student registered for the Ph.D. qualifier exam may withdraw at any time before the early withdrawal deadline, which is Jan 1, 2023. After this date, withdrawal is prohibited. Students with questions about this policy should contact the exam chair directly.
Academic Integrity
Discussion among students of the papers identified for the exam is reasonable up until the date the exam questions are released publicly. Once the questions are released, we expect all such discussions to cease, as students are required to answer the qualifier questions entirely through their own work. This examination is conducted under the University's Graduate Honor System Code. Students are encouraged to draw from papers other than those listed in the exam to the extent that doing so strengthens their arguments. However, the answers submitted must represent the sole and complete work of the student submitting them. Material substantially derived from other works, whether published in print or found on the web, must be explicitly and fully cited. Note that your grade will be influenced more strongly by the arguments you make than by the arguments you quote or cite.
Exam Schedule
12/01/2022: Release of reading list
12/05/2022: Deadline for students to commit to exam
1/1/2023: Last day to withdraw
1/11/2023: Release of written exam
1/27/2023: Student solutions to written exam due
Beginning of Feb: Oral exams
Reading List
The reading lists below cover the following topics: (1) Data Mining and Information Retrieval, (2) Natural Language Processing, (3) Reinforcement Learning, (4) Machine Learning and Security, and (5) Machine Learning and Software. You may choose any one of these lists for your exam. You are expected to significantly expand on your selected list while preparing your written solution.
List 1: Data Mining and Information Retrieval
- MentorGNN: Deriving Curriculum for Pre-Training GNNs. Dawei Zhou, Lecheng Zheng, Dongqi Fu, Jiawei Han, and Jingrui He. CIKM, 2022.
- A Data-Driven Graph Generative Model for Temporal Interaction Networks. Dawei Zhou, Lecheng Zheng, Jiawei Han, and Jingrui He. KDD, 2020.
- Beta Embeddings for Multi-Hop Logical Reasoning in Knowledge Graphs. Hongyu Ren and Jure Leskovec. NeurIPS, 2020.
- Local Motif Clustering on Time-Evolving Graphs. Dongqi Fu, Dawei Zhou, and Jingrui He. KDD, 2020.
- Domain Adaptive Multi-Modality Neural Attention Network for Financial Forecasting. Dawei Zhou, Lecheng Zheng, Jianbo Li, Yada Zhu, and Jingrui He. WWW, 2020.
- Adversarial Attacks on Neural Networks for Graph Data. Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. KDD, 2018.
List 2: Natural Language Processing
- Fine-tuned Language Models are Zero-Shot Learners. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. ICLR, 2022.
- Lifelong Event Detection with Knowledge Transfer. Pengfei Yu, Heng Ji, and Prem Natarajan. EMNLP, 2021.
- Diffusion-LM Improves Controllable Text Generation. Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. NeurIPS, 2022.
- MERLOT: Multimodal Neural Script Knowledge Models. Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. NeurIPS, 2021.
- TaPas: Weakly Supervised Table Parsing via Pre-training. Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. ACL, 2020.
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. ICML, 2022.
List 3: Reinforcement Learning
- Deep Reinforcement Learning at the Edge of the Statistical Precipice. Rishabh Agarwal, Max Schwarzer, and Marc G. Bellemare. NeurIPS, 2021.
- Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents. Felipe Leno Da Silva, Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. AAAI, 2020.
- Deep Reinforcement Learning that Matters. Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. AAAI, 2018.
- A Deep Reinforcement Learning Perspective on Internet Congestion Control. Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. ICML, 2019.
- DRN: A Deep Reinforcement Learning Framework for News Recommendation. Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. WWW, 2018.
- Cooperative Exploration for Multi-Agent Deep Reinforcement Learning. Iou-Jen Liu, Unnat Jain, Raymond A. Yeh, and Alexander G. Schwing. ICML, 2021.
List 4: Machine Learning and Security
- Transcending TRANSCEND: Revisiting Malware Classification in the Presence of Concept Drift. Federico Barbero, Feargus Pendlebury, Fabio Pierazzi, and Lorenzo Cavallaro. IEEE S&P, 2022.
- CADE: Detecting and Explaining Concept Drift Samples for Security Applications. Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ahmadzadeh, Xinyu Xing, and Gang Wang. USENIX Security, 2021.
- INSOMNIA: Towards Concept-Drift Robustness in Network Intrusion Detection. Giuseppina Andresini, Feargus Pendlebury, Fabio Pierazzi, Corrado Loglisci, Annalisa Appice, and Lorenzo Cavallaro. AISec, 2021.
- AI/ML for Network Security: The Emperor has no Clothes. Arthur S. Jacobs, Roman Beltiukov, Walter Willinger, Ronaldo A. Ferreira, Arpit Gupta, and Lisandro Z. Granville. CCS, 2022.
- Concept Drift Detection Through Resampling. Maayan Harel, Koby Crammer, Ran El-Yaniv, and Shie Mannor. ICML, 2014.
- Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. Robin Sommer and Vern Paxson. IEEE S&P, 2010.
List 5: Machine Learning and Software
- ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O’Boyle, and Hugh Leather. ICML, 2021.
- Scalable Deep Learning via I/O Analysis and Optimization. Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. TOPC, 2019.
- Iterative Machine Learning (IterML) for Effective Parameter Pruning and Tuning in Accelerators. Xuewen Cui and Wu-chun Feng. Computing Frontiers, 2019.
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks. Albert Njoroge Kahira et al. HPDC, 2021.
- Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning. Truong Thao Nguyen et al. IPDPS, 2022.
Exam Questions
Exam questions are available here: PDF
Grading Scale
The exam will ultimately be graded on a scale as detailed in the Ph.D. Student Handbook, as replicated here.
- 0: The student's performance is such that the committee considers the student unable to do Ph.D.-level work in computer science.
- 1: While the student adequately understands the content of the work, the student is deficient in one or more of the factors listed for assessment under a score of 2. A score of 1 is the minimum necessary for an MS-level pass.
- 2: Performance appropriate for students preparing to do Ph.D.-level work. Prime factors for assessment include being able to distinguish good work from poor work, and explain why; being able to synthesize the body of work into an assessment of the state of the art on a problem (as indicated by the collection of papers); and being able to identify open problems and suggest future work.
- 3: Excellent performance, beyond that normally expected or required of a Ph.D. student.