About
PhD candidate specializing in reasoning, factual grounding, and agentic systems for large language models, integrating retrieval and memory architectures with model adaptation and data-centric learning. Built and deployed large-scale ML systems across structured and unstructured domains, including generative AI at The Washington Post and applied ML in search, traceability, and forecasting.
Skills
Machine Learning & GenAI:
LLMs, Prompt Engineering, Fine-tuning (SFT, RLHF, LoRA/QLoRA), Agentic AI, Model Context Protocol (MCP),
Retrieval-Augmented Generation (RAG), Embeddings & Vector Databases, Recommender Systems, Representation Learning,
LLM Evaluation & Quality Assessment, Reward Modeling, Synthetic Data Generation, Long-context Reasoning,
Quantization, Serving & Deployment, Supervised Learning, Feature Engineering, Explainability,
Graph-based Modeling, Time-series Forecasting, Transfer Learning.
Frameworks & Tools:
PyTorch, Transformers, PEFT, TRL, LangChain, OpenAI, Claude, Gemini,
Elasticsearch, FastAPI, AWS (SageMaker, EC2), GCP (Compute Engine), Azure,
scikit-learn, spaCy, NetworkX, Pandas, NumPy,
Streamlit, Docker, Git.
Programming:
Python, Java, C++, MATLAB, R, SQL/NoSQL.
Selected Publications
Author order as in papers. * indicates workshop or preprint where applicable.
- Utilizing Metadata for Better Retrieval-Augmented Generation
  Raquib Bin Yousuf, Shengzhe Xu, Mandar Sharma, Andrew Neeser, Chris Latimer, Naren Ramakrishnan.
  Proceedings of the 48th European Conference on Information Retrieval (ECIR 2026), accepted.
- LLM Augmentations to Support Analytical Reasoning over Multiple Documents
  Raquib Bin Yousuf, Nicholas Defelice, Mandar Sharma, Shengzhe Xu, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Big Data (BigData), 2024. Best Paper Award.
- Can an LLM Induce a Graph? Investigating Memory Drift and Context Length
  Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Knowledge Graph (ICKG), 2025.
- Information Guided Regularization for Fine-tuning Language Models
  Mandar Sharma, Nikhil Muralidhar, Shengzhe Xu, Raquib Bin Yousuf, Naren Ramakrishnan.
  Proceedings of the 1st Conference on Language Modeling (COLM), 2024.
- Lessons from Deep Learning Applied to Scholarly Information Extraction: What Works, What Doesn’t, and Future Directions
  Raquib Bin Yousuf, Subhodip Biswas, Kulendra Kumar Kaushal, James Dunham, Rebecca Gelles, Sathappan Muthiah, Nathan Self, Patrick Butler, Naren Ramakrishnan.
  Data-driven Science of Science Workshop at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.*
- Optimizing Product Provenance Verification using Data Valuation Methods
  Raquib Bin Yousuf, Hoang Anh Just, Shengzhe Xu, Brian Mayer, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Jade Saunders, Chang-Tien Lu, Ruoxi Jia, Naren Ramakrishnan.
  Proceedings of the AAAI Conference on Artificial Intelligence, 2026 (accepted). Preprint: arXiv:2502.15177, 2025.
- Chasing the Timber Trail: Machine Learning to Reveal Harvest Location Misrepresentation
  Shailik Sarkar, Raquib Bin Yousuf, Linhan Wang, Brian Mayer, Thomas Mortier, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Marigold Norman, Jade Saunders, Chang-Tien Lu, Naren Ramakrishnan.
  Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025.
- Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)
  Shengzhe Xu, Cho-Ting Lee, Mandar Sharma, Raquib Bin Yousuf, Nikhil Muralidhar, Naren Ramakrishnan.
  Structured Knowledge for LLMs Workshop at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025.*
- Forecasting Migration Patterns and Land Border Encounters
  Raquib Bin Yousuf, Shengzhe Xu, Patrick Butler, Brian Mayer, Nathan Self, David Mares, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Big Data (BigData), 2024.
- A Probabilistic Approach to Estimating Timber Harvest Location
  Jakub Truszkowski, Roi Maor, Raquib Bin Yousuf, Subhodip Biswas, Caspar Chater, Peter Gasson, Scot McQueen, Marigold Norman, Jade Saunders, John Simeone, Naren Ramakrishnan, Alexandre Antonelli, Victor Deklerck.
  Ecological Applications, 35(1): e3077, 2025.
- Mining Developer Questions about Major Web Frameworks
  Zakaria Mehrab, Raquib Bin Yousuf, Ibrahim Asadullah Tahmid, Rifat Shahriyar.
  Proceedings of the International Conference on Web Information Systems and Technologies (WEBIST), 2018.
Honors & Service
- Best Paper Award — IEEE BigData 2024
- Paul E. Torgersen Graduate Student Research Excellence Award (PhD Finalist) — Virginia Tech, 2026
- Conference Travel Grants — Virginia Tech (KDD 2022, BigData 2024, ICKG 2025)
- Reviewer — COLM 2026; IEEE Transactions on Big Data
Research & Industry Experience
The Washington Post — Machine Learning Intern 2023
- Fine-tuned and evaluated LLMs on AWS (EC2, SageMaker) to explore newsroom applications, including subheadline generation, summarization, and question answering.
- Contributed to early development of the "Ask the Post" chatbot, advising on RAG design and evaluation in collaboration with newsroom stakeholders.
Virginia Tech — Graduate Research Assistant 2019–Present
- Developed memory-augmented LLM architectures for multi-document reasoning, achieving a 25% relative improvement in classification performance and a 50% improvement in generation quality; introduced a graph-based benchmark revealing memory-drift onset at only 1–6% of advertised context capacity.
- Designed metadata-aware dual-encoder retrieval methods for RAG, achieving a 70% average relative improvement over text-only baselines; deployed a human-in-the-loop agentic system supporting hundreds of newsroom sessions with 86% schema-alignment reliability.
- Proposed permutation-aware tabular generation and Fisher-guided regularization for LLM fine-tuning, reducing synthetic data rule violations by 70% and improving generalization in low-data regimes on 9 of 10 GLUE tasks with no added computational overhead.
- Built a full-text scientific information extraction system using domain-adapted transformers, achieving an 11% improvement over prior baselines and a 26% gain in salient task/method extraction.
- Led development of production ML systems for forest product provenance and large-scale migration forecasting, deploying regulatory and real-time predictive pipelines used in compliance and policy settings; supported assessment of 59 wood products, identified 260+ tons of allegedly illegal timber, and contributed to 9+ enforcement investigations.
- Contributed to development of funded research proposals for projects supported by DARPA, NSF, and industry partners.
Teaching
- Lecturer, Eastern University Bangladesh — Advanced Programming; Digital Logic Design (2018).
- GTA, Virginia Tech — OOP; Software Design & Data Structures; Social Media Analytics (2019–2021).
Talks & Presentations
- LLM Augmentations for Multi-Document Reasoning — IEEE BigData 2024
- Graph Induction and Memory Drift in LLMs — IEEE ICKG 2025
- LLM Systems for Newsroom Applications — The Washington Post
Academic Service
- Reviewer — Conference on Language Modeling (COLM), IEEE Transactions on Big Data
- Mentored undergraduate and junior graduate researchers on ML and LLM projects