About
I build reliable language and learning systems that connect research innovation with real-world impact.
My work spans large language models (LLMs), retrieval-augmented generation (RAG), and applied machine learning at scale,
with recent projects on memory-augmented reasoning, long-context robustness, metadata-aware retrieval, synthetic data generation, and information-guided fine-tuning.
I enjoy designing, evaluating, and deploying end-to-end ML and GenAI solutions in collaboration with cross-disciplinary teams across academia, media, and industry.
Education
- Ph.D. Candidate, Computer Science — Virginia Tech
- M.S., Computer Science — Virginia Tech (2022)
- B.S., Computer Science — Bangladesh University of Engineering & Technology (BUET) (2017)
Honors & Awards
- Best Paper Award — IEEE BigData 2024
- Dean’s List — BUET (2015–2017)
- University Travel Grant — Virginia Tech (ACM KDD 2022)
- Top Micro-controller Project Award — BUET (2016)
Selected Publications
Authors are listed in the order given in each paper. An asterisk (*) marks a workshop paper or preprint.
- LLM Augmentations to Support Analytical Reasoning over Multiple Documents
  Raquib Bin Yousuf, Nicholas Defelice, Mandar Sharma, Shengzhe Xu, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Big Data (BigData), 2024. Best Paper Award.
- Can an LLM Induce a Graph? Investigating Memory Drift and Context Length
  Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Knowledge Graph (ICKG), 2025.
- Information Guided Regularization for Fine-tuning Language Models
  Mandar Sharma, Nikhil Muralidhar, Shengzhe Xu, Raquib Bin Yousuf, Naren Ramakrishnan.
  Proceedings of the 1st Conference on Language Modeling (COLM), 2024.
- Lessons from Deep Learning Applied to Scholarly Information Extraction: What Works, What Doesn’t, and Future Directions
  Raquib Bin Yousuf, Subhodip Biswas, Kulendra Kumar Kaushal, James Dunham, Rebecca Gelles, Sathappan Muthiah, Nathan Self, Patrick Butler, Naren Ramakrishnan.
  Data-driven Science of Science Workshop at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.*
- Chasing the Timber Trail: Machine Learning to Reveal Harvest Location Misrepresentation
  Shailik Sarkar, Raquib Bin Yousuf, Linhan Wang, Brian Mayer, Thomas Mortier, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Marigold Norman, Jade Saunders, Chang-Tien Lu, Naren Ramakrishnan.
  Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025.
- Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)
  Shengzhe Xu, Cho-Ting Lee, Mandar Sharma, Raquib Bin Yousuf, Nikhil Muralidhar, Naren Ramakrishnan.
  Structured Knowledge for LLMs Workshop at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025.*
- Forecasting Migration Patterns and Land Border Encounters
  Raquib Bin Yousuf, Shengzhe Xu, Patrick Butler, Brian Mayer, Nathan Self, David Mares, Naren Ramakrishnan.
  Proceedings of the IEEE International Conference on Big Data (BigData), 2024.
- A Probabilistic Approach to Estimating Timber Harvest Location
  Jakub Truszkowski, Roi Maor, Raquib Bin Yousuf, Subhodip Biswas, Caspar Chater, Peter Gasson, Scot McQueen, Marigold Norman, Jade Saunders, John Simeone, Naren Ramakrishnan, Alexandre Antonelli, Victor Deklerck.
  Ecological Applications, 35(1): e3077, 2025.
- Mining Developer Questions about Major Web Frameworks
  Zakaria Mehrab, Raquib Bin Yousuf, Ibrahim Asadullah Tahmid, Rifat Shahriyar.
  Proceedings of the International Conference on Web Information Systems and Technologies (WEBIST), 2018.
- Optimizing Product Provenance Verification using Data Valuation Methods
  Raquib Bin Yousuf, Hoang Anh Just, Shengzhe Xu, Brian Mayer, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Jade Saunders, Chang-Tien Lu, Ruoxi Jia, Naren Ramakrishnan.
  arXiv preprint arXiv:2502.15177, 2025.*
Research & Industry Experience
The Washington Post — Data Science & ML Intern (2023)
- Built and tested LLM-based GenAI pipelines on AWS to improve newsroom efficiency; the work led to a funded VT–Washington Post collaboration (“Ask the Post”).
- Advised junior PhD students on follow-on efforts.
Virginia Tech — Graduate Research Assistant (2019–Present)
- Developed novel LLM architectures and evaluation frameworks to assess reasoning quality and factual grounding, including memory-augmented models for multi-document reasoning, investigations of long-context drift via graph induction, and metadata-aware retrieval for improved RAG grounding.
- Proposed Fisher-information-guided regularization and tabular data synthesis for low-data regimes.
- Developed and deployed Transformer-based NLP systems for automated research-entity extraction and narrative recommendation, leveraging encoder-decoder architectures, graph modeling, and prompt engineering to preserve expertise and contextual relevance.
- Deployed ML/data-valuation models for Stable Isotope Ratio Analysis in the WFID platform for EU deforestation compliance; helped detect 260+ tons of illegal timber.
- Constructed migration-flow forecasting pipelines for the Americas, used for live policy/security insights.
Teaching
- Lecturer, Eastern University Bangladesh — Advanced Programming; Digital Logic Design (2018).
- GTA, Virginia Tech — OOP; Software Design & Data Structures; Social Media Analytics (2019–2021).
Service & Outreach
- Reviewer — IEEE Transactions on Big Data (2023).
- Graduate Student Volunteer — Sanghani Center for AI & Data Analytics (2022–Present).
- President — MGBS Science Club (2009–2010).
Skills
Machine Learning & GenAI: Large Language Models (LLMs), Prompt Engineering, Fine-tuning (SFT, RLHF, LoRA/QLoRA), Agentic AI & LLM Agents, RAG (Retrieval-Augmented Generation), Embeddings & Vector Databases, Reward Modeling, LLM Evaluation & Quality Assessment, Synthetic Data Generation, Long-context Reasoning, Model Quantization, Supervised Learning, Feature Engineering & Model Explainability, Time-series Forecasting, Graph-based Modeling, Transfer Learning.
Frameworks & Tools: PyTorch, Hugging Face Transformers, PEFT, TRL, LangChain, OpenAI, scikit-learn, spaCy, NLTK, Pandas, NumPy, NetworkX, AWS (SageMaker, S3, EC2, Lambda), Docker, Git.
Programming: Python, Java, C++, MATLAB, R, SQL/NoSQL.