About Me

I am interested in developing data mining and machine learning methods to solve scientific and socially relevant problems. A primary focus of my research is to advance the growing field of theory-guided data science, where machine learning methods are systematically coupled with scientific knowledge (or physics) to accelerate scientific discovery. I look forward to working at the boundaries of data science and scientific domains by forging inter-disciplinary collaborations. I am also looking for bright and ambitious students who are motivated to solve real-world problems by pursuing novel research in data science. Before joining Virginia Tech, I received my Ph.D. from the University of Minnesota under the guidance of Prof. Vipin Kumar. I have also had the chance to co-author the textbook, Introduction to Data Mining (2nd edition).


New: I am teaching an advanced topics course, CS 6804: Machine Learning Meets Physics, this Fall semester. This course will provide an overview of theory-guided data science and prepare students to pursue this research paradigm in inter-disciplinary problems of their choice. If you are a prospective student, please feel free to contact me with any questions about the course.

Link to Request Force-Add: https://hosting.cs.vt.edu/gpc/ForceAdd.html
Background Survey (Assignment 0): https://virginiatech.qualtrics.com/jfe/form/SV_9MLf4daSTDfYjTD


Recent Updates

[08-10-2018] Joined as an Assistant Professor in the Department of Computer Science at Virginia Tech.
[07-29-2018] I will be giving an invited talk in the session on "Artificial Intelligence Applications in the Geosciences: Promises and Challenges for the Future" at the AGU Fall Meeting 2018.
[07-23-2018] Serving as a Program Committee Member of the Association for the Advancement of Artificial Intelligence (AAAI) Conference 2019.
[07-04-2018] My article on "Machine Learning for Geosciences: Challenges and Opportunities" has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering (TKDE). A preprint of the article is available on arXiv.
[06-23-2018] Invited to speak at a Symposium on 'Physics/Chemistry-aware machine learning' organized as part of the SIAM Conference on Computational Science and Engineering 2019.
[06-15-2018] I am serving on the program committee for a cross-disciplinary workshop on "A new paradigm in lake and reservoir research and management through global monitoring, modeling, and engaging and empowering people networks," which will be held from Sep 5–7, 2018 in Washington D.C. This workshop will showcase our efforts in modeling and monitoring water bodies and bring together water scientists and machine learning researchers under the theme of theory-guided data science.
[05-30-2018] Invited to speak at the Institute for Pure and Applied Mathematics (IPAM) Workshop on HPC for Computationally and Data-Intensive Problems.
[05-03-2018] Gave an invited talk at the National Center for Atmospheric Research (NCAR) in a seminar conducted by the Computational & Informational Systems Lab.
[04-30-2018] I am joining as a Review Editor for the Editorial Board of the 'Data-driven Climate Sciences' section of Frontiers in Big Data.
[03-21-2018] I am co-organizing a workshop at KDD 2018 on Fragile Earth: Theory Guided Data Science to Enhance Scientific Discovery.
[03-06-2018] Gave an invited talk at Oak Ridge National Laboratory on my research in theory-guided data science and its potential for accelerating knowledge discovery in various scientific disciplines.
[02-03-2018] I am pleased to announce that our textbook, Introduction to Data Mining, 2nd edition, is finally out! More information can be found on the book's website, which contains companion materials such as slides and additional resources for instructors and students.
[01-25-2018] Presented my work at the 2nd workshop on Physics Informed Machine Learning, which was highly relevant for my research on theory-guided data science. It was great interacting with everyone from different disciplines working at the intersections of machine learning and natural sciences.
[12-27-2017] I am serving as a Program Committee Member for ACM SIGKDD (Research Track) 2018.
[12-15-2017] Presented my latest work on theory-guided data science for geoscience problems at AGU Fall Meeting 2017.
[12-09-2017] Presented my work on physics-guided neural networks at the Deep Learning for Physical Sciences Workshop at NIPS 2017.
[11-21-2017] Gave an Invited Keynote Talk at the ICDM 2017 Workshop on Data Mining for Earth System Science.
[11-13-2017] Survey article on spatio-temporal data mining got accepted at ACM Computing Surveys. A preprint of the survey is available on arXiv.
[10-31-2017] Started a collaboration with DC Water and other agencies (USGS, EPA, Xylum, Limnotech, ESRI) to create a "digital twin" of the Anacostia Watershed. Our team will contribute to the monitoring of environmental processes (land use changes) and the modeling of water quality indicators using hybrid-physics-data models.
[10-31-2017] Preprint of my article on physics-guided neural networks is available on arXiv.
[10-30-2017] Gave an invited talk at the Big Data and Sustainability session of the Annual Meeting of the American Institute of Chemical Engineers (AIChE).
[10-22-2017] Our grant proposal on "Model Integration Through Knowledge-Rich Data and Process Composition (MINT)," led by Prof. Yolanda Gil, got accepted for the DARPA World Modelers Program.
[10-12-2017] Paper got accepted at IEEE International Conference on Big Data 2017.
[10-11-2017] We have secured one of the University of Michigan's submission slots for the upcoming NSF Science and Technology Center (STC) program. This is a multi-institution project led by Prof. Christiane Jablonowski that seeks to establish an inter-disciplinary center for extreme weather and physics-aware data science.
[10-01-2017] Started as a Postdoctoral Associate at the University of Minnesota with Prof. Vipin Kumar. The primary focus of my post-doctoral research is on advancing theory-guided data science.
[09-27-2017] Defended my Ph.D. dissertation on "Predictive Learning with Heterogeneity in Populations". Officially, Dr. Anuj Karpatne!
[09-19-2017] Our grant proposal for "NSF Innovations at the Nexus of Food, Energy and Water Systems (INFEWS)" got accepted for funding. See this NSF news story to get more information about this project.
[08-23-2017] Attended the SAMSI Climate Opening Workshop in Raleigh, NC. We had some great talks and discussions by a closely-connected group of leading experts on statistics and machine learning for climate science.
[07-25-2017] I served as an instructor for the summer school on "Intelligent Systems for Geosciences (IS-GEO)" at the University of Texas at Austin. It was great interacting with motivated students and researchers from diverse academic and professional backgrounds to get them excited about the growing field of machine learning and its importance in advancing scientific discovery in geosciences.
[07-19-2017] I am serving as a convener for the session on "Intelligent Systems for Geosciences: Accelerating Discovery and Building Community" at the AGU Fall Meeting, Dec 10-15, 2017.
[07-14-2017] Our NSF Expeditions in Computing grant on "Water in the 21st century: A data-guided approach" got accepted for the final round of review (acceptance rate less than 4%) and has been invited for a reverse site visit at NSF headquarters in November 2017. This 5-year $10M grant builds upon my work on theory-guided data science for creating hybrid models of physics and data science in hydrology.
[06-29-2017] I served on a panel on ‘‘Theory-guided Data Science’’ at the 29th International Conference on Scientific and Statistical Database Management (SSDBM) in Chicago. We had a lively and engaging discussion and it was great to learn from everyone in the panel as well as the audience.
[06-28-2017] My perspective article on theory-guided data science got published at IEEE TKDE. A preprint of this article has already received 1500+ reads on ResearchGate (and still counting) even before publication. This overwhelming response indicates the promise in integrating scientific knowledge with data science methods—a trend that is simultaneously being realized in several scientific disciplines.
[05-25-2017] Our paper on monitoring surface water dynamics (in collaboration with Dr. Dennis Lettenmaier's research group at UCLA) got accepted for publication at Remote Sensing of Environment (RSE) 2017, a top-tier journal in remote sensing.
[05-16-2017] Our paper got accepted at KDD 2017.
[04-29-2017] I served on a panel at the SDM Workshop on Mining Big Data in Climate and Environment, where the topic of discussion was "Understanding and Narrowing Gaps Between Data Science and Mechanistic Theories in Physical Sciences". It was great interacting with everyone and discussing the future of theory-guided data science.
[10-11-2016] Our paper got accepted at IEEE International Conference on Big Data 2016.
[09-21-2016] Our global surface water monitoring system has been invited to contribute to the next generation of Essential Climate Variables (ECV), which will support the climate change adaptation and mitigation efforts of the United Nations Framework Convention on Climate Change (UNFCCC).
[08-23-2016] My work on monitoring surface water dynamics was featured as the central highlight of an NSF news story!
[08-14-2016] Presented our work on modeling the food-energy-water nexus in critical biodiverse landscapes in Cambodia at ACM KDD Workshop on Data Science for Food, Energy and Water. It was a great experience to interact with everyone at KDD!
[06-15-2016] My paper got published in IEEE Geoscience and Remote Sensing Magazine 2016, a top-tier publication venue in remote sensing.
[03-28-2016] Officially became a co-author of the second edition of the textbook "Introduction to Data Mining." I am excited to be part of this challenging but immensely gratifying journey!
[12-15-2015] My paper got published in IEEE Computing in Science & Engineering 2015.
[11-17-2015] Presented my paper at International Conference on Data Mining (ICDM) 2015.
[05-02-2015] Presented my paper at SIAM International Conference on Data Mining (SDM) 2015.
[04-28-2015] Received the University of Minnesota Doctoral Dissertation Award 2015-16.
[12-15-2014] Received the University of Minnesota Informatics Institute Fellowship 2015-16.
[04-26-2014] Presented my paper at SIAM International Conference on Data Mining (SDM) 2014.
[05-27-2013] Will be working as a summer intern at IBM Research, Yorktown Heights, NY, for the next three months. I am excited to work on spatio-temporal problems in analyzing crime data sets as part of the Smarter Planet Group at IBM.
[10-26-2012] Presented two papers at NASA Conference on Intelligent Data Understanding (CIDU) 2012.

Projects


Theory-guided Data Science

Theory-guided data science is an emerging paradigm of scientific discovery that aims to integrate scientific knowledge with data science methods to produce physically consistent results. My research builds the foundation of this paradigm and I am currently exploring this paradigm for problems in diverse disciplines such as hydrology, climate science, and computational chemistry.
TKDE 2017 ArXiv 2017

Introduction to Data Mining (Second Edition)

The second edition of this textbook presents new and improved content on several essential topics in data mining such as model overfitting, model evaluation, deep learning, class imbalance, and anomaly detection. Additionally, it introduces an entirely new chapter on avoiding false discoveries in data mining problems, a contribution that alternative resources do not cover at the required depth and breadth despite its importance.
Book2017

Spatio-temporal Data Mining

Space and time introduce several challenges and opportunities for classical data mining algorithms given the variety of data types, representations, problems, and methods in spatio-temporal settings. My recent survey provides an over-arching structure to the vast and diverse field of spatio-temporal data mining. A recurring theme of my research is to equip data mining methods with a better ability to deal with spatio-temporal data from Earth and environmental sciences.
CSUR2017 KDD2017 BigData2017 BigData2016 GRSM2016 CISE2015 ArXiv 2017

Predictive Learning with Heterogeneity in Populations

A central challenge in applying standard predictive learning methods to real-world problems is the heterogeneity in data populations, i.e., different groups of instances exhibit different predictive relationships. My dissertation research introduced several novel ways of addressing this challenge, building on ideas from multi-task learning and group-specific local learning.
Thesis2017 ICDM2015 SDM2015 SDM2014 CIDU2012

Global System for Mapping Surface Water Dynamics

My research has enabled a global surface water monitoring system that provides the first global history of surface water every 8 days for the last 15 years using high-resolution satellite data. This system captures vital information about changes occurring in surface water such as droughts, dam constructions, river meandering, and melting glacial lakes. The system was featured in an NSF news story.
RSE2017 CompSus2016


Publications


Book

[B1] P. Tan, M. Steinbach, A. Karpatne, and V. Kumar, Introduction to Data Mining, Pearson Addison–Wesley (Second Edition), ISBN-13: 978-0133128901, 2018 [Book Website].  

Journal Articles

[J10] A. Karpatne, G. Atluri, J. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data, IEEE Transactions on Knowledge and Data Engineering (TKDE), 29(10), 2318–2331, 2017 [arXiv, DOI].  
[J9] G. Atluri*, A. Karpatne*, and V. Kumar, Spatio-temporal Data Mining: A Survey of Problems and Methods, ACM Computing Surveys, 2017 (accepted; * equal contribution) [arXiv].  
[J8] A. Karpatne, I. Ebert-Uphoff, S. Ravela, H. A. Babaie, and V. Kumar, Machine Learning for the Geosciences: Challenges and Research Opportunities, IEEE TKDE, 2017 (in review) [arXiv].  
[J7] A. Khandelwal*, A. Karpatne*, M.E. Marlier*, J. Kim, D. P. Lettenmaier, and V. Kumar, An Approach for Global Monitoring of Surface Water Extent Variations using MODIS Data, Remote Sensing of Environment, Elsevier, 2017 (* equal contribution) [DOI].  
[J6] A. Karpatne, Z. Jiang, R. R. Vatsavai, S. Shekhar, and V. Kumar, Monitoring Land Cover Changes: A Machine Learning Perspective, IEEE Geoscience and Remote Sensing Magazine, 4(2), 8–21, 2016. [DOI].  
[J5] A. Karpatne and S. Liess, A Guide to Earth Science Data: Summary and Research Challenges, IEEE Computing in Science & Engineering, 17(6), 14–18, 2015. [DOI].  
[J4] F. Schrodt, J. Kattge, H. Shan, F. Fazayeli, J. Joswig, A. Banerjee, M. Reichstein, G. Bonisch, S. Diaz, J. Dickie, A. Gillison, A. Karpatne, S. Lavorel, P.W. Leadley, C. Wirth, I. Wright, S.J. Wright, and P.B. Reich, BHPMF - A Hierarchical Bayesian Approach to Gap-filling and Trait Prediction for Macroecology and Functional Biogeography, Global Ecology and Biogeography, 24(12), 1510–1521, 2015. [DOI].
[J3] R. Khemchandani, A. Karpatne, and S. Chandra, Twin Support Vector Regression for the Simultaneous Learning of a Function and its Derivatives, International Journal of Machine Learning and Cybernetics, 4(1), 51–63, 2013. [DOI].
[J2] R. Khemchandani, A. Karpatne, and S. Chandra, Proximal Support Tensor Machines, International Journal of Machine Learning and Cybernetics, 4(6), 703–712, 2013. [DOI].
[J1] R. Khemchandani, A. Karpatne, and S. Chandra, Generalized Eigenvalue Proximal Support Vector Regressor, Expert Systems with Applications, 38(10), 13136–13142, 2011 [DOI].

Peer-reviewed Conference Papers

[C9] A. Karpatne, W. Watkins, J. Read, and V. Kumar, Physics-guided Neural Networks (PGNN): An Application in Lake Temperature Modeling, SIAM International Conference on Data Mining (SDM), 2018 (in review) [arXiv].  
[C8] X. Jia, Y. Hu, A. Khandelwal, A. Karpatne, and V. Kumar, Joint Sparse Auto-encoder: A Semi-supervised Spatio-temporal Approach in Mapping Large-scale Croplands, IEEE International Conference on Big Data, 2017.  
[C7] S. Agrawal, G. Atluri, A. Karpatne, S. Chatterjee, S. Liess, and V. Kumar, Tripoles: A New Class of Relationships in Time Series Data, ACM International Conference on Knowledge Discovery and Data Mining (KDD), 697–706, 2017 [DOI].  
[C6] X. Jia, X. Chen, A. Karpatne, and V. Kumar, Identifying Dynamic Changes with Noisy Labels in Spatial-temporal Data: A Study on Large-scale Water Monitoring Application, IEEE International Conference on Big Data, 1328–1333, 2016 [DOI].  
[C5] A. Karpatne and V. Kumar, Adaptive heterogeneous ensemble learning using the context of test instances, IEEE International Conference on Data Mining (ICDM), 787–792, 2015. [DOI].  
[C4] A. Karpatne, A. Khandelwal, and V. Kumar, Ensemble learning methods for binary classification with multi-modality within the classes, SDM, (82) 730–738, 2015. [DOI].  
[C3] A. Karpatne, A. Khandelwal, S. Boriah, and V. Kumar, Predictive learning in the presence of heterogeneity and limited training data, SDM, (29) 253–261, 2014. [DOI].  
[C2] A. Karpatne, M. Blank, M. Lau, S. Boriah, K. Steinhaeuser, M. Steinbach, and V. Kumar, Importance of vegetation type in forest cover estimation, NASA Conference on Intelligent Data Understanding (CIDU), 71–78, 2012. [DOI].  
[C1] X. Chen*, A. Karpatne*, Y. Chamber*, V. Mithal, M. Lau, K. Steinhaeuser, S. Boriah, M. Steinbach, V. Kumar, C.S. Potter, S.A. Klooster, T. Abraham, J.D. Stanley, and J.C. Castilla-Rubio, A new data mining framework for forest fire mapping, CIDU, 104–111, 2012 (* equal contribution). [DOI].

Book Chapters

[BC2] A. Karpatne, A. Khandelwal, X. Chen, V. Mithal, J. Faghmous, and V. Kumar, Global monitoring of inland water dynamics: State-of-the-art, challenges, and opportunities, In Computational Sustainability, J. Lassig, K. Kersting, and K. Morik (Eds.), Springer, 121–147, 2016. [DOI].  
[BC1] A. Karpatne, J. Faghmous, J. Kawale, L. Styles, M. Blank, V. Mithal, X. Chen, A. Khandelwal, S. Boriah, K. Steinhaeuser, M. Steinbach, and V. Kumar, Earth science applications of sensor data, In Managing and Mining Sensor Data, C. Aggarwal (Ed.), Springer, 505–530, 2013. [DOI].

Peer-reviewed Workshop Proceedings

[W7] A. Karpatne and V. Kumar, Learning Physics-based Models in Hydrology under the Framework of Generative Adversarial Networks, American Geophysical Union (AGU) Fall Meeting, 2017.
[W6] A. Karpatne, W. Watkins, J. Read, and V. Kumar, Physics-guided Learning of Neural Networks: An Application in Lake Temperature Modeling, NIPS Workshop on Deep Learning for Physical Sciences, 2017.
[W5] A. Karpatne, H. Babaie, S. Ravela, V. Kumar, and I. Ebert-Uphoff, Machine Learning for the Geosciences--Opportunities, Challenges, and Implications for the ML process, SDM Workshop on Mining Big Data in Climate and Environment, 2017.
[W4] S. Gopal, A. Karpatne, and V. Kumar, Modeling the Food-Energy-Water Nexus in Critical Biodiverse Landscapes: A Case Study of Tonle Sap, Cambodia and Tulalip Tribe, USA, ACM KDD Workshop on Data Science for Food, Energy and Water, 2016 [Video].
[W3] A. Karpatne, A. Khandelwal, R. Anderson, M. Blank, S. Boriah, and V. Kumar, Group-specific local learning for global lake monitoring, Fourth International Workshop on Climate Informatics, 2014.
[W2] A. Karpatne, J. Faghmous, M. Blank, R. Anderson, S. Boriah, S. Liess, and V. Kumar, Understanding the Influence of Sea Surface Temperatures on Terrestrial Ecosystem Disturbances, Third International Workshop on Climate Informatics, 2013.
[W1] A. Karpatne, M. Blank, J. Middleton, S. Boriah, K. Steinhaeuser, M. Steinbach, S. Chatterjee, and V. Kumar, Understanding relationships between fire activity and sea surface temperature anomalies, American Geophysical Union (AGU) Fall Meeting, 2012.

Ph.D. Dissertation

Predictive Learning with Heterogeneity in Populations, University of Minnesota, 2017.  


For Prospective Students

I look forward to working with bright and ambitious students who are motivated to pursue research in machine learning and enable solutions to problems of great scientific and societal relevance. I find working on real-world problems to be both intellectually stimulating and socially rewarding, given the variety of challenges faced in analyzing complex physical data that offer fertile grounds for novel research. A major focus of my current work is in the area of theory-guided data science and I have several exciting projects in this area. If you are an undergraduate or graduate student who is interested in working with me, please feel free to shoot me an email with your CV/resume and information about yourself, including your major, technical background, specific application areas of interest (if any), and prior research experience (if any).