As we advance into the era of big data, machine learning (and, more recently, deep learning) methods have had immense success in extracting complex knowledge from large volumes of data in fields such as computer vision, speech recognition, and natural language translation. Given their accomplishments in commercial applications involving “Internet-scale” data, there is great anticipation about whether machine learning (ML) methods can accelerate knowledge discovery in scientific domains that have traditionally progressed via theory-based models.
“Can black-box ML methods make existing theory-based models obsolete? Can we disregard existing scientific theories and completely rely on the information contained in data?”
This course will systematically investigate the pitfalls of black-box applications of ML to real-world scientific problems, which involve incomplete or imperfect data with non-stationary and chaotic behavior. To overcome these limitations, the course will present an emerging research paradigm, termed theory-guided data science (TGDS), that aims to fully leverage the power of ML methods to automatically extract patterns and models from data without ignoring the wealth of knowledge accumulated in scientific theories. The course will cover a variety of topics in TGDS, ranging from physics-inspired design of ML models to the joint use of theory-based models and ML models. These topics will be illustrated using examples from diverse scientific domains, including climate science, hydrology, biology, fluid dynamics, aerospace, and chemistry. The course will also provide hands-on experience in TGDS through course projects and research presentations.