PhD defence: Mads Herbert Kerrn
Title
Machine Learning in Protein Discovery and Optimization: Leveraging representation learning in data scarce regimes
Abstract
This thesis explores machine learning and statistical methods in proteomics and protein engineering, addressing challenges in analyzing and optimizing protein properties for scientific and industrial applications. It comprises three projects.
First, we concentrate on unwanted variability in mass spectrometry data. Such data are often affected by tissue heterogeneity, which introduces artifacts into the statistical analysis. We develop statistical methods that leverage simple representation learning and linear models to mitigate these artifacts. Our approach improves the detection of tissue-specific protein signals, providing a robust framework for proteomics data analysis.
Subsequently, we focus on predicting protein variant effects, a critical task in protein engineering. We introduce Kermut, a Gaussian process model that integrates prior biological knowledge and advanced protein representations derived from protein language models, inverse folding models, and structure prediction models. Kermut achieves state-of-the-art prediction performance.
Finally, we exploit Kermut’s capabilities to develop KABOOM, a Bayesian optimization strategy for protein engineering. KABOOM leverages Kermut’s predictive performance and uncertainty estimates to guide protein optimization in the vast space of protein variants. Using computational benchmarks, we demonstrate KABOOM’s performance relative to traditional and recent machine-learning-guided protein engineering approaches.
Assessment committee:
Associate Professor Oswin Krause (Chairperson)
Professor Søren Hauberg
Professor Carl Henrik Ek
Principal supervisor during the PhD programme:
Associate Professor Wouter Krogh Boomsma
Place
Lille UP1, Universitetsparken 1, København N.
Ask for a copy of the thesis at madskerrn@gmail.com