Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods

Author: Li, Francesca-Zhoufan

Year: 2025

Degree: Dissertation (Ph.D.)

Advisors: Arnold, Frances Hamilton; Yue, Yisong

Committee Members: Murray, Richard M.; Mayo, Stephen L.; Yang, Kevin K.; Arnold, Frances Hamilton; Yue, Yisong

Option: Bioengineering

DOI: 10.7907/yzb2-cb66

Abstract

Engineered proteins can carry out a vast array of functions and have become indispensable across numerous industrial applications. To accelerate wet-lab protein engineering efforts, machine learning-based methods have advanced rapidly. However, a gap remains between state-of-the-art machine learning methods and their practical adoption. A key factor contributing to this disconnect is the lack of application-relevant benchmarking and generalizable insights across protein engineering tasks. This thesis evaluates machine learning-assisted protein engineering approaches to identify generalizable strategies. The central problem considered is learning the mapping from protein sequence to function, known as the fitness landscape, to enable the prediction of unseen variant fitness. Chapter 1 introduces the background and context for machine learning-assisted protein engineering and highlights the practical constraint of limited experimental budgets. Chapter 2 investigates transfer learning, which leverages models pretrained on large protein sequence databases to generate informative representations for modeling task-specific sequence-function relationships. Evaluation across ten diverse tasks shows that while transfer learning is effective in structure prediction, it underperforms in variant fitness prediction, a key objective in protein engineering. Chapter 3 evaluates alternative strategies with a focus on combinatorial fitness landscapes, a common setting in protein engineering. Across 16 diverse landscapes, focused training improves the performance of various machine learning approaches by strategically selecting training variants using zero-shot predictors, which estimate variant fitness from auxiliary information without relying on experimental data. Building on these insights, Chapter 4 addresses the specific challenge of engineering enzymes (proteins that convert substrates into products) for novel chemistries.
While six general zero-shot predictors without substrate information can predict enzyme activity on non-native substrates, they fail on more out-of-distribution, new-to-nature chemistries. Incorporating substrate information into zero-shot predictors leads to more generalizable performance across all tested chemistries, spanning 22 substrates. Overall, this thesis identifies generalizable strategies for machine learning-assisted protein engineering by systematically evaluating and improving how sequence-to-function relationships are modeled across diverse tasks.
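
The focused-training idea summarized above (using zero-shot scores to choose which variants to measure and train on) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name focused_training_set, the top_fraction parameter, the toy library, and the toy_zero_shot scorer are all hypothetical and exist only to show the selection step under simple assumptions.

```python
import itertools
import random


def focused_training_set(variants, zero_shot_score, n_train, top_fraction=0.125, seed=0):
    """Sample n_train variants from the top-ranked slice of a library.

    Hypothetical helper: ranks variants by a zero-shot score (higher is
    assumed better) and samples uniformly from the top `top_fraction`.
    """
    if n_train > len(variants):
        raise ValueError("n_train exceeds library size")
    ranked = sorted(variants, key=zero_shot_score, reverse=True)
    # Keep the pool at least as large as the requested training set.
    pool_size = max(n_train, int(len(ranked) * top_fraction))
    pool = ranked[:pool_size]
    return random.Random(seed).sample(pool, n_train)


# Toy combinatorial library: 3 sites x 4 residues = 64 variants (illustrative only).
library = ["".join(p) for p in itertools.product("ACDE", repeat=3)]


def toy_zero_shot(seq):
    # Stand-in scorer; a real zero-shot predictor would use auxiliary
    # information such as evolutionary or structural signals.
    return seq.count("A")


train = focused_training_set(library, toy_zero_shot, n_train=8, top_fraction=0.25)
```

In this sketch the sampled variants would then be assayed and used to fit a supervised model; a random-sampling baseline corresponds to top_fraction=1.0.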
