CaltechTHESIS
A Caltech Library Service

Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods

Citation

Li, Francesca-Zhoufan (2025) Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/yzb2-cb66. https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434

Abstract

Engineered proteins can carry out a vast array of functions and have become indispensable across numerous industrial applications. To accelerate wet-lab protein engineering efforts, machine learning-based methods have advanced rapidly. However, a gap remains between state-of-the-art machine learning methods and their practical adoption. A key factor contributing to this disconnect is the lack of application-relevant benchmarking and generalizable insights across protein engineering tasks. This thesis evaluates machine learning-assisted protein engineering approaches to identify generalizable strategies. The central problem considered is learning the mapping from protein sequence to function, known as the fitness landscape, to enable the prediction of unseen variant fitness. Chapter 1 introduces the background and context for machine learning-assisted protein engineering and highlights the practical constraint of limited experimental budgets. Chapter 2 investigates transfer learning, which leverages models pretrained on large protein sequence databases to generate informative representations for modeling task-specific sequence-function relationships. Evaluation across ten diverse tasks shows that while transfer learning is effective in structure prediction, it underperforms in variant fitness prediction, a key objective in protein engineering. Chapter 3 evaluates alternative strategies with a focus on combinatorial fitness landscapes, a common setting in protein engineering. Across 16 diverse landscapes, focused training improves the performance of various machine learning approaches by strategically selecting training variants using zero-shot predictors, which estimate variant fitness from auxiliary information without relying on experimental data. Building on these insights, Chapter 4 addresses the specific challenge of engineering enzymes (proteins that convert substrates into products) for novel chemistries. While six general zero-shot predictors without substrate information can predict enzyme activity on non-native substrates, they fail on more out-of-distribution, new-to-nature chemistries. Incorporating substrate information into zero-shot predictors leads to more generalizable performance across all tested chemistries, spanning 22 substrates. Overall, this thesis identifies generalizable strategies for machine learning-assisted protein engineering by systematically evaluating and improving how sequence-to-function relationships are modeled across diverse tasks.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: protein engineering, machine learning, evaluation, generalizability, fitness prediction, protein language models, transfer learning, combinatorial mutagenesis, directed evolution, epistasis, zero-shot predictor, enzyme engineering, substrate-aware, non-native substrate, new-to-nature
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Bioengineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton (advisor)
  • Yue, Yisong (advisor)
Thesis Committee:
  • Murray, Richard M. (chair)
  • Mayo, Stephen L. (co-chair)
  • Yang, Kevin K.
  • Arnold, Frances Hamilton
  • Yue, Yisong
Defense Date: 12 May 2025
Non-Caltech Author Email: francesca.zf.l (AT) berkeley.edu
Funders:
  • Graduate Research Fellowship Program (grant number: UNSPECIFIED)
  • Amazon AI4Science Fellowship (grant number: UNSPECIFIED)
  • NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (grant number: CBET 1937902)
  • Amgen Chem-Bio-Engineering Award (grant number: CBEA AMGEN.ARNOLD22)
  • U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences (grant number: DE-SC0022218)
Record Number: CaltechTHESIS:04292025-041809434
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434
DOI: 10.7907/yzb2-cb66
Related URLs:
  • https://proceedings.mlr.press/v235/li24a.html (URL type: DOI): Article adapted for chapter 2
  • https://doi.org/10.1101/2024.10.24.619774 (URL type: DOI): Article adapted for chapter 3
  • https://openreview.net/forum?id=IqPlnXw1BJ (URL type: arXiv): Article adapted for chapter 4
ORCID:
  • Li, Francesca-Zhoufan: 0000-0002-5710-9512
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 17182
Collection: CaltechTHESIS
Deposited By: Francesca-Zhoufan Li
Deposited On: 04 Jun 2025 22:33
Last Modified: 11 Jun 2025 17:09

Thesis Files

PDF - Final Version (29MB). See Usage Policy.
