Strategies and Tools for Machine Learning-Assisted Protein Engineering

Author: Wittmann, Bruce James

Year: 2022

Degree: Dissertation (Ph.D.)

Advisor: Arnold, Frances Hamilton

Committee Members: Pachter, Lior S.; Reisman, Sarah E.; Mayo, Stephen L.; Arnold, Frances Hamilton

Option: Bioengineering

Abstract

Proteins perform critical roles in a growing list of human-devised applications, and as demands for new applications arise, new proteins must be engineered to meet them. Machine learning-assisted protein engineering (MLPE) has recently arisen as a new philosophy of protein engineering, promising to overcome many of the limitations of existing engineering strategies. As a relatively new approach, however, MLPE still faces many challenges that hinder its routine application, and this thesis addresses a number of them. Chapter 1 provides a theoretical overview of protein engineering, introduces the core steps of a typical MLPE pipeline, and discusses the challenges that currently hinder MLPE’s advancement. This chapter is written to be accessible to all members of the highly multidisciplinary audience that either use or develop MLPE tools, in turn providing a resource that eliminates the steep barrier to entry that can hinder broader participation in the field. Chapter 2 provides a solution to the challenge of applying MLPE to proteins whose fitness landscapes are dominated by “holes” (protein variants with zero or extremely low fitness). Using my development of the strategy “focused training machine learning-assisted directed evolution” (ftMLDE) as an example, I demonstrate how auxiliary information from protein sequence and structure can be used to navigate landscapes despite holes, in turn dramatically improving the efficiency of MLPE. Chapter 3 explores strategies for reducing the amount of sequence-fitness data needed for building MLPE models. Specifically, I detail the motivation behind and development of a new model designed to augment limited protein sequence-fitness datasets with information extracted from raw protein sequence and structure data.
Finally, Chapter 4 introduces “every variant sequencing” (evSeq), a collection of tools and protocols that enables extremely low-cost, routine collection of large protein sequence-fitness datasets. Not only does this technology drastically improve the financial feasibility of numerous MLPE applications, but it also potentiates the construction of a massive database of diverse protein sequence-fitness data, the likes of which would revolutionize our ability to engineer proteins with data-driven methods. Overall, the work described in this thesis advances both our understanding of MLPE and our ability to engineer proteins using it.

Files

Data S1.csv (application/csv)
Data S2.csv (application/csv)
Data S3.csv (application/csv)
WittmannBruce_Thesis.pdf (application/pdf)