Data-Driven Protein Engineering

Author: Wu, Zachary

Year: 2021

Degree: Dissertation (Ph.D.)

Advisor: Arnold, Frances Hamilton

Committee Members: Tirrell, David A.; Wang, Zhen-Gang; Yue, Yisong; Arnold, Frances Hamilton

Option: Chemical Engineering

DOI: 10.7907/nx3c-qb44

Abstract

Directed evolution has enabled the adaptation of natural protein sequences for a wide variety of human applications. Given a starting point (a sequence with measurable activity), directed evolution improves protein sequences by iteratively accumulating beneficial mutations. However, directed evolution requires a large experimental effort, which remains the major bottleneck in efficient protein optimization. To address this bottleneck, in Chapter 2 we describe a framework for incorporating machine learning into the directed evolution process to maximize the utility of the experimental data generated. In Chapter 3, we show that this framework outperforms traditional directed evolution methods on an empirical fitness landscape. Directed evolution is also fundamentally limited by its need for a starting point, that is, a sequence with measurable activity. To tackle this issue, in Chapter 4 we test the ability of nascent deep learning techniques to generate short, functional amino acid sequences. Encouraged by this success, we attempted to generate full-length enzymatic sequences for desired substrates, but these attempts were unsuccessful. We were, however, able to apply this deep learning approach to model other aspects of enzymatic protein sequences in Chapter 5. Finally, the field of data-driven protein sequence generation is enjoying a recent surge in interest, and in Chapter 1 we provide an updated review of protein engineering with machine learning, focusing on recent work in deep generative modeling.
