Machine Learning and Modeling Methods for Protein Engineering

Author: Aceves, Aiden Joseph

Year: 2022

Degree: Dissertation (Ph.D.)

Advisor: Mayo, Stephen L.

Committee Members: Bjorkman, Pamela J.; Miller, Thomas F.; Yue, Yisong; Van Valen, David A.; Mayo, Stephen L.

Option: Bioengineering

DOI: 10.7907/5j34-0d55

Abstract

Computation has been an integral part of structural biology ever since the first protein structure was solved via Fourier synthesis on the EDSAC Mark I electronic computer in 1958 (Kendrew et al., 1958). Throughout my time at Caltech, I have endeavored to develop new methods for applying machine learning and molecular modeling to the study of biological macromolecules. These efforts have followed two distinct tracks, unified by a focus on studying proteins at the structural level.

Through the application of molecular dynamics and modeling, I have studied insulin from several angles, including the incorporation of non-canonical amino acids and how these modifications might alter critical properties such as hexamer dissociation and fibril formation. Additionally, I have probed how insulin behaves at the interface of water and silica, a property that is critical for the effective distribution and administration of this therapeutic molecule. I have also helped to develop a novel computationally guided workflow for integrating drug conjugates into antibody CDRs. This technique yields molecules that exhibit synergistic binding and enhanced selectivity.

The second major thrust of my research has focused on applying machine learning to protein engineering problems, particularly developing tools for working with structural data and for making efficient re-use of data that has already been laboriously collected by other groups. The basic data parsing and processing tools that were created and refined over the course of my time at Caltech have enabled many other projects, both my own and those of collaborators. Studies of generative networks for protein-protein docking have yielded useful insights into network architecture, the inclusion of intermediate learning objectives, and overcoming sparsity. Our ICLR 2021 paper introduces a regularization method that enables data from past protein engineering campaigns to be leveraged to learn policies that optimally select molecules to synthesize in unrelated engineering efforts, potentially saving a significant amount of time and money in future projects.

Reference

Kendrew, J. C.; Bodo, G.; Dintzis, H. M.; Parrish, R. G.; Wyckoff, H.; Phillips, D. C. "A Three-Dimensional Model of the Myoglobin Molecule Obtained by X-Ray Analysis". Nature 1958, 181 (4610), 662–666.
