CaltechTHESIS
A Caltech Library Service

Rewriting the Sequence and Structure Rules of Deep Protein Space

Citation

Subramanian, Arjuna Michael (2026) Rewriting the Sequence and Structure Rules of Deep Protein Space. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/p4st-m614. https://resolver.caltech.edu/CaltechTHESIS:09162025-184128136

Abstract

With a 20-letter alphabet, conceivable protein sequence-space is enormous; sparks of structure and function are vanishingly rare. Despite massive advances in AI-guided protein design, we remain largely ignorant of the sequences and structures that populate the depths of protein space more than a handful of mutations away from what nature has tried. In this work, we leverage the potential of one specific class of AI protein model — the protein language model, or PLM — to internalize the essential features of the protein sequence-structure map while retaining the capacity to explore its extremes. Guided by a "novelty first, fitness next" mentality, we harness this balance towards systematic discovery of new-to-nature sequences and structures throughout deep protein space.

In the first section, we dissect the ability of PLMs to explore natural and novel regimes of sequence and structure during free generation. We find that while these models readily emit novel sequences encoding artificial proteins that appear biophysically feasible in silico, they fail to completely or representatively capture the known distribution of natural protein structures. We expose a fundamental tradeoff: a PLM can generate with sequence novelty or with structural coverage, but not both simultaneously. Prioritizing sampling of far-from-natural sequences triggers a collapse to a handful of simple structural motifs and disordered regions.

Turning this sequence novelty vs. structural breadth tradeoff to our advantage, the second section is devoted to the development of "foldtuning" — a structure-preserving, sequence-remodeling engine for navigating the far corners of sequence-space with PLM-based probes. We successfully scale and deploy foldtuning for >700 targets, pushing artificial sequences past the point of detectable homology to any real protein documented in nature, discovering novel sequence-level semantics and grammar for mimicking known protein folds, and accessing potential reservoirs of downstream structural and functional innovation. Experimental validation of select targets reveals that foldtuning produces realizable and functional binders in contexts including a toxin/antitoxin system and peptide hormone signaling.

Shifting to focus on structural novelty, the final section introduces two PLM-driven methods for the discovery of new-to-nature structures. We show that with appropriate steering functions, PLMs readily yield well-structured domains (featuring diverse secondary and supersecondary elements) outside the several thousand such families cataloged from among known proteins. Overall, this work makes substantial inroads towards the challenge of locating viable far-from-natural regions of protein density across the global sequence-structure map, and revises our notions of the physical constraints on sequence and structure in valid proteins. Moreover, it sets the stage for future assembly of synthetic biological systems composed fully of new-to-nature parts and ultimately for modeling efforts that close the design loop from sequence all the way to complex phenotype.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: synthetic biology; protein design; artificial intelligence; protein language models; protein structure; structural bioinformatics
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Biochemistry and Molecular Biophysics
Awards: Everhart Distinguished Graduate Student Lecturer Award, 2025
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Thomson, Matthew
Thesis Committee:
  • Mayo, Stephen L. (chair)
  • Murray, Richard M.
  • Thomson, Matthew
  • Winfree, Erik
Defense Date: 15 September 2025
Funders:
  • NIH, grant R01-GM150125
  • Gordon and Betty Moore Foundation, grant 12500072
  • Frontier Model Forum, grant number unspecified
Record Number: CaltechTHESIS:09162025-184128136
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:09162025-184128136
DOI: 10.7907/p4st-m614
Related URLs:
  • https://doi.org/10.1101/2025.10.01.679905 (DOI): Article adapted for chapter 2
  • https://doi.org/10.1101/2023.12.22.573145 (DOI): Article adapted for chapters 3-4
  • https://doi.org/10.1101/2024.12.20.629847 (DOI): Article adapted in part for chapter 4
  • https://doi.org/10.1101/2025.10.02.679910 (DOI): Article adapted for chapter 5
ORCID:
  • Subramanian, Arjuna Michael: 0009-0004-2790-0209
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 17682
Collection: CaltechTHESIS
Deposited By: Arjuna Subramanian
Deposited On: 05 Oct 2025 12:37
Last Modified: 14 Oct 2025 19:57

Thesis Files

  • PDF (Complete Thesis) - Final Version, 98MB
  • PDF (Chapter 1) - Final Version, 295kB
  • PDF (Chapter 2) - Final Version, 14MB
  • PDF (Chapter 3) - Final Version, 39MB
  • PDF (Chapter 4) - Final Version, 32MB
  • PDF (Chapter 5) - Final Version, 25MB
  • PDF (Chapter 6) - Final Version, 253kB
All files: see Usage Policy.
