Rewriting the Sequence and Structure Rules of Deep Protein Space

Author: Subramanian, Arjuna Michael

Year: 2026

Degree: Dissertation (Ph.D.)

Advisor: Thomson, Matthew

Committee Members: Mayo, Stephen L.; Murray, Richard M.; Thomson, Matthew; Winfree, Erik

Option: Biochemistry and Molecular Biophysics

DOI: 10.7907/p4st-m614

Abstract

With a 20-letter alphabet, conceivable protein sequence-space is enormous; sparks of structure and function are vanishingly rare. Despite massive advances in AI-guided protein design, we remain largely ignorant of the sequences and structures that populate the depths of protein space more than a handful of mutations away from what nature has tried. In this work, we leverage the potential of one specific class of AI protein model — the protein language model, or PLM — to internalize the essential features of the protein sequence-structure map while retaining the capacity to explore its extremes. Guided by a "novelty first, fitness next" mentality, we harness this balance towards systematic discovery of new-to-nature sequences and structures throughout deep protein space.

In the first section, we dissect the ability of PLMs to explore natural and novel regimes of sequence and structure during free generation. We find that while these models readily emit novel sequences encoding artificial proteins that appear biophysically feasible in silico, they fail to completely or representatively capture the known distribution of natural protein structures. We expose a fundamental tradeoff: a PLM can generate with sequence novelty or with structural coverage, but not both simultaneously. Prioritizing sampling of far-from-natural sequences triggers a collapse to a handful of simple structural motifs and disordered regions.

Turning this sequence novelty vs. structural breadth tradeoff to our advantage, the second section is devoted to the development of "foldtuning" — a structure-preserving, sequence-remodeling engine for navigating the far corners of sequence-space with PLM-based probes. We successfully scale and deploy foldtuning for >700 targets, pushing artificial sequences past the point of detectable homology to any real protein documented in nature, discovering novel sequence-level semantics and grammar for mimicking known protein folds, and accessing potential reservoirs of downstream structural and functional innovation. Experimental validation of select targets reveals that foldtuning produces realizable and functional binders in contexts including a toxin/antitoxin system and peptide hormone signaling.

Shifting to focus on structural novelty, the final section introduces two PLM-driven methods for the discovery of new-to-nature structures. We show that with appropriate steering functions, PLMs readily yield well-structured domains (featuring diverse secondary and supersecondary elements) outside the several thousand such families cataloged from among known proteins. Overall, this work makes substantial inroads towards the challenge of locating viable far-from-natural regions of protein density across the global sequence-structure map, and revises our notions of the physical constraints on sequence and structure in valid proteins. Moreover, it sets the stage for future assembly of synthetic biological systems composed fully of new-to-nature parts and ultimately for modeling efforts that close the design loop from sequence all the way to complex phenotype.
