Tokens, Topologies, Taxa: Towards Declarative Biology and Bioengineering

Author: Martinez, Zachary A.

Year: 2026

Degree: Dissertation (Ph.D.)

Advisors: Thomson, Matthew W.; Murray, Richard M.

Committee Members: Mazmanian, Sarkis K.; Wang, Kaihang; Bois, Justin; Thomson, Matthew W.; Murray, Richard M.

Option: Bioengineering

DOI: 10.7907/b5ye-jy33

Abstract

Contemporary deep-learning models for the life sciences have outpaced the tooling that lets experimentalists compose them. Three contributions are presented in response: a software platform, exemplary tasks built on it, and a predicted structural proteome of a defined gut microbiome. The underlying argument is that for experimentalists who use rather than build deep-learning methods, the limiting difficulties are now composition and usability rather than availability.

TRILL, a platform for AI-based protein engineering and analysis, is open-source, runs locally, and wraps models and methods behind a uniform vocabulary of thirteen top-level commands. TRILL is also scalable, ranging from parallel fine-tuning of large models on a supercomputer to democratized parameter off-loading in compute-limited scenarios. Models can be swapped with a one-argument change rather than a pipeline rewrite, and fast predictions can be paired with physics-based validation where overconfidence costs most.

Protein language models were fine-tuned using a homology-aware strategy, decreasing data leakage when evaluating generated proteins. Classifiers for cellulase, antimicrobial, and toxin activity were trained and applied to a scan of over two hundred million proteins from the NCBI non-redundant catalogue. An end-to-end pipeline carried seventeen predicted toxins of unknown function through structure prediction, binder design, and molecular dynamics on nearly nine hundred designed complexes.

The third contribution targets hCom2, a defined synthetic gut consortium. We present a structural resource in which roughly four hundred thousand structures from its proteome were predicted using TRILL, segmented into eight hundred thousand domains, and assigned CATH designations. A case study demonstrating the utility of this structural database identifies nineteen carriers of the Helicobacter pylori virulence-factor TIPalpha fold across fourteen strains where sequence-only annotation fails.
