Beyond Text: The Rudiments of Next Generation Foundation Models

Author: Talukder, Sabera

Year: 2026

Degree: Dissertation (Ph.D.)

Advisors: Yue, Yisong; Gkioxari, Georgia

Committee Members: Perona, Pietro; Yue, Yisong; Gkioxari, Georgia; Wierman, Adam C.

Option: Neurobiology

DOI: 10.7907/s12m-5692

Abstract

This thesis builds non-text and multimodal foundation models that overcome the difficulties of non-text data. For images and time series (e.g., audio and video), these difficulties are data heterogeneity, data continuity, and large memory requirements. To overcome them, we must build models that are information dense, generalizable, and multimodal. By the end of this thesis, we will have empirically demonstrated that the recipe for performant non-text and multimodal foundation models is to create discrete, information-dense representations; train models on large-scale data in the most generalizable manner possible; and fuse data modalities early in the modeling stack.