Beyond Text: The Rudiments of Next Generation Foundation Models
Author: Talukder, Sabera
Year: 2026
Degree: Dissertation (Ph.D.)
Advisors: Yue, Yisong; Gkioxari, Georgia
Committee Members: Perona, Pietro; Yue, Yisong; Gkioxari, Georgia; Wierman, Adam C.
Option: Neurobiology
DOI: 10.7907/s12m-5692
Abstract
This thesis builds non-text and multimodal foundation models that overcome the challenges of non-text data. For images and time series (e.g., audio and video), these challenges are data heterogeneity, data continuity, and large memory requirements. To overcome them, we must build models that are information dense, generalizable, and multimodal. By the end of this thesis, we will have empirically demonstrated that the recipe for performant non-text and multimodal foundation models is to create discrete, information-dense representations; train models on large-scale data in the most generalizable manner possible; and fuse data modalities early in the modeling stack.