Batik: a Vision Language Model for End-to-End Social Behavior Discovery, Interpretation and Annotation

Author: Kolhe, Rohan Rajendra

Year: 2025

Degree: Senior thesis (Major)

Advisor: Anderson, David J.

Option: Computation and Neural Systems

Abstract

Quantitative analysis of animal behavior is a burgeoning field. By converting behavior into measurable features, the field replaces anecdotal observations with precise, data-driven insights into how animals interact with their environment and with one another. Such analyses reveal reproducible structure and diversity within behaviors, illuminating complex behavioral patterns. Most current state-of-the-art methods annotate and segment behavior using pose estimation: they track nodes (keypoints) placed on the body parts of mice, and a feature space is computed from the node positions. This feature space is then used to discover behavior classes or to train supervised behavior classifiers. However, this pipeline leaves out the time-consuming task of interpreting the resulting behavioral syllables, and it has multiple failure modes, such as an inability to attend to frames that contain other objects of interest or frames in which the nodes all overlap. Most of these methods are built on convolutional neural networks (CNNs). In recent years, a newer class of feed-forward neural networks called transformers has been shown to surpass CNNs on most vision-related tasks. Batik addresses the long time commitment of interpreting these syllables by using multimodal transformers to extract unsupervised features directly from raw video and perform end-to-end analysis, bypassing pose estimation. Alongside state-of-the-art supervised annotation, Batik leverages fine-tuned large language models to automate discovery and provide expert human-level interpretation of behavior syllables, offering researchers a transformative UI-based tool for behavioral analysis through vision-language models. With these methods, we show 96% accuracy for syllables such as attack and mount, a large improvement over previous methods (85%).
We also accurately identify differences in behavior across metabolic states and provide an interpretation of the broad investigative behavior in terms of its sub-behaviors. We further apply our method to datasets from other species, correctly classifying distinct aggressive behaviors in flies with no additional fine-tuning of the underlying model, which demonstrates the model’s generalizability.
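
The discovery step mentioned in the abstract, in which per-frame features are grouped into behavior classes, can be illustrated with a minimal sketch. It is not Batik's implementation, which the abstract does not specify. The features here are synthetic stand-ins for embeddings from a video model. The k-means clustering, the farthest-point initialization, and the helper names `farthest_point_init`, `kmeans`, and `to_bouts` are assumptions made purely for illustration.

```python
import numpy as np


def farthest_point_init(features, k):
    """Pick k initial centers: the first frame, then repeatedly the frame
    farthest from all centers chosen so far (deterministic)."""
    centers = [features[0]]
    for _ in range(k - 1):
        # Distance from every frame to its nearest already-chosen center.
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    return np.array(centers)


def kmeans(features, k, n_iter=50):
    """Plain k-means on per-frame feature vectors; returns (labels, centers)."""
    centers = farthest_point_init(features, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def to_bouts(labels):
    """Collapse a per-frame label sequence into (label, start, end) bouts,
    with end exclusive."""
    bouts, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            bouts.append((int(labels[start]), start, i))
            start = i
    return bouts


# Synthetic "video": 140 frames drawn from 3 hypothetical behaviors,
# each frame represented by an 8-dimensional stand-in embedding.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1, 2, 0], [40, 30, 50, 20])
means = np.array([0.0, 10.0, -10.0])
features = means[true_labels][:, None] + 0.5 * rng.standard_normal((len(true_labels), 8))

labels, centers = kmeans(features, k=3)
bouts = to_bouts(labels)
for label, start, end in bouts:
    print(f"cluster {label}: frames {start}-{end - 1}")
```

Cluster indices produced this way are arbitrary and carry no behavioral meaning on their own. Assigning them names such as attack or mount is the interpretation step that the abstract describes Batik as automating.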