Understanding and Improving Efficiency in Training of Deep Neural Networks

Author: Zhao, Jiawei

Year: 2025

Degree: Dissertation (Ph.D.)

Advisor: Anandkumar, Anima

Committee Members: Wierman, Adam C.; Anandkumar, Anima; Mazumdar, Eric V.; Chen, Beidi; Tian, Yuandong

Option: Computing and Mathematical Sciences

Abstract

As deep neural networks (DNNs) continue to drive progress in fields like computer vision and natural language processing, their increasing complexity presents significant challenges for training efficiency, particularly in large language models (LLMs). These challenges include memory limitations, energy consumption, and bandwidth constraints during training.

In this thesis, I address these challenges by analyzing the training dynamics of DNNs and proposing hardware-efficient learning algorithms. First, I focus on mitigating memory limitations in LLM training, which requires substantial memory for parameters, gradients, and optimizer states, often exceeding standard hardware capacity. To tackle this, I propose GaLore, a memory-efficient training algorithm that reduces the memory footprint of LLM training by up to 65.5% while preserving performance. Additionally, I introduce InRank, an incremental low-rank learning algorithm that further reduces memory usage by gradually increasing matrix rank.
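To make the memory pressure from optimizer states concrete, the following sketch counts the values an Adam-style optimizer stores for a single weight matrix, and compares that with keeping the same state in a rank-r subspace. The layer shape, the rank, and the accounting itself (two moment tensors per parameter, plus one projection matrix in the low-rank case) are illustrative assumptions, not figures or code from the thesis.

```python
def full_rank_state_count(m: int, n: int) -> int:
    """Optimizer-state values for an m x n weight with Adam-style moments."""
    # Assumption: two moment tensors (first and second), each of shape m x n.
    return 2 * m * n


def low_rank_state_count(m: int, n: int, r: int) -> int:
    """Optimizer-state values when the moments live in a rank-r subspace."""
    # Assumption: one m x r projection matrix plus two r x n moment tensors.
    return m * r + 2 * r * n


if __name__ == "__main__":
    m, n, r = 4096, 4096, 128  # hypothetical layer shape and rank
    full = full_rank_state_count(m, n)
    low = low_rank_state_count(m, n, r)
    print(f"full-rank state: {full:,} values")
    print(f"rank-{r} state: {low:,} values")
    print(f"reduction: {1 - low / full:.1%}")
```

This counts optimizer state for one layer only. It leaves out parameters, gradients, and activations, so its output is not comparable to the end-to-end 65.5% figure quoted above.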

Next, I address the high energy consumption of training large models, which contributes to their environmental impact. To mitigate this, I propose LNS-Madam, a low-precision training algorithm that uses the logarithmic number system (LNS) to lower energy consumption without compromising accuracy. LNS-Madam achieves up to 90% energy savings compared to a full-precision baseline.
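As background on the number format, a logarithmic number system stores a sign together with a quantized base-2 logarithm of the magnitude, so that multiplication reduces to adding exponent codes. The sketch below shows only this generic idea. The function names and the `frac_bits` parameter are hypothetical, and the sketch does not reproduce the number format or update rule of LNS-Madam.

```python
import math
from typing import Tuple

LnsValue = Tuple[int, int]  # (sign, integer code for the quantized log2 magnitude)


def lns_encode(x: float, frac_bits: int = 3) -> LnsValue:
    """Quantize log2(|x|) to steps of 2**-frac_bits and keep the sign."""
    if x == 0.0:
        # Zero has no finite logarithm; a real format would reserve a special code.
        raise ValueError("zero is not representable in this sketch")
    sign = -1 if x < 0 else 1
    code = round(math.log2(abs(x)) * (1 << frac_bits))
    return sign, code


def lns_decode(value: LnsValue, frac_bits: int = 3) -> float:
    """Map a (sign, code) pair back to a float."""
    sign, code = value
    return sign * 2.0 ** (code / (1 << frac_bits))


def lns_multiply(a: LnsValue, b: LnsValue) -> LnsValue:
    """Multiply in the log domain: signs multiply, exponent codes add."""
    return a[0] * b[0], a[1] + b[1]


if __name__ == "__main__":
    product = lns_multiply(lns_encode(2.0), lns_encode(-4.0))
    print(lns_decode(product))
    print(lns_decode(lns_encode(3.0)))  # close to 3.0, not exact, due to rounding
```

Replacing multipliers with integer adders is the usual argument for the energy appeal of such formats. The price is a bounded relative rounding error on every value that is not an exact code.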

Finally, I focus on bandwidth limitations in distributed training. LLM training is often distributed across multiple devices for speed, but limited network bandwidth can create communication bottlenecks that slow it down. To resolve this, I introduce signSGD with Majority Vote, a communication-efficient training algorithm that reduces the communication overhead of distributed training.
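The idea suggested by the name can be sketched as follows: each worker sends only the sign of each gradient coordinate, the server takes a per-coordinate majority vote, and every worker applies the voted sign scaled by the learning rate. This is a single-process simulation under those assumptions. The function names, the toy gradients, and the tie-handling rule (a tied vote yields no update) are illustrative choices, not the thesis implementation.

```python
from typing import List


def sign(v: float) -> int:
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def majority_vote(worker_grads: List[List[float]]) -> List[int]:
    """Aggregate per-worker gradients by a coordinate-wise vote over their signs."""
    # Each worker would transmit only these signs (about one bit per coordinate).
    signs = [[sign(g) for g in grad] for grad in worker_grads]
    # Server side: sum the signs per coordinate, then take the sign of the sum.
    # Assumption: a tie sums to zero and therefore produces no update.
    return [sign(sum(column)) for column in zip(*signs)]


def apply_update(params: List[float], vote: List[int], lr: float) -> List[float]:
    """Step each parameter against the voted sign."""
    return [p - lr * v for p, v in zip(params, vote)]


if __name__ == "__main__":
    grads = [
        [0.5, -1.0, 2.0],   # worker 0
        [0.1, -0.2, -3.0],  # worker 1
        [-0.4, -0.9, 1.0],  # worker 2
    ]
    vote = majority_vote(grads)
    print(vote)
    print(apply_update([0.0, 0.0, 0.0], vote, lr=0.1))
```

Because only signs cross the network in both directions, the communicated volume per step is far smaller than sending full-precision gradients. The large outlier from worker 1 on the last coordinate also shows that magnitude does not influence the vote.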

Files

thesis_jiaweizhao_final_v1.pdf (application/pdf)