Optimizing Distributed Training on Frontier for Large Language Models

Sajal Dash, Isaac R. Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, J. Austin Ellis, Matthias Maiterth, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Loss scaling studies have demonstrated the superior performance of larger LLMs compared to their smaller counterparts. Nevertheless, training LLMs with billions of parameters poses significant challenges and requires considerable computational resources. For example, training a one trillion parameter GPT-style model on 20 trillion tokens requires a staggering 120 million exaflops. This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer. We enable and investigate various model and data parallel training techniques, such as tensor parallelism, pipeline parallelism, and sharded data parallelism, to facilitate training a trillion-parameter model on Frontier. We empirically assess these techniques and their associated parameters to determine their impact on memory footprint, communication latency, and GPU computational efficiency. We analyze the complex interplay among these techniques and find a strategy to combine them to achieve high throughput through hyperparameter tuning. We have identified efficient strategies for training large LLMs of varying sizes through empirical analysis and hyperparameter tuning. For 22 Billion, 175 Billion, and 1 Trillion parameters, we achieved GPU throughputs of 38.38%, 36.14%, and 31.96%, respectively. For the training of the 175 Billion parameter model and the 1 Trillion parameter model, we achieved 100% weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively. We also achieved strong scaling efficiencies of 89% and 87% for these two models. We trained these models for only tens of iterations rather than to completion.
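The abstract's headline compute figure and parallel layout can be sanity-checked with simple arithmetic. The Python sketch below reproduces the 120-million-exaflop estimate using the common ~6·N·D FLOPs-per-token rule and shows how a tensor × pipeline × data parallel decomposition covers 3072 GPUs; the FLOP-counting convention and the example parallel degrees are illustrative assumptions, not values taken from the paper.

```python
# Sanity checks for the figures quoted in the abstract (illustrative only).
# Assumptions not taken from the paper: the standard ~6*N*D FLOP estimate for
# transformer training, and the example tensor/pipeline parallel degrees below.

N = 1e12    # model parameters (1 trillion)
D = 20e12   # training tokens (20 trillion)

total_flops = 6 * N * D  # ~6 FLOPs per parameter per token (forward + backward)
print(f"Total training compute: {total_flops:.1e} FLOPs "
      f"= {total_flops / 1e18 / 1e6:.0f} million exaflops")  # -> 120 million

# Hypothetical 3D-parallel layout over the 3072 GPUs used for the 1T-parameter run:
# total GPUs = tensor-parallel size x pipeline-parallel size x data-parallel size
tensor_parallel = 8      # shards each layer's weight matrices across GPUs
pipeline_parallel = 24   # splits the layer stack into sequential stages
data_parallel = 3072 // (tensor_parallel * pipeline_parallel)
print(f"Data-parallel replicas under this layout: {data_parallel}")  # -> 16
```

Under these assumed degrees, the remaining data-parallel dimension yields 16 model replicas; the paper itself determines such degrees empirically through hyperparameter tuning rather than fixing them a priori.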

Original language: English
Title of host publication: Research Paper Proceedings of the ISC High Performance 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9783982633602
State: Published - 2024
Event: 39th International Conference on High Performance Computing, ISC High Performance 2024 - Hamburg, Germany
Duration: May 12, 2024 – May 16, 2024

Publication series

Name: Research Paper Proceedings of the ISC High Performance 2024

Conference

Conference: 39th International Conference on High Performance Computing, ISC High Performance 2024
Country/Territory: Germany
City: Hamburg
Period: 05/12/24 – 05/16/24
