Abstract
Large language models (LLMs) have rapidly risen to the center stage of artificial intelligence as foundation models applicable to many downstream learning tasks. However, how to effectively build, train, and serve such models for high-stakes, first-principles-based scientific use cases is both of great interest and a great challenge. Moreover, pre-training LLMs with billions or even trillions of parameters can be prohibitively expensive, not just for academic institutions but also for well-funded industrial and government labs. Furthermore, the energy cost and environmental impact of developing LLMs must be kept in mind. In this work, we conduct a first-of-its-kind performance analysis to understand the time and energy cost of pre-training LLMs on the Department of Energy (DOE)’s leadership-class supercomputers. Employing state-of-the-art distributed training techniques, we evaluate the computational performance of various parallelization approaches at scale for a range of model sizes and establish a projection model for the cost of full training. Our findings provide baseline results, best practices, and heuristics for pre-training such large models that should be valuable to the HPC community at large. We also offer insights and optimization strategies for using the first exascale computing system, Frontier, to train models of the size of GPT-3 and beyond.
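To give a sense of what a "projection model for the cost of full training" can look like, the sketch below is a generic back-of-envelope estimate based on the widely used approximation of roughly 6 × parameters × tokens FLOPs for dense transformer pre-training. It is not the paper's actual projection model, and all numerical inputs (peak throughput per accelerator, model FLOPs utilization, power draw) are placeholder assumptions for illustration only.

```python
# Minimal sketch of a time/energy projection for LLM pre-training.
# Assumption: total training FLOPs ~= 6 * n_params * n_tokens (forward + backward
# pass for a dense transformer); all hardware numbers below are placeholders,
# not measurements from the paper.

def estimate_training_cost(
    n_params: float,      # model parameters, e.g. 175e9 for a GPT-3-scale model
    n_tokens: float,      # training tokens, e.g. 300e9
    n_gpus: int,          # number of accelerators used
    peak_tflops: float,   # assumed peak TFLOP/s per accelerator
    mfu: float,           # assumed model FLOPs utilization (sustained/peak)
    gpu_power_kw: float,  # assumed average power draw per accelerator, in kW
):
    total_flops = 6.0 * n_params * n_tokens                # total training work
    sustained_flops = n_gpus * peak_tflops * 1e12 * mfu    # achieved FLOP/s
    seconds = total_flops / sustained_flops                # wall-clock time
    gpu_hours = n_gpus * seconds / 3600.0                  # aggregate GPU-hours
    energy_mwh = gpu_hours * gpu_power_kw / 1000.0         # rough energy use
    return seconds / 86400.0, gpu_hours, energy_mwh


# Illustrative inputs only (assumed values, not results from the paper):
days, gpu_hours, mwh = estimate_training_cost(
    n_params=175e9, n_tokens=300e9, n_gpus=1024,
    peak_tflops=190.0, mfu=0.35, gpu_power_kw=0.5,
)
print(f"~{days:.1f} days, {gpu_hours:,.0f} GPU-hours, ~{mwh:.0f} MWh")
```

Such a first-order estimate only sets the scale of the problem; measured throughput of the chosen parallelization strategy (data, tensor, and pipeline parallelism) is what ultimately determines the achievable utilization term.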
Original language | English |
---|---|
Pages (from-to) | 20747-20768 |
Number of pages | 22 |
Journal | Journal of Supercomputing |
Volume | 79 |
Issue number | 18 |
DOIs | |
State | Published - Dec 2023 |
Funding
J.Y. would like to thank Quentin Anthony and Stella Biderman from EleutherAI, and Less Wright and Geeta Chauhan from the PyTorch team, for the valuable discussions. This research was sponsored by and used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility at the Oak Ridge National Laboratory supported by the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Funders | Funder number |
---|---|
U.S. Department of Energy | DE-AC05-00OR22725 |
Office of Science | |
Keywords
- AI foundation model
- Distributed training
- Performance analysis