Abstract
Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses the current climate AI foundation model size by a thousandfold. Performance scaling tests conducted on the Frontier supercomputer have demonstrated that ORBIT achieves 684 petaFLOPS to 1.6 exaFLOPS sustained throughput, with scaling efficiency maintained at 41% to 85% across 49,152 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and demonstrate promise to significantly improve Earth system predictability.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2024 |
Subtitle of host publication | International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9798350352917 |
State | Published - 2024 |
Event | 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 - Atlanta, United States; Duration: Nov 17 2024 → Nov 22 2024 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 |
---|---|
Country/Territory | United States |
City | Atlanta |
Period | 11/17/24 → 11/22/24 |
Funding
The authors thank Verónica G. Melesse Vergara, Mallikarjun (Arjun) Shankar and Bronson Messer for their support of high performance computing resources. Additionally, we thank Vishwas Rao for his valuable feedback on the development of this paper. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher acknowledges that the US government retains a nonexclusive worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so for US government purposes. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory (ORNL), which is supported by the Office of Science of the US Department of Energy (DOE). This research was primarily supported by ORNL's AI Initiative, sponsored by the Director's Research and Development Program at ORNL, and additionally supported by the BER-ASCR SciDAC Program in the DOE and by a DOE Early Career Project sponsored by the BER program.