Abstract
Large-scale deep learning (DL) on HPC systems such as Frontier and Summit relies on distributed node-local caching to address scalability and performance challenges. However, as these systems grow more complex, the risk of node failures increases, and current caching approaches lack fault tolerance, jeopardizing large-scale training jobs. We analyzed six months of SLURM job logs from Frontier and found that over 30% of jobs failed, after an average runtime of 75 minutes. To address this, we propose fault-tolerance strategies that recache data lost from failed nodes, using a hash ring technique to balance the recached data across the distributed node-local caches and thereby reduce reliance on the parallel file system (PFS). Extensive evaluations on Frontier showed that the hash ring-based recaching approach reduced training time by approximately 25% compared with an approach that redirects I/O to the PFS after node failures, and demonstrated effective load balancing of training data across nodes.
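The hash ring technique mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; the `HashRing` class, its `vnodes` parameter, and the sample-ID keys are illustrative assumptions showing why consistent hashing spreads a failed node's cached data evenly across the surviving nodes instead of redirecting it all to the PFS or to a single node.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string key to a point on the ring via MD5 (illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent hash ring mapping sample IDs to cache nodes.

    When a node fails, only the samples it held are reassigned, and
    virtual nodes spread them across the surviving nodes.
    """

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes  # virtual nodes per physical node (assumed value)
        self.ring = []        # sorted list of (ring point, node name)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        # Insert `vnodes` points for this node, keeping the ring sorted.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Drop all of the failed node's points; other keys keep their owners.
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def lookup(self, sample_id):
        # Owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect(self.ring, (_hash(sample_id), ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]
```

On a node failure, removing the node and re-running `lookup` for only the samples it cached yields the recache targets; samples held by surviving nodes keep their placement.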
| Original language | English |
|---|---|
| Title of host publication | Proceedings of SC 2024-W |
| Subtitle of host publication | Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 1349-1357 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798350355543 |
| DOIs | |
| State | Published - 2024 |
| Event | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 - Atlanta, United States |
| Duration | Nov 17 2024 → Nov 22 2024 |
Publication series
| Name | Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
|---|---|
Conference
| Conference | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 |
|---|---|
| Country/Territory | United States |
| City | Atlanta |
| Period | 11/17/24 → 11/22/24 |
Funding
This work was supported in part by the Korea Institute of Science and Technology Information (Grant No. K24L2M1C1) and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2024-00416666). This research used resources of the Oak Ridge Leadership Computing Facility, located at the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725. The authors would also like to thank Professor Xubin He from Temple University for his valuable insights and contributions to discussions related to this paper. Youngjae Kim is the corresponding author.
Keywords
- Distributed Deep Learning
- Fault Tolerance
- HPC
- NVMe Cache