Abstract
Large-scale HPC systems demand extensive disk-based storage for data generated by HPC applications, necessitating scalable reliability, availability, and failure management. Failure data extracted from HPC storage offers valuable insights for preventing and managing failures: understanding storage robustness, guiding system design and deployment, and creating durable data protection schemes. This paper introduces a failure dataset from Alpine, the file system of OLCF's Summit supercomputer, encompassing 4000+ events over 2.75 years from 32000+ disks. Before the analysis, we describe Alpine's components and introduce IBM Spectrum Scale technology, then assess the collected data for failure distribution and burst correlations. We infer that proximity to enclosure fan modules heightens disk failure rates. Burst failure analysis also shows that one-third of failures occur in bursts, with 90% not spatially correlated and impacting multiple racks.
Original language | English |
---|---|
Title of host publication | Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 |
Publisher | Association for Computing Machinery |
Pages | 502-506 |
Number of pages | 5 |
ISBN (Electronic) | 9798400707858 |
State | Published - Nov 12 2023 |
Event | 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 - Denver, United States. Duration: Nov 12 2023 → Nov 17 2023 |
Publication series
Name | ACM International Conference Proceeding Series |
---|
Conference
Conference | 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/12/23 → 11/17/23 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.