Abstract
Disk failure data provides valuable insights for preventing failures, enhancing storage robustness, guiding system design and deployment, and ensuring reliable operations at data centers. This paper introduces two disk failure datasets collected from large-scale HPC production environments over the past five years, comprising over 5,000 failure records from more than 40,000 disks. We analyzed these datasets across multiple dimensions, including temporal, spatial, and relational trends, and performed a comprehensive reliability assessment. Our analysis yielded numerous observations and insights that influence various operational aspects of HPC storage systems. We believe this study offers a holistic understanding of disk failure trends likely to interest the HPC storage community.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2024-W |
Subtitle of host publication | Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 484-495 |
Number of pages | 12 |
ISBN (Electronic) | 9798350355543 |
State | Published - 2024 |
Event | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 - Atlanta, United States; Duration: Nov 17 2024 → Nov 22 2024 |
Publication series
Name | Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 |
---|---|
Country/Territory | United States |
City | Atlanta |
Period | 11/17/24 → 11/22/24 |
Funding
This research was sponsored by and used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility at the Oak Ridge National Laboratory supported by the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Keywords
- Cause effect analysis
- Failure data analysis
- HPC storage
- Hard disk drives
- Reliability
- Summit
- Supercomputer