Abstract
The increasing data demands of high-performance computing applications drive rapid growth in the capacity, capability, and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. Designing such systems requires balancing a wide array of decision points. We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning, based on a detailed investigation of the anatomy of extreme-scale storage systems and analysis of their field failure data. We consider component failure characteristics together with their cost and impact at the system level. We build a tool to evaluate different provisioning schemes, and the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks and therefore warrant careful consideration in the overall provisioning process.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2015 |
Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781450337236 |
DOIs | |
State | Published - Nov 15 2015 |
Event | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States<br>Duration: Nov 15 2015 → Nov 20 2015 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
Volume | 15-20-November-2015 |
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 |
---|---|
Country/Territory | United States |
City | Austin |
Period | 11/15/15 → 11/20/15 |
Funding
We would like to thank the reviewers for their comments. This work was supported in part by the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE under contract No. DE-AC05-00OR22725. The work was also supported by a JDRD grant from the Science Alliance of the University of Tennessee and by National Science Foundation grant 0953238.