A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems

Lipeng Wan, Feiyi Wang, Sarp Oral, Devesh Tiwari, Sudharshan S. Vazhkudai, Qing Cao

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

5 Scopus citations

Abstract

The increasing data demands from high-performance computing applications significantly accelerate the capacity, capability and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. A wide array of decision points must be balanced in designing such systems. We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning based on a detailed investigation of the anatomy and field failure data analysis of extreme-scale storage systems. We consider the component failure characteristics and their cost and impact at the system level simultaneously. We build a tool to evaluate different provisioning schemes, and the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks, and warrant careful consideration in the overall provisioning process.

Original language: English
Title of host publication: Proceedings of SC 2015
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781450337236
State: Published - Nov 15 2015
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States
Duration: Nov 15 2015 - Nov 20 2015

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 15-20-November-2015
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015
Country/Territory: United States
City: Austin
Period: 11/15/15 - 11/20/15

Funding

We would like to thank the reviewers for their comments. This work was supported in part by the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE (under contract No. DE-AC05-00OR22725). The work was also supported by a JDRD grant from the Science Alliance of the University of Tennessee and by National Science Foundation grant 0953238.
