TY - BOOK
T1 - A Report on Simulation-Driven Reliability and Failure Analysis of Large-Scale Storage Systems
AU - Wan, Lipeng
AU - Wang, Feiyi
AU - Oral, H. Sarp
AU - Vazhkudai, Sudharshan S.
AU - Cao, Qing
PY - 2014
Y1 - 2014
N2 - High-performance computing (HPC) storage systems provide data availability and reliability using various hardware and software fault tolerance techniques. Usually, reliability and availability are calculated at the subsystem or component level using limited metrics such as, mean time to failure (MTTF) or mean time to data loss (MTTDL). This often means settling on simple and disconnected failure models (such as exponential failure rate) to achieve tractable and close-formed solutions. However, such models have been shown to be insufficient in assessing end-to-end storage system reliability and availability. We propose a generic simulation framework aimed at analyzing the reliability and availability of storage systems at scale, and investigating what-if scenarios. The framework is designed for an end-to-end storage system, accommodating the various components and subsystems, their interconnections, failure patterns and propagation, and performs dependency analysis to capture a wide-range of failure cases. We evaluate the framework against a large-scale storage system that is in production and analyze its failure projections toward and beyond the end of lifecycle. We also examine the potential operational impact by studying how different types of components affect the overall system reliability and availability, and present the preliminary results
AB - High-performance computing (HPC) storage systems provide data availability and reliability using various hardware and software fault tolerance techniques. Usually, reliability and availability are calculated at the subsystem or component level using limited metrics such as, mean time to failure (MTTF) or mean time to data loss (MTTDL). This often means settling on simple and disconnected failure models (such as exponential failure rate) to achieve tractable and close-formed solutions. However, such models have been shown to be insufficient in assessing end-to-end storage system reliability and availability. We propose a generic simulation framework aimed at analyzing the reliability and availability of storage systems at scale, and investigating what-if scenarios. The framework is designed for an end-to-end storage system, accommodating the various components and subsystems, their interconnections, failure patterns and propagation, and performs dependency analysis to capture a wide-range of failure cases. We evaluate the framework against a large-scale storage system that is in production and analyze its failure projections toward and beyond the end of lifecycle. We also examine the potential operational impact by studying how different types of components affect the overall system reliability and availability, and present the preliminary results
U2 - 10.2172/1185665
DO - 10.2172/1185665
M3 - Commissioned report
BT - A Report on Simulation-Driven Reliability and Failure Analysis of Large-Scale Storage Systems
CY - United States
ER -