Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

Saurabh Gupta, Devesh Tiwari, Christopher Jantzi, James Rogers, Don Maxwell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

59 Scopus citations

Abstract

As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While the temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between the spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have 'spatial locality' at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance in a dynamic, production-level HPC system.

Original language: English
Title of host publication: Proceedings - 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015
Publisher: IEEE Computer Society
Pages: 37-44
Number of pages: 8
ISBN (Electronic): 9781479986293
State: Published - Sep 14 2015
Externally published: Yes
Event: 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015 - Rio de Janeiro, Brazil
Duration: Jun 22 2015 → Jun 25 2015

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks
Volume: 2015-September

Conference

Conference: 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015
Country/Territory: Brazil
City: Rio de Janeiro
Period: 06/22/15 → 06/25/15

Keywords

  • Fault tolerance
  • High Performance Computing
  • Resilience
  • Spatial Locality
  • System Failures
