Supporting the development of soft-error resilient message passing applications using simulation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ∼2,325% for serial execution and ∼1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing vulnerability of data structures, masking, and error propagation. xSim is the very first simulation-based MPI performance tool that supports both the injection of process failures and bit flip faults.
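The matrix-multiply experiment described in the abstract can be illustrated with a small, self-contained sketch. It does not use xSim or MPI and does not reproduce the paper's injection interface; the function names (`flip_bit`, `matmul`, `corrupted_cells`), the 2x2 matrices, and the chosen bit positions are all assumptions made for illustration. It shows three outcomes the abstract names: a flip in an input matrix that propagates to several outputs, a flip that is masked, and a flip in the output matrix that stays local.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Naive dense matrix-matrix multiply on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def corrupted_cells(reference, observed):
    """List the (row, column) positions where `observed` differs from `reference`."""
    return [(i, j)
            for i, row in enumerate(reference)
            for j, expected in enumerate(row)
            if observed[i][j] != expected]


# Hypothetical inputs; `golden` is the fault-free result used for comparison.
a = [[1.0, 2.0], [0.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
golden = matmul(a, b)

# Propagation: flipping the lowest exponent bit of a[0][0] turns 1.0 into 0.5,
# which corrupts every element in row 0 of the product.
a_faulty = [row[:] for row in a]
a_faulty[0][0] = flip_bit(a_faulty[0][0], 52)
propagated = corrupted_cells(golden, matmul(a_faulty, b))

# Masking: flipping the sign bit of a[1][0] turns 0.0 into -0.0, and the
# product compares equal to the fault-free result.
a_masked = [row[:] for row in a]
a_masked[1][0] = flip_bit(a_masked[1][0], 63)
masked = corrupted_cells(golden, matmul(a_masked, b))

# Output-matrix fault: a flip injected after the multiply affects one cell only.
c_faulty = [row[:] for row in golden]
c_faulty[1][1] = flip_bit(c_faulty[1][1], 52)
localized = corrupted_cells(golden, c_faulty)

print(propagated, masked, localized)
```

Comparing against a fault-free run is only one way to classify outcomes; it is used here because it keeps the sketch short, not because the paper is known to do the same.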

Original language: English
Title of host publication: Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016
Publisher: Acta Press
Pages: 250-257
Number of pages: 8
ISBN (Electronic): 9780889869790
State: Published - 2016
Event: 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016 - Innsbruck, Austria
Duration: Feb 15 2016 to Feb 16 2016

Publication series

Name: Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016

Conference

Conference: 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016
Country/Territory: Austria
City: Innsbruck
Period: 02/15/16 to 02/16/16

Keywords

  • Fault injection
  • Fault tolerance
  • High-performance computing
  • Parallel discrete event simulation
