TY - GEN
T1 - Supporting the development of soft-error resilient message passing applications using simulation
AU - Engelmann, Christian
AU - Naughton, Thomas
PY - 2016
Y1 - 2016
N2 - Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ∼2,325% for serial execution and ∼1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing vulnerability of data structures, masking, and error propagation. xSim is the very first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
AB - Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ∼2,325% for serial execution and ∼1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing vulnerability of data structures, masking, and error propagation. xSim is the very first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
KW - Fault injection
KW - Fault tolerance
KW - High-performance computing
KW - Parallel discrete event simulation
UR - http://www.scopus.com/inward/record.url?scp=85015857044&partnerID=8YFLogxK
U2 - 10.2316/P.2016.834-005
DO - 10.2316/P.2016.834-005
M3 - Conference contribution
AN - SCOPUS:85015857044
T3 - Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016
SP - 250
EP - 257
BT - Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016
PB - Acta Press
T2 - 13th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2016
Y2 - 15 February 2016 through 16 February 2016
ER -