TY - GEN
T1 - Toward a performance/resilience tool for hardware/software co-design of high-performance computing systems
AU - Engelmann, Christian
AU - Naughton, Thomas
PY - 2013
Y1 - 2013
N2 - xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
AB - xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
KW - Fault injection
KW - High-performance computing
KW - Message passing interface
KW - Parallel discrete event simulation
UR - http://www.scopus.com/inward/record.url?scp=84893208518&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2013.114
DO - 10.1109/ICPP.2013.114
M3 - Conference contribution
AN - SCOPUS:84893208518
SN - 9780769551173
T3 - Proceedings of the International Conference on Parallel Processing
SP - 960
EP - 969
BT - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 42nd Annual International Conference on Parallel Processing, ICPP 2013
Y2 - 1 October 2013 through 4 October 2013
ER -