TY - GEN
T1 - Shrink or Substitute
T2 - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018
AU - Ashraf, Rizwan A.
AU - Hukerikar, Saurabh
AU - Engelmann, Christian
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/6/6
Y1 - 2018/6/6
N2 - Efficient utilization of today's high-performance computing (HPC) systems with complex software and hardware components requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions to the Message Passing Interface (MPI), called user-level failure mitigation (ULFM), for handling process failures without the need to discard the progress made by the application. We explore two alternative recovery strategies, which use ULFM along with application-driven in-memory checkpointing. In the first case, the application is recovered with only the surviving processes; in the second case, spares are used to replace the failed processes, such that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
AB - Efficient utilization of today's high-performance computing (HPC) systems with complex software and hardware components requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions to the Message Passing Interface (MPI), called user-level failure mitigation (ULFM), for handling process failures without the need to discard the progress made by the application. We explore two alternative recovery strategies, which use ULFM along with application-driven in-memory checkpointing. In the first case, the application is recovered with only the surviving processes; in the second case, spares are used to replace the failed processes, such that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
KW - Checkpoint/Restart
KW - Fault Tolerance
KW - Message Passing Interface
KW - Process Failures
UR - http://www.scopus.com/inward/record.url?scp=85048841107&partnerID=8YFLogxK
U2 - 10.1109/PDP2018.2018.00032
DO - 10.1109/PDP2018.2018.00032
M3 - Conference contribution
AN - SCOPUS:85048841107
T3 - Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018
SP - 178
EP - 185
BT - Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018
A2 - Kotenko, Igor
A2 - Merelli, Ivan
A2 - Lio, Pietro
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 March 2018 through 23 March 2018
ER -