TY - JOUR
T1 - An evaluation of User-Level Failure Mitigation support in MPI
AU - Bland, Wesley
AU - Bouteiller, Aurelien
AU - Herault, Thomas
AU - Hursey, Joshua
AU - Bosilca, George
AU - Dongarra, Jack J.
PY - 2013/12
Y1 - 2013/12
N2 - As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
AB - As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
KW - Fault tolerance
KW - MPI
KW - User-level fault mitigation
UR - http://www.scopus.com/inward/record.url?scp=84889101867&partnerID=8YFLogxK
U2 - 10.1007/s00607-013-0331-3
DO - 10.1007/s00607-013-0331-3
M3 - Article
AN - SCOPUS:84889101867
SN - 0010-485X
VL - 95
SP - 1171
EP - 1184
JO - Computing
JF - Computing
IS - 12
ER -