An evaluation of User-Level Failure Mitigation support in MPI

Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.

Original language: English
Pages (from-to): 1171-1184
Number of pages: 14
Journal: Computing
Volume: 95
Issue number: 12
State: Published - Dec 2013
Externally published: Yes

Funding

Funder: National Science Foundation
Funder number: 1063019

    Keywords

    • Fault tolerance
    • MPI
    • User-level fault mitigation
