Post-failure recovery of MPI communication capability: Design and rationale

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra

Research output: Contribution to journal › Article › peer-review

144 Scopus citations

Abstract

As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.

Original language: English
Pages (from-to): 244-254
Number of pages: 11
Journal: International Journal of High Performance Computing Applications
Volume: 27
Issue number: 3
State: Published - Aug 2013
Externally published: Yes

Funding

This document describes combined research conducted under the following contracts: NSF-0904952 and NSF-1144042 between the U.S. National Science Foundation and the University of Tennessee, and DE-FC02-11ER26059 supported by the U.S. Department of Energy.

Funder: Funder number
National Science Foundation: 0904952
U.S. Department of Energy
University of Tennessee: DE-FC02-11ER26059

    Keywords

    • Fault tolerance
    • message passing interface
    • user-level failure mitigation
