An evaluation of user-level failure mitigation support in MPI

Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

65 Scopus citations

Abstract

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact aspects of the User-Level Failure Mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.

Original language: English
Title of host publication: Recent Advances in the Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Proceedings
Pages: 193-203
Number of pages: 11
State: Published - 2012
Externally published: Yes
Event: 19th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface, EuroMPI 2012 - Vienna, Austria
Duration: Sep 23 2012 to Sep 26 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7490 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface, EuroMPI 2012
Country/Territory: Austria
City: Vienna
Period: 09/23/12 to 09/26/12
