
Plan B: Interruption of ongoing MPI operations to support failure recovery

  • Aurelien Bouteiller
  • George Bosilca
  • Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation, of the Revoke MPI operation. The purpose of the Revoke operation is to propagate failure knowledge and to interrupt ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency and does not introduce system noise outside of failure recovery periods.
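
To make the idea of a reliable broadcast over a BMG overlay concrete, the following is a minimal Python simulation, not the paper's implementation. It assumes the common description of a binomial graph in which rank i is linked to ranks (i ± 2^k) mod n for every power of two smaller than n, and it models the broadcast as simple flooding: each live process that first learns of the revocation forwards the notice to all of its overlay neighbors. The names `bmg_neighbors` and `flood_revoke` are hypothetical and chosen only for this illustration.

```python
from collections import deque


def bmg_neighbors(rank, size):
    """Overlay neighbors of `rank` in a binomial graph of `size` nodes.

    Assumed definition: links to (rank +/- 2^k) mod size for every
    power of two 2^k that is smaller than size.
    """
    neighbors = set()
    step = 1
    while step < size:
        neighbors.add((rank + step) % size)
        neighbors.add((rank - step) % size)
        step *= 2
    neighbors.discard(rank)  # never list a node as its own neighbor
    return sorted(neighbors)


def flood_revoke(size, initiator, failed=frozenset()):
    """Simulate flooding a revoke notice; return the live ranks reached.

    Failed ranks neither receive nor forward the notice. This is a
    sequential model of the message pattern, not an MPI program.
    """
    if initiator in failed:
        return set()
    revoked = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        for peer in bmg_neighbors(current, size):
            if peer in failed or peer in revoked:
                continue  # dead peer, or peer already notified
            revoked.add(peer)
            queue.append(peer)
    return revoked
```

For example, `flood_revoke(16, 0, failed={1, 2})` explores the overlay from rank 0 while skipping the two failed ranks. Because every node has a logarithmic number of redundant links, the notice can route around failed neighbors in this model.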

Original language: English
Title of host publication: Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450337953
State: Published - Sep 21 2015
Externally published: Yes
Event: 22nd European MPI Users' Group Meeting, EuroMPI 2015 - Bordeaux, France
Duration: Sep 21 2015 to Sep 23 2015

Publication series

Name: ACM International Conference Proceeding Series
Volume: 21-23-September-2015

Conference

Conference: 22nd European MPI Users' Group Meeting, EuroMPI 2015
Country/Territory: France
City: Bordeaux
Period: 09/21/15 to 09/23/15

Funding

This work is partially supported by the CREST project of the Japan Science and Technology Agency (JST), and by NSF award #1339820.
