Failure Detection and Propagation in HPC systems

George Bosilca, Aurelien Bouteiller, Amina Guermouche, Thomas Herault, Yves Robert, Pierre Sens, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector that maintains and distributes the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by having each node observed by a single other node, which keeps the detector unobtrusive. The propagation stage uses a non-uniform variant of reliable broadcast over a circulant graph overlay network and guarantees logarithmic fault propagation. Extensive simulations, together with experiments on the Titan supercomputer at ORNL, show that the algorithm performs extremely well and exhibits all the desired properties of an Exascale-ready algorithm.
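The two overlays named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each node is watched by its ring predecessor and that the circulant graph uses jumps of ±2^k; the abstract specifies neither. It also replaces the paper's non-uniform reliable broadcast with a plain flood. All function names are hypothetical.

```python
from collections import deque


def ring_observer(node: int, n: int) -> int:
    """Single observer of `node` on the ring (assumption: its predecessor)."""
    return (node - 1) % n


def circulant_neighbors(node: int, n: int) -> list[int]:
    """Neighbors of `node` at offsets +/- 2^k mod n (assumed jump set)."""
    neighbors = set()
    jump = 1
    while jump < n:
        neighbors.add((node + jump) % n)
        neighbors.add((node - jump) % n)
        jump *= 2
    neighbors.discard(node)
    return sorted(neighbors)


def propagation_rounds(source: int, n: int,
                       dead: frozenset[int] = frozenset()) -> int:
    """Flood a failure notice from `source` over the circulant overlay.

    Dead nodes neither receive nor forward. Returns the number of hops
    needed before every alive node has been informed.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in circulant_neighbors(u, n):
            if v in dead or v in dist:
                continue
            dist[v] = dist[u] + 1
            queue.append(v)
    if len(dist) != n - len(dead):
        raise RuntimeError("overlay disconnected: some alive nodes unreachable")
    return max(dist.values())


if __name__ == "__main__":
    n, failed = 1024, 5
    reporter = ring_observer(failed, n)  # the node that notices the failure
    rounds = propagation_rounds(reporter, n, frozenset({failed}))
    print(f"node {reporter} reports failure of {failed}; "
          f"all alive nodes informed after {rounds} hops")
```

Under these assumptions, each node has on the order of 2·log2(n) neighbors, so the number of hops grows logarithmically with the number of nodes while the ring keeps the steady-state monitoring cost at one observer per node.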

Original language: English
Title of host publication: Proceedings of SC 2016
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
Pages: 312-322
Number of pages: 11
ISBN (Electronic): 9781467388153
State: Published - Jul 2 2016
Event: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016 - Salt Lake City, United States
Duration: Nov 13 2016 → Nov 18 2016

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 0
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Country/Territory: United States
City: Salt Lake City
Period: 11/13/16 → 11/18/16

Keywords

  • Failure Detection
  • Fault-Tolerance
  • MPI
