Scalable and fault tolerant failure detection and consensus

Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
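The logarithmic growth in Gossip cycles mentioned in the abstract can be illustrated with a small simulation. The sketch below is not either of the paper's two algorithms; it is a generic push-gossip model, assuming that one alive process has already detected the failures and that every alive process sends its known list of failed processes to one uniformly random peer per cycle. The function name `gossip_cycles` and all of its parameters are hypothetical.

```python
import random


def gossip_cycles(n, failed, seed=0):
    """Count push-gossip cycles until every alive process knows `failed`.

    Illustrative model only (an assumption, not the paper's algorithm):
    one alive process starts with the full failed list, and in each cycle
    every alive process pushes what it knows to one random process.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]

    # Each alive process keeps its own view of which processes have failed.
    known = {p: set() for p in alive}
    if alive:
        known[alive[0]] = set(failed)  # hypothetical initial detector

    cycles = 0
    while any(known[p] != failed for p in alive):
        cycles += 1
        # Snapshot so information travels at most one hop per cycle.
        snapshot = {p: set(view) for p, view in known.items()}
        for p in alive:
            target = rng.randrange(n)
            # A message sent to a failed process (or to oneself) is wasted.
            if target in known and target != p:
                known[target] |= snapshot[p]
    return cycles


if __name__ == "__main__":
    for size in (16, 64, 256, 1024):
        print(size, gossip_cycles(size, failed={1, 2, 3}))
```

Because the number of informed processes can at most double per cycle, the cycle count in this model cannot grow more slowly than the base-2 logarithm of the system size. The simulation shows how close a simple randomized scheme stays to that bound.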

Original language: English
Title of host publication: Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450337953
DOIs
State: Published - Sep 21 2015
Event: 22nd European MPI Users' Group Meeting, EuroMPI 2015 - Bordeaux, France
Duration: Sep 21 2015 → Sep 23 2015

Publication series

Name: ACM International Conference Proceeding Series
Volume: 21-23-September-2015

Conference

Conference: 22nd European MPI Users' Group Meeting, EuroMPI 2015
Country/Territory: France
City: Bordeaux
Period: 09/21/15 → 09/23/15

Funding

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research. The author Amogh Katti is supported by the Felix Scholarship for his PhD project.

Funders | Funder number
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research

    Keywords

    • Consensus
    • Failure detection
    • Fault-tolerant MPI
    • Gossip protocols
    • User-level failure mitigation
