Scalable, fault tolerant membership for MPI tasks on HPC systems

Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Scopus citations

Abstract

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times in the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.

Original language: English
Title of host publication: Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006
Pages: 219-228
Number of pages: 10
State: Published - 2006
Event: 20th Annual International Conference on Supercomputing, ICS 2006 - Cairns, Queensland, Australia
Duration: Jun 28 2006 - Jul 1 2006

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 20th Annual International Conference on Supercomputing, ICS 2006
Country/Territory: Australia
City: Cairns, Queensland
Period: 06/28/06 - 07/1/06

Keywords

  • Group communication
  • High-performance computing
  • Message passing
  • Node failure
  • Reliability
  • Scalability
