Constructing resiliant communication infrastructure for runtime environments

George Bosilca, Camille Coti, Thomas Herault, Pierre Lemarinier, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

High-performance computing platforms are becoming larger, leading to scalability and fault-tolerance issues for both applications and the runtime environments (RTEs) designed to run on such machines. After being deployed, usually along a spanning tree, an RTE needs to build its own communication infrastructure to manage and monitor the tasks of parallel applications. Previous work has demonstrated that the Binomial Graph topology (BMG) is a good candidate communication infrastructure for supporting a scalable and fault-tolerant RTE. In this paper, we present and analyze a self-stabilizing algorithm that transforms the underlying communication infrastructure provided by the launching service into a BMG and maintains it in spite of failures. We demonstrate that this algorithm is scalable, tolerates transient failures, and adapts itself to topology changes.

Original language: English
Title of host publication: Parallel Computing
Subtitle of host publication: From Multicores and GPU's to Petascale
Publisher: IOS Press BV
Pages: 441-451
Number of pages: 11
ISBN (Print): 9781607505297
State: Published - 2010
Externally published: Yes

Publication series

Name: Advances in Parallel Computing
Volume: 19
ISSN (Print): 0927-5452

Keywords

  • Self-stabilization
  • binomial graph
  • scalability
