Scalable fault tolerant protocol for parallel runtime environments

Thara Angskun, Graham E. Fagg, George Bosilca, Jelena Pješivac-Grbović, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

The number of processors embedded in high performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Parallel runtime environments have to support and adapt to the underlying libraries and hardware, which requires a high degree of scalability in dynamic environments. This paper presents the design of a scalable and fault tolerant protocol for supporting parallel runtime environment communications. The protocol is designed to support transmission of messages across multiple nodes within a self-healing topology to protect against recursive node and process failures. A formal protocol verification has validated the protocol for both the normal and failure cases. We have implemented multiple routing algorithms for the protocol and concluded that the variant rule-based routing algorithm yields the best overall results for damaged and incomplete topologies.

Original language: English
Title of host publication: Recent Advances in Parallel Virtual Machine and Message Passing Interface - 13th European PVM/MPI User's Group Meeting, Proceedings
Publisher: Springer Verlag
Pages: 141-149
Number of pages: 9
ISBN (Print): 354039110X, 9783540391104
DOIs
State: Published - 2006
Externally published: Yes
Event: 13th European PVM/MPI User's Group Meeting - Bonn, Germany
Duration: Sep 17 2006 to Sep 20 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4192 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 13th European PVM/MPI User's Group Meeting
Country/Territory: Germany
City: Bonn
Period: 09/17/06 to 09/20/06
