Self-healing network for scalable fault tolerant runtime environments

Thara Angskun, Graham E. Fagg, George Bosilca, Jelena Pješivac-Grbović, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding; Chapter; peer-reviewed

Abstract

Scalable and fault tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster than the original SFTP routing algorithms.

Original language: English
Title of host publication: Distributed and Parallel Systems
Subtitle of host publication: From Cluster to Grid Computing
Publisher: Springer US
Pages: 73-80
Number of pages: 8
ISBN (Print): 0387698574, 9780387698571
State: Published - 2007
Externally published: Yes

Keywords

  • Fault tolerance
  • Routing
  • Runtime Environment
  • Scalability
  • Self-healing
