Fault tolerance in message passing and in action

Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

This talk will describe an implementation of MPI that extends the message-passing model to allow recovery in the presence of a faulty process. Our implementation allows a user to catch the fault and then carry out a recovery. We will also touch on issues related to using diskless checkpointing to enable effective recovery of an application in the presence of a process fault.
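The abstract does not say how the diskless checkpoints are encoded. One common scheme from the diskless-checkpointing literature keeps each process's checkpoint in memory and stores the bitwise XOR (parity) of all of them on an extra process, so any single lost checkpoint can be rebuilt from the survivors without touching disk. The sketch below simulates that scheme inside one Python process, with byte strings standing in for per-process state. The function names (`encode_parity`, `recover`) are hypothetical, and this is an illustration of the general technique, not the implementation described in the talk.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("checkpoint buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """Parity buffer that a spare 'checkpoint process' would keep in memory."""
    return reduce(xor_bytes, checkpoints)


def recover(checkpoints: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint (marked None) from survivors and parity."""
    missing = [i for i, c in enumerate(checkpoints) if c is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can recover exactly one lost checkpoint")
    survivors = [c for c in checkpoints if c is not None]
    # parity ^ (all surviving checkpoints) leaves only the lost one.
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    state = [b"rank0-AA", b"rank1-BB", b"rank2-CC"]
    parity = encode_parity(state)
    # Simulate the failure of rank 1, whose in-memory checkpoint is lost.
    rebuilt = recover([state[0], None, state[2]], parity)
    print(rebuilt)
```

The trade-off in this design is that a single parity buffer tolerates only one simultaneous failure; tolerating more requires additional encoding (for example, several independent checksums), at the cost of more memory and communication per checkpoint.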

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Editors: Dieter Kranzlmuller, Peter Kacsuk, Jack Dongarra
Publisher: Springer Verlag
Pages: 6
Number of pages: 1
ISBN (Print): 3540231633
State: Published - 2004
Event: 11th European Conference on Parallel Virtual Machine and Message Passing Interface Users Group Meeting, PVM/MPI 2004 - Budapest, Hungary
Duration: Sep 19 2004 to Sep 22 2004

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3241
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 11th European Conference on Parallel Virtual Machine and Message Passing Interface Users Group Meeting, PVM/MPI 2004
Country/Territory: Hungary
City: Budapest
Period: 09/19/04 to 09/22/04

Bibliographical note

Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 2004.
