TY - JOUR
T1 - Process fault tolerance
T2 - Semantics, design and applications for high performance computing
AU - Fagg, Graham E.
AU - Gabriel, Edgar
AU - Chen, Zizhong
AU - Angskun, Thara
AU - Bosilca, George
AU - Pjesivac-Grbovic, Jelena
AU - Dongarra, Jack J.
PY - 2005/12
Y1 - 2005/12
N2 - With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the message passing interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using these semantics, as well as benchmark results with various applications. An example of a fault-tolerant parallel equation solver, performance results, and the time for recovering from a process failure are also detailed.
AB - With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the message passing interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using these semantics, as well as benchmark results with various applications. An example of a fault-tolerant parallel equation solver, performance results, and the time for recovering from a process failure are also detailed.
KW - Fault tolerance
KW - MPI and message passing
KW - Parallel computing
UR - http://www.scopus.com/inward/record.url?scp=27844508605&partnerID=8YFLogxK
U2 - 10.1177/1094342005056137
DO - 10.1177/1094342005056137
M3 - Article
AN - SCOPUS:27844508605
SN - 1094-3420
VL - 19
SP - 465
EP - 477
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 4
ER -