TY - GEN
T1 - Scalable, fault tolerant membership for MPI tasks on HPC systems
AU - Varma, Jyothish
AU - Wang, Chao
AU - Mueller, Frank
AU - Engelmann, Christian
AU - Scott, Stephen L.
PY - 2006
Y1 - 2006
N2 - Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit Ethernet, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
KW - Group communication
KW - High-performance computing
KW - Message passing
KW - Node failure
KW - Reliability
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=34547440282&partnerID=8YFLogxK
U2 - 10.1145/1183401.1183433
DO - 10.1145/1183401.1183433
M3 - Conference contribution
AN - SCOPUS:34547440282
SN - 1595932828
SN - 9781595932822
T3 - Proceedings of the International Conference on Supercomputing
SP - 219
EP - 228
BT - Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006
T2 - 20th Annual International Conference on Supercomputing, ICS 2006
Y2 - 28 June 2006 through 1 July 2006
ER -