TY - GEN
T1 - Failure Detection and Propagation in HPC systems
AU - Bosilca, George
AU - Bouteiller, Aurelien
AU - Guermouche, Amina
AU - Herault, Thomas
AU - Robert, Yves
AU - Sens, Pierre
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage is using a non-uniform variant of a reliable broadcast over a circulant graph overlay network, and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan ORNL supercomputer, show that the algorithm performs extremely well, and exhibits all the desired properties of an Exascale-ready algorithm.
AB - Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage is using a non-uniform variant of a reliable broadcast over a circulant graph overlay network, and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan ORNL supercomputer, show that the algorithm performs extremely well, and exhibits all the desired properties of an Exascale-ready algorithm.
KW - Failure Detection
KW - Fault-Tolerance
KW - MPI
UR - http://www.scopus.com/inward/record.url?scp=85017281046&partnerID=8YFLogxK
U2 - 10.1109/SC.2016.26
DO - 10.1109/SC.2016.26
M3 - Conference contribution
AN - SCOPUS:85017281046
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
SP - 312
EP - 322
BT - Proceedings of SC 2016
PB - IEEE Computer Society
T2 - 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Y2 - 13 November 2016 through 18 November 2016
ER -