TY - JOUR
T1 - Correlated set coordination in fault tolerant message logging protocols for many-core clusters
AU - Bouteiller, Aurelien
AU - Herault, Thomas
AU - Bosilca, George
AU - Dongarra, Jack J.
PY - 2013/2
Y1 - 2013/2
N2 - With our current expectation for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint/restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
AB - With our current expectation for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint/restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
KW - checkpoint/restart
KW - fault tolerance
KW - multicore clusters
UR - http://www.scopus.com/inward/record.url?scp=84874118584&partnerID=8YFLogxK
U2 - 10.1002/cpe.2859
DO - 10.1002/cpe.2859
M3 - Article
AN - SCOPUS:84874118584
SN - 1532-0626
VL - 25
SP - 572
EP - 585
JO - Concurrency and Computation: Practice and Experience
JF - Concurrency and Computation: Practice and Experience
IS - 4
ER -