TY - GEN
T1 - Correlated set coordination in fault tolerant message logging protocols
AU - Bouteiller, Aurelien
AU - Herault, Thomas
AU - Bosilca, George
AU - Dongarra, Jack J.
PY - 2011
Y1 - 2011
N2 - Based on our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
AB - Based on our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
UR - http://www.scopus.com/inward/record.url?scp=80052306159&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23397-5_6
DO - 10.1007/978-3-642-23397-5_6
M3 - Conference contribution
AN - SCOPUS:80052306159
SN - 9783642233968
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 51
EP - 64
BT - Euro-Par 2011 Parallel Processing - 17th International Conference, Proceedings
T2 - 17th International Conference on Parallel Processing, Euro-Par 2011
Y2 - 29 August 2011 through 2 September 2011
ER -