TY - GEN
T1 - Revisiting the double checkpointing algorithm
AU - Dongarra, Jack
AU - Herault, Thomas
AU - Robert, Yves
PY - 2013
Y1 - 2013
N2 - Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based on double checkpointing and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model they proposed in order to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols and compare them through comprehensive simulations.
AB - Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based on double checkpointing and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model they proposed in order to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols and compare them through comprehensive simulations.
KW - checkpoint
KW - in-memory checkpoint
KW - performance model
KW - scheduling
UR - http://www.scopus.com/inward/record.url?scp=84899758705&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2013.11
DO - 10.1109/IPDPSW.2013.11
M3 - Conference contribution
AN - SCOPUS:84899758705
SN - 9780769549798
T3 - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
SP - 706
EP - 715
BT - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
PB - IEEE Computer Society
T2 - 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
Y2 - 20 May 2013 through 24 May 2013
ER -