TY - GEN
T1 - Hybrid checkpointing for MPI jobs in HPC environments
AU - Wang, Chao
AU - Mueller, Frank
AU - Engelmann, Christian
AU - Scott, Stephen L.
PY - 2010
Y1 - 2010
N2 - As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts. We further derive qualitative results indicating an optimal balance between full/incremental checkpoints of our novel approach at a ratio of 1:9, which outperforms both always-full and always-incremental checkpointing.
AB - As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts. We further derive qualitative results indicating an optimal balance between full/incremental checkpoints of our novel approach at a ratio of 1:9, which outperforms both always-full and always-incremental checkpointing.
KW - Checkpoint/restart
KW - Fault tolerance
KW - High-performance computing
UR - http://www.scopus.com/inward/record.url?scp=79951790076&partnerID=8YFLogxK
U2 - 10.1109/ICPADS.2010.48
DO - 10.1109/ICPADS.2010.48
M3 - Conference contribution
AN - SCOPUS:79951790076
SN - 9780769543079
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 524
EP - 533
BT - Proceedings - 16th International Conference on Parallel and Distributed Systems, ICPADS 2010
T2 - 16th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2010
Y2 - 8 December 2010 through 10 December 2010
ER -