TY - GEN
T1 - A scalable checkpoint encoding algorithm for diskless checkpointing
AU - Chen, Zizhong
AU - Dongarra, Jack
PY - 2008
Y1 - 2008
N2 - Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
AB - Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
KW - Checkpoint
KW - Diskless checkpointing
KW - Fault tolerance
KW - High performance computing
KW - Parallel and distributed systems
KW - Reed-Solomon encoding
UR - http://www.scopus.com/inward/record.url?scp=58449086437&partnerID=8YFLogxK
U2 - 10.1109/HASE.2008.13
DO - 10.1109/HASE.2008.13
M3 - Conference contribution
AN - SCOPUS:58449086437
SN - 9780769534824
T3 - Proceedings of IEEE International Symposium on High Assurance Systems Engineering
SP - 71
EP - 79
BT - Proceedings - 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008
T2 - 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008
Y2 - 3 December 2008 through 5 December 2008
ER -