TY - GEN

T1 - A scalable checkpoint encoding algorithm for diskless checkpointing

AU - Chen, Zizhong

AU - Dongarra, Jack

PY - 2008

Y1 - 2008

N2 - Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.

AB - Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.

KW - Checkpoint

KW - Diskless checkpointing

KW - Fault tolerance

KW - High performance computing

KW - Parallel and distributed systems

KW - Reed-Solomon encoding

UR - http://www.scopus.com/inward/record.url?scp=58449086437&partnerID=8YFLogxK

U2 - 10.1109/HASE.2008.13

DO - 10.1109/HASE.2008.13

M3 - Conference contribution

AN - SCOPUS:58449086437

SN - 9780769534824

T3 - Proceedings of IEEE International Symposium on High Assurance Systems Engineering

SP - 71

EP - 79

BT - Proceedings - 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008

T2 - 11th IEEE High Assurance Systems Engineering Symposium, HASE 2008

Y2 - 3 December 2008 through 5 December 2008

ER -