TY - GEN
T1 - A diskless checkpointing algorithm for super-scale architectures applied to the Fast Fourier Transform
AU - Engelmann, Christian
AU - Geist, Al
N1 - Publisher Copyright: © 2003 IEEE.
PY - 2003
Y1 - 2003
AB - This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficiently. In this paper, we adapt the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault-tolerance. First, we discuss the method of diskless checkpointing, then we adapt this technique to super-scale architectures and finally we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault-tolerance.
KW - Application software
KW - Bandwidth
KW - Checkpointing
KW - Computer architecture
KW - Computer science
KW - Concurrent computing
KW - Delay
KW - Distributed computing
KW - Fast Fourier transforms
KW - Fault tolerant systems
UR - http://www.scopus.com/inward/record.url?scp=84943545232&partnerID=8YFLogxK
U2 - 10.1109/CLADE.2003.1209999
DO - 10.1109/CLADE.2003.1209999
M3 - Conference contribution
AN - SCOPUS:84943545232
T3 - Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003
SP - 47
EP - 52
BT - Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003
Y2 - 21 June 2003
ER -