A diskless checkpointing algorithm for super-scale architectures applied to the Fast Fourier Transform

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

21 Scopus citations

Abstract

This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficiently. In this paper, we adapt the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault-tolerance. First, we discuss the method of diskless checkpointing, then we adapt this technique to super-scale architectures and finally we present results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault-tolerance.
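The abstract names diskless checkpointing without describing its mechanics. As background only, the following is a minimal sketch of the classic parity-based form of the technique, in which each process keeps its checkpoint in memory and one extra buffer holds the XOR of all checkpoints, so that a single lost checkpoint can be rebuilt from the survivors. It is not the adapted super-scale algorithm from the paper, and the function names (`xor_bytes`, `encode_parity`, `recover`) and the sample process states are hypothetical, chosen for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: list[bytes]) -> bytes:
    """Build the parity buffer: the XOR of every in-memory checkpoint."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing checkpoint from the parity and the survivors."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of four processes (equal length).
states = [b"proc0 st", b"proc1 st", b"proc2 st", b"proc3 st"]
parity = encode_parity(states)

# Simulate losing process 2 and rebuilding its checkpoint.
lost = 2
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = recover(survivors, parity)
assert recovered == states[lost]
```

A single parity buffer tolerates one failure per group of processes; no disk I/O is involved, which is the property that makes the approach relevant to diskless processor units.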

Original language: English
Title of host publication: Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 47-52
Number of pages: 6
ISBN (Electronic): 0769519849, 9780769519845
DOIs
State: Published - 2003
Event: International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003 - Seattle, United States
Duration: Jun 21 2003 → …

Publication series

Name: Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003

Conference

Conference: International Workshop on Challenges of Large Applications in Distributed Environments, CLADE 2003
Country/Territory: United States
City: Seattle
Period: 06/21/03 → …

Keywords

  • Application software
  • Bandwidth
  • Checkpointing
  • Computer architecture
  • Computer science
  • Concurrent computing
  • Delay
  • Distributed computing
  • Fast Fourier transforms
  • Fault tolerant systems
