Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

  • Tommaso Benacchio
  • Luca Bonaventura
  • Mirco Altenbernd
  • Chris D. Cantwell
  • Peter D. Düben
  • Mike Gillard
  • Luc Giraud
  • Dominik Göddeke
  • Erwan Raffin
  • Keita Teranishi
  • Nils Wedi

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

Original language: English
Pages (from-to): 285-311
Number of pages: 27
Journal: International Journal of High Performance Computing Applications
Volume: 35
Issue number: 4
State: Published - Jul 2021
Externally published: Yes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the ESCAPE-2 project, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 800897); the ESiWACE2 Centre of Excellence, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 823988); and the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC-2075 (Grant Agreement No. 390740016). We thank E Agullo, L Giraud, A Guermouche, J Roman, P Salas, and M Zounon for the permission to report the description and numerical testing of the interpolation-restart strategy in Sections 3.4 and 4, and Dr Christian Kühnlein for the data on solver share of wall-clock time in dynamical core runs. PDD gratefully acknowledges funding from the Royal Society for his University Research Fellowship.

Keywords

  • Fault-tolerant computing
  • application-level resilience
  • high-performance computing
  • iterative solvers
  • numerical weather prediction
