Fault tolerance in an inner-outer solver: A GVR-enabled case study

Ziming Zheng, Andrew A. Chien, Keita Teranishi

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

3 Scopus citations

Abstract

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

Original language: English
Title of host publication: High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Revised Selected Papers
Editors: Osni Marques, Michel Dayde, Kengo Nakajima
Publisher: Springer Verlag
Pages: 124-132
Number of pages: 9
ISBN (Print): 9783319173528
DOIs
State: Published - 2015
Externally published: Yes
Event: 11th International Conference on High Performance Computing for Computational Science, VECPAR 2014 - Eugene, United States
Duration: Jun 30 2014 to Jul 3 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8969
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 11th International Conference on High Performance Computing for Computational Science, VECPAR 2014
Country/Territory: United States
City: Eugene
Period: 06/30/14 to 07/3/14

Funding

We thank Mark Hoemmen from Sandia National Laboratories for his advice. This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357, and by the DOE National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy National Nuclear Security Administration under contract DE-AC04-94AL85000.

Keywords

  • High performance computing
  • Numerical solver
  • Resilience
