Abstract
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they account for much of the run time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers, appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
| Original language | English |
|---|---|
| Title of host publication | High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Revised Selected Papers |
| Editors | Osni Marques, Michel Dayde, Kengo Nakajima |
| Publisher | Springer Verlag |
| Pages | 124-132 |
| Number of pages | 9 |
| ISBN (Print) | 9783319173528 |
| State | Published - 2015 |
| Externally published | Yes |
| Event | 11th International Conference on High Performance Computing for Computational Science, VECPAR 2014 - Eugene, United States. Duration: Jun 30 2014 → Jul 3 2014 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 8969 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 11th International Conference on High Performance Computing for Computational Science, VECPAR 2014 |
|---|---|
| Country/Territory | United States |
| City | Eugene |
| Period | 06/30/14 → 07/03/14 |
Funding
We thank Mark Hoemmen from Sandia National Laboratories for his advice. This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357, and by the DOE National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy National Nuclear Security Administration under contract DE-AC04-94AL85000.
Keywords
- High performance computing
- Numerical solver
- Resilience