Selective protection for sparse iterative solvers to reduce the resilience overhead

Hongyang Sun, Ana Gainaru, Manu Shantharam, Padma Raghavan

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Scopus citations

Abstract

The increasing scale and complexity of today's high-performance computing (HPC) systems demand a renewed focus on enhancing the resilience of long-running scientific applications in the presence of faults. Many of these applications are iterative in nature as they operate on sparse matrices that concern the simulation of partial differential equations (PDEs) which numerically capture the physical properties on discretized spatial domains. While these applications currently benefit from many application-agnostic resilience techniques at the system level, such as checkpointing and replication, there is significant overhead in deploying these techniques. In this paper, we seek to develop application-aware resilience techniques that leverage an iterative application's intrinsic resiliency to faults and selectively protect certain elements, thereby reducing the resilience overhead. Specifically, we investigate the impact of soft errors on the widely used Preconditioned Conjugate Gradient (PCG) method, whose reliability depends heavily on the error propagation through the sparse matrix-vector multiplication (SpMV) operation. By characterizing the performance of PCG in correlation with a numerical property of the underlying sparse matrix, we propose a selective protection scheme that protects only certain critical elements of the operation based on an analytical model. An experimental evaluation using 20 sparse matrices from the SuiteSparse Matrix Collection shows that our proposed scheme is able to reduce the resilience overhead by as much as 70.2% and an average of 32.6% compared to the baseline techniques with full-protection or zero-protection.
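As an illustration of the idea in the abstract, the following is a minimal sketch of PCG in which only selected rows of the SpMV are protected. It is not the authors' implementation. The abstract does not give the analytical model or name the numerical property used to choose critical elements, so the `protected` set here is simply supplied by the caller. The protection mechanism (recomputing a row and comparing), the Jacobi preconditioner, the fault-injection tuple `fault_at`, and the test matrix from `laplacian_1d` are all assumptions made for this example.

```python
import math


def laplacian_1d(n):
    """Return (CSR triple, diagonal) for the n x n tridiagonal matrix [-1, 2, -1]."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return (indptr, indices, data), [2.0] * n


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def row_dot(csr, x, i):
    """Dot product of row i of a CSR matrix with the vector x."""
    indptr, indices, data = csr
    return sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))


def spmv(csr, x, protected=(), fault=None):
    """Compute y = A x. `fault` is a hypothetical (row, delta) soft error.

    Rows in `protected` are recomputed and compared (assumed mechanism);
    the recomputation itself is assumed to be fault-free.
    """
    y = []
    for i in range(len(csr[0]) - 1):
        value = row_dot(csr, x, i)
        if fault is not None and fault[0] == i:
            value += fault[1]  # inject the error into this row's result
        if i in protected:
            check = row_dot(csr, x, i)
            if check != value:  # mismatch detected: keep the recomputed value
                value = check
        y.append(value)
    return y


def pcg(csr, b, diag, protected=(), fault_at=None, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned CG. `fault_at` is (iteration, row, delta) or None."""
    n = len(b)
    b_norm = math.sqrt(dot(b, b))
    x = [0.0] * n
    r = list(b)
    z = [r[i] / diag[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        fault = fault_at[1:] if fault_at is not None and fault_at[0] == it else None
        q = spmv(csr, p, protected, fault)
        alpha = rz / dot(p, q)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * q[i] for i in range(n)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it + 1
        z = [r[i] / diag[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    A, diag = laplacian_1d(8)
    b = [1.0] * 8
    _, clean_iters = pcg(A, b, diag)
    # Row 3 is protected, so the error injected into it is detected and corrected.
    _, prot_iters = pcg(A, b, diag, protected={3}, fault_at=(1, 3, 1.0))
    print("fault-free iterations:", clean_iters)
    print("protected run with fault:", prot_iters)
```

In this sketch, protecting a row costs one extra row dot product per SpMV, so the overhead grows with the size of `protected`. That is the trade-off a selective scheme aims to tune, in contrast to protecting every row or none.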

Original language: English
Title of host publication: Proceedings - 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020
Publisher: IEEE Computer Society
Pages: 141-148
Number of pages: 8
ISBN (Electronic): 9781728199245
State: Published - Sep 2020
Externally published: Yes
Event: 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020 - Virtual, Porto, Portugal
Duration: Sep 8 2020 to Sep 11 2020

Publication series

Name: Proceedings - Symposium on Computer Architecture and High Performance Computing
Volume: 2020-September
ISSN (Print): 1550-6533

Conference

Conference: 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020
Country/Territory: Portugal
City: Virtual, Porto
Period: 09/8/20 to 09/11/20

Keywords

  • Iterative solvers
  • Preconditioned conjugate gradient
  • Resilience
  • Selective protection
  • Soft errors
