High performance dense linear system solver with resilience to multiple soft errors

Peng Du, Piotr Luszczek, Jack Dongarra

Research output: Contribution to journal › Conference article › peer-review

19 Scopus citations

Abstract

In the multi-petaflop era of supercomputing, the number of computing cores is growing exponentially. However, as integrated circuit technology scales below 65 nm, the critical charge required to flip a gate or a memory cell has been reduced, causing a higher rate of soft errors from cosmic radiation. Soft errors affect computers by producing silent data corruption, which is hard to detect and correct. Current research on soft error resilience for dense linear solvers offers limited capability on large-scale computing systems and suffers from both soft errors and round-off errors due to floating-point arithmetic. This work proposes a fault-tolerant algorithm that recovers the solution of a dense linear system Ax = b from multiple spatial and temporal soft errors. Experimental results on a Cray XT5 supercomputer confirm the scalable performance of the proposed resilience functionality and negligible overhead in solution recovery.
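The abstract does not describe the algorithm itself, so the following is only a generic illustration of checksum-based detection and correction of silent data corruption, not the authors' method. It is a minimal sketch that assumes a single corrupted data entry and uses a classic full-checksum matrix encoding. The function names `add_checksums` and `locate_and_fix` and the tolerance `tol` are hypothetical, introduced here for illustration. The tolerance reflects the round-off concern raised in the abstract: with floating-point data, checksums rarely match exactly, so a threshold is needed to separate round-off from a genuine soft error.

```python
def add_checksums(a):
    """Return a copy of matrix `a` with a row-sum column and a column-sum row appended."""
    rows = [list(row) + [sum(row)] for row in a]
    rows.append([sum(col) for col in zip(*rows)])
    return rows


def locate_and_fix(f, tol=1e-9):
    """Detect and correct at most one corrupted data entry of the encoded matrix `f`.

    Corrects `f` in place and returns the (row, column) index of the repaired
    entry, or None when all checksums agree within `tol`.
    This sketch assumes the checksum entries themselves are not corrupted.
    """
    n = len(f) - 1       # number of data rows
    m = len(f[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n) if abs(sum(f[i][:m]) - f[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(f[i][j] for i in range(n)) - f[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the intact entries of that row.
    f[i][j] = f[i][m] - sum(f[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
f = add_checksums(a)
f[1][2] += 8.0                 # simulate a soft error in one data entry
print(locate_and_fix(f), f[1][2])   # prints "(1, 2) 3.0"
```

The intersection of the mismatching row and column locates the corrupted entry, and the row checksum restores its value. Handling multiple errors spread across space and time, as the paper targets, requires more than this single-error scheme.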

Original language: English
Pages (from-to): 216-225
Number of pages: 10
Journal: Procedia Computer Science
Volume: 9
DOIs
State: Published - 2012
Externally published: Yes
Event: 12th Annual International Conference on Computational Science, ICCS 2012 - Omaha, NB, United States
Duration: Jun 4 2012 → Jun 6 2012

Funding

This research used resources of the Oak Ridge Leadership Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • Dense linear system solver
  • Fault tolerance
  • Multiple errors
  • Soft error
