Soft error resilient QR factorization for hybrid system with GPGPU

Peng Du, Piotr Luszczek, Stan Tomov, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

General-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than it was when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
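The abstract mentions reducing an upper Hessenberg matrix to upper triangular form with Givens rotations. The following is a minimal serial sketch of that reduction in plain Python, offered only as an illustration of the technique. It is not the authors' GPGPU implementation, and the function names `givens` and `hessenberg_to_triangular` are hypothetical, chosen for this example.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def hessenberg_to_triangular(H):
    """Reduce a square upper Hessenberg matrix (list of rows) to upper triangular form.

    Returns (R, rotations). The input H is not modified.
    """
    n = len(H)
    R = [row[:] for row in H]
    rotations = []
    for k in range(n - 1):
        # Choose the rotation that annihilates the subdiagonal entry R[k+1][k].
        c, s = givens(R[k][k], R[k + 1][k])
        rotations.append((c, s))
        # Rotate rows k and k+1; entries left of column k are already zero in both rows.
        for j in range(k, n):
            top, bottom = R[k][j], R[k + 1][j]
            R[k][j] = c * top + s * bottom
            R[k + 1][j] = -s * top + c * bottom
        R[k + 1][k] = 0.0  # clear rounding residue on the annihilated entry
    return R, rotations


if __name__ == "__main__":
    H = [[4.0, 1.0, 2.0],
         [3.0, 2.0, 1.0],
         [0.0, 1.0, 3.0]]
    R, _ = hessenberg_to_triangular(H)
    for row in R:
        print(row)
```

Each step touches only two adjacent rows, so an n-by-n Hessenberg matrix needs n - 1 rotations. Because the rotations are orthogonal, the Euclidean norm of every column is preserved, which gives a simple sanity check on the result.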

Original language: English
Title of host publication: ScalA'11 - Proceedings of the 2011 ACM Workshop on Scalable Algorithms for Large-Scale Systems, Co-located with SC'11
Pages: 11-14
Number of pages: 4
State: Published - 2011
Externally published: Yes
Event: 2011 ACM Workshop on Scalable Algorithms for Large-Scale Systems, ScalA'11, Co-located with SC'11 - Seattle, WA, United States
Duration: Nov 14 2011 to Nov 14 2011

Publication series

Name: ScalA'11 - Proceedings of the 2011 ACM Workshop on Scalable Algorithms for Large-Scale Systems, Co-located with SC'11

Conference

Conference: 2011 ACM Workshop on Scalable Algorithms for Large-Scale Systems, ScalA'11, Co-located with SC'11
Country/Territory: United States
City: Seattle, WA
Period: 11/14/11 to 11/14/11

Keywords

  • GPGPU
  • QR factorization
  • soft error
