Hessenberg reduction with transient error resilience on GPU-based hybrid architectures

Yulu Jia, Piotr Luszczek, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
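The checksum idea behind algorithm-based fault tolerance, which the abstract names as one of its building blocks, can be illustrated with a minimal sketch. This is not the paper's algorithm: the paper maintains its protection throughout a hybrid GPU-CPU Hessenberg reduction, whereas the sketch below only shows how row and column checksums locate and repair a single corrupted entry in a static matrix. The function names (`row_sums`, `col_sums`, `locate_and_correct`) and the tolerance parameter are hypothetical, chosen for illustration.

```python
def row_sums(a):
    """Checksum of each row of a matrix stored as a list of lists."""
    return [sum(row) for row in a]


def col_sums(a):
    """Checksum of each column of a matrix stored as a list of lists."""
    return [sum(col) for col in zip(*a)]


def locate_and_correct(a, row_chk, col_chk, tol=1e-9):
    """Repair at most one corrupted entry of `a` in place.

    `row_chk` and `col_chk` are checksums taken while `a` was known good.
    Returns the (row, column) index of the repaired entry, or None if the
    matrix still matches its checksums. Hypothetical helper, not from the paper.
    """
    bad_rows = [(i, s - c)
                for i, (s, c) in enumerate(zip(row_sums(a), row_chk))
                if abs(s - c) > tol]
    bad_cols = [j
                for j, (s, c) in enumerate(zip(col_sums(a), col_chk))
                if abs(s - c) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("pattern is not a single corrupted entry")
    i, delta = bad_rows[0]
    j = bad_cols[0]
    a[i][j] -= delta  # the row mismatch equals the injected error
    return (i, j)


# Take checksums of a known-good matrix, then simulate a soft error.
a = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [2.0, 0.0, 6.0]]
original = [row[:] for row in a]
row_chk, col_chk = row_sums(a), col_sums(a)

a[1][2] += 8.0  # simulated transient error in one entry
repaired = locate_and_correct(a, row_chk, col_chk)
```

The mismatching row and column intersect at the corrupted entry, and the size of the row mismatch gives the correction. Handling several simultaneous errors, as the abstract describes, needs more than this single pair of checksums.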

Original language: English
Title of host publication: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 653-662
Number of pages: 10
ISBN (Electronic): 9781509021406
State: Published - Jul 18 2016
Event: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
Duration: May 23 2016 to May 27 2016

Publication series

Name: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Conference

Conference: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Country/Territory: United States
City: Chicago
Period: 05/23/16 to 05/27/16

Keywords

  • Fault-tolerance
  • GPGPU
  • Hessenberg reduction
  • Similarity transformation
