CPU-GPU hybrid bidiagonal reduction with soft error resilience

Yulu Jia, Piotr Luszczek, George Bosilca, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

6 Scopus citations

Abstract

Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 × 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur. Copyright is held by the owner/author(s).
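
The detect, locate, and correct steps mentioned in the abstract can be illustrated with the classic row and column checksum form of ABFT. The sketch below is not the paper's implementation: it works on a static matrix in pure Python, it does not carry checksums through the bidiagonal reduction, and it does not cover the reverse computation step. The names `checksums` and `locate_and_correct` and the tolerance `tol` are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def checksums(a: Matrix) -> Tuple[List[float], List[float]]:
    """Return (row sums, column sums) of the matrix."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def locate_and_correct(
    a: Matrix,
    row_ref: List[float],
    col_ref: List[float],
    tol: float = 1e-9,  # assumed tolerance for floating-point comparison
) -> Optional[Tuple[int, int]]:
    """Detect, locate, and repair a single corrupted element in place.

    Returns the (row, column) index that was repaired, or None when the
    current checksums agree with the reference checksums.
    """
    row_now, col_now = checksums(a)
    bad_rows = [i for i, (r, s) in enumerate(zip(row_now, row_ref)) if abs(r - s) > tol]
    bad_cols = [j for j, (c, s) in enumerate(zip(col_now, col_ref)) if abs(c - s) > tol]

    if not bad_rows and not bad_cols:
        return None  # no error detected
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        # This simple scheme can only repair one corrupted element.
        raise ValueError("checksum mismatch is not a single-element error")

    i, j = bad_rows[0], bad_cols[0]
    # The row checksum drifted by exactly the size of the error.
    a[i][j] -= row_now[i] - row_ref[i]
    return i, j


# Demonstration with a simulated soft error.
a = [[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
row_ref, col_ref = checksums(a)  # checksums taken before the fault
a[1][2] += 8.0                   # corrupt one element
hit = locate_and_correct(a, row_ref, col_ref)
clean = locate_and_correct(a, row_ref, col_ref)  # second pass finds nothing
```

The mismatching row and column checksums intersect at the corrupted element, and the size of the row mismatch gives the correction. In a factorization, the checksums would have to be updated along with the matrix at every step, which this sketch leaves out.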

Original language: English
Title of host publication: Proc. of ScalA 2013
Subtitle of host publication: Workshop on Latest Adv. in Scalable Algorithms for Large-Scale Systems - Held in Conjunction with SC 2013: The Int. Conf. for High Perform. Comput., Networking, Storage and Anal.
State: Published - 2013
Event: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2013 - Held in Conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 to Nov 21 2013

Publication series

Name: Proc. of ScalA 2013: Workshop on Latest Adv. in Scalable Algorithms for Large-Scale Systems - Held in Conjunction with SC 2013: The Int. Conf. for High Perform. Comput., Networking, Storage and Anal.

Conference

Conference: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2013 - Held in Conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 11/17/13 to 11/21/13

Keywords

  • ABFT
  • Bidiagonalization
  • GPU
  • Hybrid
  • Resilient
  • Reverse computation
  • Soft error
