GPU-ABFT: Optimizing algorithm-based fault tolerance for heterogeneous systems with GPUs

Jieyang Chen, Sihuan Li, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

21 Scopus citations

Abstract

For matrix operations, algorithm-based fault tolerance (ABFT) incurs much lower overhead than traditional Triple Modular Redundancy or Double Modular Redundancy approaches. Much work has been done to develop and optimize ABFT schemes on general-purpose microprocessors. However, ABFT schemes for heterogeneous systems with GPUs are not yet fully developed or optimized. Moreover, existing ABFT schemes can correct computing errors caused by logic components, but many memory storage errors cannot be detected or corrected by current ABFT schemes. In this work, we design a new ABFT scheme that protects both computation and memory storage, and we apply it to Cholesky decomposition on heterogeneous systems with GPUs. In addition, we develop several techniques for reducing fault tolerance overhead specifically on heterogeneous systems with GPU accelerators. Experimental results show that our ABFT scheme can correct both computing errors and memory storage errors with low overhead and comparable overall performance.
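To illustrate the general idea behind ABFT for matrix operations, here is a minimal sketch of the classic checksum scheme for matrix multiplication. It is not the scheme proposed in this paper, which targets Cholesky decomposition on GPUs and also covers memory storage errors. The function names (`matmul`, `with_column_checksum`, `with_row_checksum`, `abft_matmul`) and the `inject` parameter are hypothetical, and integer inputs are assumed so that checksums can be compared exactly; floating-point code would need a tolerance.

```python
# Sketch of classic checksum-based ABFT for C = A * B.
# Assumption: integer matrices as lists of lists, at most one corrupted
# element in the data part of the result. Not the paper's GPU scheme.

def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b, inject=None):
    """Multiply checksum-encoded inputs, optionally corrupt one entry
    (inject = (row, col, delta), hypothetical fault model), then detect
    and correct a single wrong element using the checksums."""
    cf = matmul(with_column_checksum(a), with_row_checksum(b))
    if inject is not None:
        i, j, delta = inject
        cf[i][j] += delta  # simulate a computing error
    m, n = len(a), len(b[0])
    # A row is inconsistent if its data entries do not sum to its checksum.
    bad_rows = [i for i in range(m) if sum(cf[i][:n]) != cf[i][n]]
    # A column is inconsistent if its data entries do not sum to its checksum.
    bad_cols = [j for j in range(n)
                if sum(cf[i][j] for i in range(m)) != cf[m][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row residual equals the error magnitude; subtract it.
        cf[i][j] -= sum(cf[i][:n]) - cf[i][n]
    # Strip the checksum row and column before returning the product.
    return [row[:n] for row in cf[:m]], bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, rows, cols = abft_matmul(a, b, inject=(0, 1, 100))
```

The intersection of the inconsistent row and column locates the faulty element, and the checksum residual gives the correction. The extra cost is one checksum row and one checksum column, rather than the full recomputation that modular redundancy requires.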

Original language: English
Title of host publication: 2016 IEEE International Conference on Networking Architecture and Storage, NAS 2016 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509033157
State: Published - Aug 23 2016
Event: 11th IEEE International Conference on Networking Architecture and Storage, NAS 2016 - Long Beach, United States
Duration: Aug 8 2016 to Aug 10 2016

Publication series

Name: 2016 IEEE International Conference on Networking Architecture and Storage, NAS 2016 - Proceedings

Conference

Conference: 11th IEEE International Conference on Networking Architecture and Storage, NAS 2016
Country/Territory: United States
City: Long Beach
Period: 08/8/16 to 08/10/16
