Evaluating the viability of application-driven cooperative CPU/GPU fault detection

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Trends in high-performance computing are bringing increased heterogeneity among the computational resources within a single machine. Heterogeneous CPU/GPU platforms, however, exacerbate the resilience problems faced by current large-scale systems. Designing efficient resilience strategies is therefore critical to the wider adoption of heterogeneous platforms in future exascale systems. Conventional resilience strategies for GPUs incur significant performance and power overhead because they employ a one-size-fits-all approach that enforces uniform data protection. In addition, isolating CPU protection from GPU protection forgoes optimization opportunities that heterogeneous CPU/GPU platforms provide. In this paper, we explore the viability of an application-driven, cooperative CPU/GPU method for detecting faults that occur in GPU global memory. By selectively protecting application-critical data and leveraging time and space redundancy on the CPU to detect faults, we incur only 2.2% performance overhead while capturing more than 90% of the errors that cause incorrect application results.

Original language: English
Title of host publication: Euro-Par 2013
Subtitle of host publication: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Revised Selected Papers
Publisher: Springer Verlag
Pages: 670-679
Number of pages: 10
ISBN (Print): 9783642544194
State: Published - 2014
Event: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013 - Aachen, Germany
Duration: Aug 26 2013 to Aug 27 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8374 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013
Country/Territory: Germany
City: Aachen
Period: 08/26/13 to 08/27/13

Keywords

  • fault detection
  • heterogeneous computing
