A tunable, software-based DRAM error detection and correction library for HPC

David Fiala, Kurt B. Ferreira, Frank Mueller, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error frequency rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is able to achieve SDC protection with 50% overhead of resources, less than the 100% needed for double modular redundancy.
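The abstract names on-demand page integrity verification without giving details, so the following is only a minimal sketch of the general idea, not LIBSDC's implementation. It assumes a 4096-byte page, uses CRC32 as a stand-in checksum, and models memory as a `bytearray`; the function names `page_checksums`, `verify_page`, and `flip_bit` are hypothetical. It illustrates detection only, not correction.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes (not stated in the abstract)


def page_checksums(memory, page_size=PAGE_SIZE):
    """Compute one CRC32 checksum per page of the simulated memory."""
    return [
        zlib.crc32(memory[off:off + page_size])
        for off in range(0, len(memory), page_size)
    ]


def verify_page(memory, checksums, page_index, page_size=PAGE_SIZE):
    """Recompute a single page's checksum on demand and compare it
    with the stored value. Returns True if the page is intact."""
    off = page_index * page_size
    return zlib.crc32(memory[off:off + page_size]) == checksums[page_index]


def flip_bit(memory, byte_offset, bit):
    """Simulate a soft error by flipping one bit in place."""
    memory[byte_offset] ^= 1 << bit


if __name__ == "__main__":
    memory = bytearray(3 * PAGE_SIZE)      # three zero-filled pages
    checksums = page_checksums(memory)     # stored while data is known good
    flip_bit(memory, PAGE_SIZE + 10, 3)    # corrupt one bit in page 1
    for page in range(3):
        status = "ok" if verify_page(memory, checksums, page) else "corrupted"
        print(f"page {page}: {status}")
```

Verifying a page only when it is accessed, rather than rescanning all of memory, is what keeps the cost of such a scheme proportional to the pages an application actually touches.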

Original language: English
Title of host publication: Euro-Par 2011
Subtitle of host publication: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Revised Selected Papers
Publisher: Springer Verlag
Pages: 251-261
Number of pages: 11
Edition: PART 2
ISBN (Print): 9783642297397
State: Published - 2012
Event: 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011 - Bordeaux, France
Duration: Aug 29 2011 → Sep 2 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 7156 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011
Country/Territory: France
City: Bordeaux
Period: 08/29/11 → 09/2/11
