TY - GEN
T1 - A tunable, software-based DRAM error detection and correction library for HPC
AU - Fiala, David
AU - Ferreira, Kurt B.
AU - Mueller, Frank
AU - Engelmann, Christian
PY - 2012
Y1 - 2012
N2 - Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error rate. As a result, additional software will be needed to address this challenge. In this paper, we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC is able to achieve SDC protection with 50% resource overhead, less than the 100% needed for double modular redundancy.
AB - Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error rate. As a result, additional software will be needed to address this challenge. In this paper, we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC is able to achieve SDC protection with 50% resource overhead, less than the 100% needed for double modular redundancy.
UR - http://www.scopus.com/inward/record.url?scp=84882605687&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-29740-3_29
DO - 10.1007/978-3-642-29740-3_29
M3 - Conference contribution
AN - SCOPUS:84882605687
SN - 9783642297397
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 251
EP - 261
BT - Euro-Par 2011
PB - Springer Verlag
T2 - 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011
Y2 - 29 August 2011 through 2 September 2011
ER -