Havens: Explicit reliable memory regions for HPC applications

Saurabh Hukerikar, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Scopus citations

Abstract

Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
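As a rough illustration of the idea in the abstract, the sketch below shows one way a region with software parity could look. It is not the authors' implementation: the `Haven` class, its method names, the 8-byte word size, the bump allocator, and the single XOR parity word are all assumptions made for this example. This simple variant only detects corruption; it does not attempt recovery.

```python
class Haven:
    """Hypothetical fixed-size memory region guarded by one XOR parity word."""

    WORD = 8  # bytes per parity word (an assumption for this sketch)

    def __init__(self, size):
        # Round the capacity up to a whole number of words.
        self.size = -(-size // self.WORD) * self.WORD
        self.data = bytearray(self.size)
        self.parity = 0  # XOR of all words; zero for an all-zero region
        self.top = 0     # bump-allocation offset

    def alloc(self, nbytes):
        """Reserve nbytes inside the region and return the starting offset."""
        if self.top + nbytes > self.size:
            raise MemoryError("haven is full")
        offset = self.top
        self.top += nbytes
        return offset

    def _word(self, index):
        start = index * self.WORD
        return int.from_bytes(self.data[start:start + self.WORD], "little")

    def write(self, offset, payload):
        """Store payload and update the parity word incrementally."""
        if offset < 0 or offset + len(payload) > self.size:
            raise IndexError("write outside the haven")
        first = offset // self.WORD
        last = (offset + len(payload) - 1) // self.WORD
        for i in range(first, last + 1):
            self.parity ^= self._word(i)  # remove the old contribution
        self.data[offset:offset + len(payload)] = payload
        for i in range(first, last + 1):
            self.parity ^= self._word(i)  # add the new contribution

    def read(self, offset, nbytes):
        return bytes(self.data[offset:offset + nbytes])

    def verify(self):
        """Recompute the parity from scratch and compare it to the stored one."""
        check = 0
        for i in range(self.size // self.WORD):
            check ^= self._word(i)
        return check == self.parity


# A critical object is placed in the haven; a simulated bit flip is then detected.
haven = Haven(64)
off = haven.alloc(16)
haven.write(off, b"critical-state!!")
intact = haven.verify()
haven.data[off + 3] ^= 0x10  # simulate a transient single-bit error
corrupted_detected = not haven.verify()
```

Because the check works on raw bytes of the region, it does not depend on what the application stores there, which mirrors the application-agnostic coverage the abstract describes. A single parity word cannot locate or repair the faulty word, and two flips in the same bit position of different words would cancel out, so a practical scheme would need more than this sketch provides.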

Original language: English
Title of host publication: 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509035250
State: Published - Nov 28 2016
Event: 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016 - Waltham, United States
Duration: Sep 13 2016 to Sep 15 2016

Conference

Conference: 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016
Country/Territory: United States
City: Waltham
Period: 09/13/16 to 09/15/16

Bibliographical note

Publisher Copyright:
© 2016 IEEE.
