Abstract
Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate at such high memory fault rates. In this paper, we propose a partial memory protection scheme based on region-based memory management. We define regions, called havens, that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
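To make the idea of a software-maintained parity over a memory region concrete, the following is a minimal sketch in C. It is an illustration under stated assumptions, not the scheme from the paper: the type `haven_t` and the functions `haven_init`, `haven_write`, `haven_check`, and `haven_free` are hypothetical names, and a single XOR parity word over 64-bit words is assumed as the simplest possible parity layout. The sketch only detects corruption; it does not attempt recovery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical haven: a fixed-size region of 64-bit words guarded by one
   XOR parity word. Names and layout are illustrative assumptions, not the
   paper's API. */
typedef struct {
    uint64_t *words;   /* protected data */
    size_t    count;   /* number of words in the region */
    uint64_t  parity;  /* XOR of all words, maintained in software */
} haven_t;

/* Allocate a zero-filled region; the parity of all-zero data is zero.
   Returns 0 on success, -1 if allocation fails. */
static int haven_init(haven_t *h, size_t count) {
    h->words = calloc(count, sizeof *h->words);
    if (h->words == NULL) return -1;
    h->count = count;
    h->parity = 0;
    return 0;
}

/* Store through the haven so the parity stays consistent: XOR out the old
   word's contribution and XOR in the new one. */
static void haven_write(haven_t *h, size_t i, uint64_t value) {
    assert(i < h->count);
    h->parity ^= h->words[i] ^ value;
    h->words[i] = value;
}

/* Recompute the XOR over the region and compare it with the stored parity.
   Returns 1 if consistent, 0 if corruption is detected. */
static int haven_check(const haven_t *h) {
    uint64_t p = 0;
    for (size_t i = 0; i < h->count; i++) p ^= h->words[i];
    return p == h->parity;
}

/* Release the region and reset the bookkeeping fields. */
static void haven_free(haven_t *h) {
    free(h->words);
    h->words = NULL;
    h->count = 0;
    h->parity = 0;
}
```

In this sketch, a program would place a critical object in the region by writing it through `haven_write` and would call `haven_check` before trusting the data. Any single bit flip in the region changes the recomputed XOR and is therefore detected; two flips at the same bit position in different words cancel out, which is a known limit of plain parity.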
Original language | English |
---|---|
Title of host publication | 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9781509035250 |
State | Published - Nov 28 2016 |
Event | 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016 - Waltham, United States; Duration: Sep 13 2016 → Sep 15 2016 |
Publication series
Name | 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016 |
---|---|
Conference
Conference | 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016 |
---|---|
Country/Territory | United States |
City | Waltham |
Period | 09/13/16 → 09/15/16 |
Bibliographical note
Publisher Copyright: © 2016 IEEE.