TY - GEN
T1 - Havens
T2 - 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016
AU - Hukerikar, Saurabh
AU - Engelmann, Christian
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/11/28
Y1 - 2016/11/28
N2 - Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
UR - http://www.scopus.com/inward/record.url?scp=85007035435&partnerID=8YFLogxK
U2 - 10.1109/HPEC.2016.7761593
DO - 10.1109/HPEC.2016.7761593
M3 - Conference contribution
AN - SCOPUS:85007035435
T3 - 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016
BT - 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 13 September 2016 through 15 September 2016
ER -