TY - GEN
T1 - Programmer-guided reliability for extreme-scale applications
AU - Bernholdt, David E.
AU - Elwasif, Wael R.
AU - Kartsaklis, Christos
AU - Lee, Seyong
AU - Mintz, Tiffany M.
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/10/26
Y1 - 2015/10/26
N2 - We present 'programmer-guided reliability' (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
AB - We present 'programmer-guided reliability' (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
KW - Application resilience
KW - Fault tolerance
KW - Scientific computing
UR - http://www.scopus.com/inward/record.url?scp=84959288416&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2015.105
DO - 10.1109/CLUSTER.2015.105
M3 - Conference contribution
AN - SCOPUS:84959288416
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 571
EP - 579
BT - Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Cluster Computing, CLUSTER 2015
Y2 - 8 September 2015 through 11 September 2015
ER -