Programmer-guided reliability for extreme-scale applications

David E. Bernholdt, Wael R. Elwasif, Christos Kartsaklis, Seyong Lee, Tiffany M. Mintz

Research output: Contribution to journal; Article; peer-review

Abstract

We present “programmer-guided reliability” (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
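
As an illustration of the two ideas in the abstract (programmer-placed error detectors and a control layer that turns them on or off at run time without recompilation), the following is a minimal sketch and not the authors' framework. Every name in it (`DetectorControl`, `check_finite`, `check_conserved`, `DataCorruptionError`, and the `PGR_DETECTORS` environment variable) is hypothetical and chosen only for this example.

```python
import math
import os


class DataCorruptionError(RuntimeError):
    """Raised by a detector when data looks corrupted (hypothetical type)."""


class DetectorControl:
    """Hypothetical run-time switchboard for error detectors.

    Assumption: the initial set of enabled detectors comes from a
    comma-separated environment variable, so no recompilation is needed.
    """

    def __init__(self, env_var="PGR_DETECTORS"):
        spec = os.environ.get(env_var, "")
        self.enabled = {name.strip() for name in spec.split(",") if name.strip()}

    def enable(self, name):
        # Dynamic reconfiguration while the application is running.
        self.enabled.add(name)

    def disable(self, name):
        self.enabled.discard(name)

    def is_enabled(self, name):
        return name in self.enabled


def check_finite(control, name, values):
    """Generic detector: NaN or Inf in data that should always be finite."""
    if not control.is_enabled(name):
        return
    for index, value in enumerate(values):
        if not math.isfinite(value):
            raise DataCorruptionError(f"{name}: non-finite value at index {index}")


def check_conserved(control, name, values, expected_total, rel_tol=1e-9):
    """Domain-specific detector: a quantity the programmer knows is conserved."""
    if not control.is_enabled(name):
        return
    total = math.fsum(values)
    if not math.isclose(total, expected_total, rel_tol=rel_tol):
        raise DataCorruptionError(
            f"{name}: total {total!r} differs from expected {expected_total!r}"
        )
```

In this sketch the programmer decides where the calls go (for example, after a time step that should conserve total mass), while the control object decides whether each call does any work. A disabled detector returns immediately, so its cost can be traded against coverage during a run.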

Original language: English
Pages (from-to): 598-612
Number of pages: 15
Journal: International Journal of High Performance Computing Applications
Volume: 32
Issue number: 5
State: Published - Sep 1 2018

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • Applications
  • error detection
  • fault tolerance
  • resilience
  • soft errors
