Abstract
Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing graphics processing unit (GPGPU) errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.
Original language | English |
---|---|
Title of host publication | OpenMP |
Subtitle of host publication | Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Proceedings |
Editors | Xing Fan, Oliver Sinnen, Nasser Giacaman, Bronis R. de Supinski |
Publisher | Springer Verlag |
Pages | 78-93 |
Number of pages | 16 |
ISBN (Print) | 9783030285951 |
State | Published - 2019 |
Event | 15th International Workshop on OpenMP, IWOMP 2019 - Auckland, New Zealand; Duration: Sep 11 2019 → Sep 13 2019 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 11718 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 15th International Workshop on OpenMP, IWOMP 2019 |
---|---|
Country/Territory | New Zealand |
City | Auckland |
Period | 09/11/19 → 09/13/19 |
Funding
Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- OpenMP
- Resilience
- Supercomputing