Concepts for OpenMP target offload resilience

Christian Engelmann, Geoffroy R. Vallée, Swaroop Pophale

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing graphics processing unit (GPGPU) errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.

Original language: English
Title of host publication: OpenMP
Subtitle of host publication: Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Proceedings
Editors: Xing Fan, Oliver Sinnen, Nasser Giacaman, Bronis R. de Supinski
Publisher: Springer Verlag
Pages: 78-93
Number of pages: 16
ISBN (Print): 9783030285951
State: Published - 2019
Event: 15th International Workshop on OpenMP, IWOMP 2019 - Auckland, New Zealand
Duration: Sep 11 2019 to Sep 13 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11718 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th International Workshop on OpenMP, IWOMP 2019
Country/Territory: New Zealand
City: Auckland
Period: 09/11/19 to 09/13/19

Funding

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • OpenMP
  • Resilience
  • Supercomputing
