Versioned distributed arrays for resilience in scientific applications: Global View Resilience

  • A. Chien
  • P. Balaji
  • P. Beckman
  • N. Dun
  • A. Fang
  • H. Fujita
  • K. Iskra
  • Z. Rubenstein
  • Z. Zheng
  • R. Schreiber
  • J. Hammond
  • J. Dinan
  • I. Laguna
  • D. Richards
  • A. Dubey
  • B. Van Straalen
  • M. Hoemmen
  • M. Heroux
  • K. Teranishi
  • A. Siegel

Research output: Contribution to journal › Conference article › peer-review

24 Scopus citations

Abstract

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads <2% are achieved. We conclude that GVR's interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.

Original language: English
Pages (from-to): 29-38
Number of pages: 10
Journal: Procedia Computer Science
Volume: 51
Issue number: 1
DOIs
State: Published - 2015
Event: International Conference on Computational Science, ICCS 2002 - Amsterdam, Netherlands
Duration: Apr 21 2002 → Apr 24 2002

Funding

This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357. This work was completed in part with resources provided by the University of Chicago Research Computing Center; the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; and the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Keywords

  • Application-based fault tolerance
  • Exascale
  • Fault tolerance
  • Resilience
  • Scalable computing
