Strategies for fault tolerance in multicomponent applications

Aniruddha G. Shet, Wael R. Elwasif, Samantha S. Foley, Byung H. Park, David E. Bernholdt, Randall Bramley

Research output: Contribution to journal › Conference article › peer-review

9 Scopus citations

Abstract

This paper discusses on-going work with the Integrated Plasma Simulator (IPS), a framework for coupled multiphysics simulations of plasmas, to allow simulations to run through the loss of nodes on which the simulation is executing. While many different techniques are available to improve the fault tolerance of computational science applications on high-performance computer systems, checkpoint/restart (C/R) remains virtually the only one that sees widespread use in practice. Our focus here is to augment the traditional C/R approach with additional techniques that can provide a more localized and tailored response to faults based on the ability to restart failed tasks on an individual basis, and the use of information external to the application itself in order to guide decision-making, in many cases avoiding the need to stop and restart the entire simulation. This capability involves several features within the IPS framework, and leverages the Fault Tolerance Backplane, a publish/subscribe event service to disseminate fault-related information throughout HPC systems, to obtain information from the Reliability, Availability and Serviceability (RAS) subsystem of the HPC system. This work is described in the context of Cray XT-series computer systems for concreteness, but is applicable to other environments as well. As part of the analysis of this work, we discuss the requirements to generalize this approach to other complex simulation applications beyond the Integrated Plasma Simulator.

Original language: English
Pages (from-to): 2287-2296
Number of pages: 10
Journal: Procedia Computer Science
Volume: 4
State: Published - 2011
Event: 11th International Conference on Computational Science, ICCS 2011 - Singapore, Singapore
Duration: Jun 1 2011 → Jun 3 2011

Funding

This work has been supported by the U.S. Department of Energy, Office of Science, Offices of Advanced Scientific Computing Research (ASCR) and Fusion Energy Sciences (FES). It has also been supported by the ORNL Postmasters and Postdoctoral Research Participation Programs and the ORNL Higher Education Research Experiences Program, which are sponsored by ORNL and administered jointly by ORNL and by the Oak Ridge Institute for Science and Education (ORISE). This research also used resources of the Oak Ridge Leadership Computing Facility at ORNL. ORNL is managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. ORISE is managed by Oak Ridge Associated Universities for the U.S. Department of Energy under Contract No. DE-AC05-00OR22750.

Keywords

  • Application fault tolerance
  • Computational science
  • Multiphysics framework
