Using performance tools to support experiments in HPC resilience

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between "performance tools" and "resilience tools". As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnostic procedures help provide context about the state of the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.

Original language: English
Title of host publication: Euro-Par 2013
Subtitle of host publication: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Revised Selected Papers
Publisher: Springer Verlag
Pages: 727-736
Number of pages: 10
ISBN (Print): 9783642544194
State: Published - 2014
Event: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013 - Aachen, Germany
Duration: Aug 26 2013 to Aug 27 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8374 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Parallel Processing Workshops, Euro-Par 2013 - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013
Country/Territory: Germany
City: Aachen
Period: 08/26/13 to 08/27/13
