It’s not the heat, it’s the humidity: Scheduling resilience activity at scale

Patrick M. Widener, Kurt B. Ferreira, Scott Levy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.
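Findings (i) and (ii) can be illustrated with a toy model. The sketch below is not the authors' simulator or methodology. It assumes a bulk-synchronous application in which every process computes for `work` time units and then joins a collective, and each process is interrupted for `duration` units at the start of every `period`-unit cycle, with interruption start times staggered evenly across processes. All function names, parameters, and numeric values are hypothetical and chosen only for illustration.

```python
"""Toy model (illustrative assumption, not the paper's method):
equal aggregate resilience overhead, different interruption lengths."""


def finish_time(start, work, duration, period, offset):
    """Wall-clock time at which a process starting at `start` completes
    `work` units of computation, when it is interrupted for `duration`
    units at the beginning of every `period`-unit cycle (shifted by
    `offset`). All quantities are integers in one arbitrary time unit."""
    t, remaining = start, work
    while True:
        pos = (t - offset) % period       # position inside the current cycle
        if pos < duration:                # inside an interruption: wait it out
            t += duration - pos
            pos = duration
        available = period - pos          # compute time before the next interruption
        if remaining <= available:
            return t + remaining
        t += available
        remaining -= available


def simulate(n_procs, n_steps, work, duration, period):
    """Total runtime of `n_steps` compute-then-collective steps.
    Each collective completes only when the slowest process arrives."""
    if not 0 <= duration < period:
        raise ValueError("duration must satisfy 0 <= duration < period")
    # Hypothetical uncoordinated schedule: interruptions staggered evenly.
    offsets = [(i * period) // n_procs for i in range(n_procs)]
    t = 0
    for _ in range(n_steps):
        t = max(finish_time(t, work, duration, period, off) for off in offsets)
    return t


if __name__ == "__main__":
    n_procs, n_steps, work = 64, 100, 50
    ideal = n_steps * work
    # Both schedules interrupt each process for 10% of wall-clock time.
    short = simulate(n_procs, n_steps, work, duration=10, period=100)
    long_ = simulate(n_procs, n_steps, work, duration=100, period=1000)
    print(short / ideal, long_ / ideal)
```

With these assumed values the script prints `1.2 3.0`: both schedules spend the same fraction of each process's time on resilience activity, yet the longer interruptions slow this toy application far more, because at every collective some process is still waiting out an interruption. Raising `work` (fewer collectives per unit of computation) narrows the gap in this model, which is in the spirit of finding (iii).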

Original language: English
Title of host publication: Euro-Par 2017
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2017 International Workshops
Editors: Dora B. Heras, Luc Bouge
Publisher: Springer Verlag
Pages: 581-592
Number of pages: 12
ISBN (Print): 9783319751771
DOIs
State: Published - 2018
Externally published: Yes
Event: International Workshops on Parallel Processing, Euro-Par 2017 - Santiago de Compostela, Spain
Duration: Aug 28 2017 to Aug 29 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10659 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Workshops on Parallel Processing, Euro-Par 2017
Country/Territory: Spain
City: Santiago de Compostela
Period: 08/28/17 to 08/29/17

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Funders (funder number)
U.S. Department of Energy
National Nuclear Security Administration (DE-NA0003525)

    Keywords

    • Collectives
    • Performance
    • Resilience
    • Scheduling
