Do moldable applications perform better on failure-prone HPC platforms?

Valentin Le Fèvre, George Bosilca, Aurelien Bouteiller, Thomas Herault, Atsushi Hori, Yves Robert, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

This paper compares the performance of different approaches to tolerating failures with checkpoint/restart on large-scale failure-prone platforms. We study (i) RIGID applications, which use a constant number of processors throughout execution; (ii) MOLDABLE applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GRIDSHAPED applications, which are moldable applications restricted to rectangular processor grids (as is the case for many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic application scenario and make it publicly available for further use.

Original language: English
Title of host publication: Euro-Par 2018
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Revised Selected Papers
Editors: Gabriele Mencagli, Dora B. Heras
Publisher: Springer Verlag
Pages: 787-799
Number of pages: 13
ISBN (Print): 9783030105488
State: Published - 2019
Externally published: Yes
Event: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018 - Turin, Italy
Duration: Aug 27 2018 to Aug 28 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11339 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018
Country/Territory: Italy
City: Turin
Period: 08/27/18 to 08/28/18

Keywords

  • Allocation length
  • Checkpoint
  • Moldable applications
  • Resilience
  • Restart
  • Spare nodes
  • Wait time
