Abstract
Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. While prior work focused on developing performance, reliability, and availability models for resilience design patterns, this paper extends that work with a Resilience Design Patterns Modeling (RDPM) tool that allows (1) exploring the performance, reliability, and availability of each resilience design pattern, (2) customizing parameters to optimize performance, reliability, and availability, and (3) investigating trade-off models for combining multiple patterns into practical resilience solutions.
Original language | English |
---|---|
Title of host publication | Euro-Par 2021 |
Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2021 International Workshops, 2021, Revised Selected Papers |
Editors | Ricardo Chaves, Dora B. Heras, Aleksandar Ilic, Didem Unat, Rosa M. Badia, Andrea Bracciali, Patrick Diehl, Anshu Dubey, Oh Sangyoon, Stephen L. Scott, Laura Ricci |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 283-297 |
Number of pages | 15 |
ISBN (Print) | 9783031061554 |
State | Published - 2022 |
Event | 27th International Conference on Parallel and Distributed Computing, Euro-Par 2021 - Virtual, Online; Duration: Aug 30 2021 → Aug 31 2021 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 13098 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 27th International Conference on Parallel and Distributed Computing, Euro-Par 2021 |
---|---|
City | Virtual, Online |
Period | 08/30/21 → 08/31/21 |
Funding
Acknowledgements. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Design patterns
- High-performance computing
- Resilience
- Tool