TY - GEN
T1 - Models for Resilience Design Patterns
AU - Kumar, Mohit
AU - Engelmann, Christian
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - Resilience plays an important role in supercomputers by providing correct and efficient operation in case of faults, errors, and failures. Resilience design patterns offer blueprints for effectively applying resilience technologies. Prior work focused on developing initial efficiency and performance models for resilience design patterns. This paper extends it by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool that calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.
KW - design patterns
KW - high-performance computing
KW - models
KW - resilience
UR - http://www.scopus.com/inward/record.url?scp=85099597416&partnerID=8YFLogxK
U2 - 10.1109/FTXS51974.2020.00008
DO - 10.1109/FTXS51974.2020.00008
M3 - Conference contribution
AN - SCOPUS:85099597416
T3 - Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 21
EP - 30
BT - Proceedings of FTXS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2020
Y2 - 11 November 2020
ER -