Abstract
For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and system software design. Among these challenges, providing resiliency and stability to scientific applications in the presence of high fault rates requires new approaches to software architecture and design. As HPC systems become increasingly complex, they require intricate solutions for detecting and mitigating the various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. These resiliency solutions often interact with and affect other system properties, including application scalability and power and energy efficiency. Therefore, resilience solutions for HPC systems must be thoughtfully engineered and deployed.

In previous work, we developed the concept of resilience design patterns, which consist of templated solutions based on well-established techniques for detection, mitigation and recovery. In this paper, we use these patterns as the foundation to propose new approaches to designing runtime systems for HPC systems. The instantiation of these patterns within a runtime system enables flexible and adaptable end-to-end resiliency solutions for HPC environments. The paper describes the architecture of the runtime system, named Plexus, and the strategies for dynamically composing and adapting pattern instances under runtime control. This runtime-based approach enables actively balancing the cost-benefit trade-off between performance overhead and protection coverage of the resilience solutions. Based on a prototype implementation of Plexus, we demonstrate the resiliency and performance gains achieved by the pattern-based runtime system for a parallel linear solver application.
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 IEEE 25th Pacific Rim International Symposium on Dependable Computing, PRDC 2020 |
Publisher | IEEE Computer Society |
Pages | 31-39 |
Number of pages | 9 |
ISBN (Electronic) | 9781728180038 |
State | Published - Dec 2020 |
Event | 25th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2020 - Perth, Australia; Duration: Dec 1 2020 → Dec 4 2020 |
Publication series
Name | Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC |
---|---|
Volume | 2020-December |
ISSN (Print) | 1541-0110 |
Conference
Conference | 25th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2020 |
---|---|
Country/Territory | Australia |
City | Perth |
Period | 12/1/20 → 12/4/20 |
Funding
This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Early Career Program, with program managers Robinson Pino and Lucy Nowell. This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- exascale computing
- high-performance computing
- resilience
- runtime systems
- software patterns