Abstract
Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.
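The qualitative findings in the abstract can be illustrated with a toy model. The sketch below is not the authors' methodology or simulator. It is a minimal bulk-synchronous model built on assumptions introduced here for illustration: every process computes a fixed number of ticks per iteration, a collective ends each iteration, and each process is interrupted periodically with an independent random phase. The names `finish_time` and `slowdown` and all parameter values are hypothetical. Every configuration keeps the same 10% aggregate overhead and varies only how that overhead is divided into interruptions.

```python
import random


def finish_time(start, work, phase, period, duration):
    """Wall-clock tick at which `work` ticks of computation complete.

    Assumed interruption model: the process is stalled during
    [phase + k*period, phase + k*period + duration) for every integer k,
    and computation only advances outside those windows.
    """
    t, remaining = start, work
    while True:
        offset = (t - phase) % period
        if offset < duration:            # currently interrupted: wait it out
            t += duration - offset
            offset = duration
        available = period - offset      # free ticks before the next interruption
        if remaining <= available:
            return t + remaining
        t += available
        remaining -= available


def slowdown(duration, period, procs=64, iters=200, work_per_iter=1000, seed=0):
    """Runtime relative to an interruption-free run of the same toy application."""
    rng = random.Random(seed)
    phases = [rng.randrange(period) for _ in range(procs)]  # unsynchronized activities
    t = 0
    for _ in range(iters):
        # Collective: the next iteration starts when the slowest process finishes.
        t = max(finish_time(t, work_per_iter, ph, period, duration) for ph in phases)
    return t / (iters * work_per_iter)


# Same aggregate overhead (duration / period = 10%), different interruption lengths.
configs = [(10, 100), (100, 1_000), (1_000, 10_000), (10_000, 100_000)]
results = {d: slowdown(d, p) for d, p in configs}
for d, p in configs:
    print(f"duration={d:>6} period={p:>7} overhead={d / p:.0%} slowdown={results[d]:.2f}")
```

In this model, longer interruptions should produce larger slowdowns even though the aggregate overhead is identical, because a single stalled process delays every participant in the next collective. Raising `work_per_iter` while lowering `iters` (fewer collectives for the same total work) is one way to explore the sensitivity to inter-collective frequency that finding (iii) describes.
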
| Original language | English |
|---|---|
| Title of host publication | Euro-Par 2017 |
| Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2017 International Workshops |
| Editors | Dora B. Heras, Luc Bougé |
| Publisher | Springer Verlag |
| Pages | 581-592 |
| Number of pages | 12 |
| ISBN (Print) | 9783319751771 |
| State | Published - 2018 |
| Externally published | Yes |
| Event | International Workshops on Parallel Processing, Euro-Par 2017, Santiago de Compostela, Spain. Duration: Aug 28, 2017 → Aug 29, 2017 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 10659 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | International Workshops on Parallel Processing, Euro-Par 2017 |
|---|---|
| Country/Territory | Spain |
| City | Santiago de Compostela |
| Period | 08/28/17 → 08/29/17 |
Funding
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
Keywords
- Collectives
- Performance
- Resilience
- Scheduling