Abstract
Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large-scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next-generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
Original language | English |
---|---|
Article number | e4890 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 32 |
Issue number | 3 |
DOIs | |
State | Published - Feb 10 2020 |
Externally published | Yes |
Funding
This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multi mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc, for the US Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.
Keywords
- MPI
- checkpoint/restart
- fault tolerance