TY - GEN
T1 - Optimal cooperative checkpointing for shared high-performance computing platforms
AU - Herault, Thomas
AU - Robert, Yves
AU - Bouteiller, Aurelien
AU - Arnold, Dorian
AU - Ferreira, Kurt
AU - Bosilca, George
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/8/3
Y1 - 2018/8/3
N2 - In high-performance computing environments, input/output (I/O) from various sources often contend for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) place an additional burden as it increase I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
AB - In high-performance computing environments, input/output (I/O) from various sources often contend for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) place an additional burden as it increase I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
KW - Checkpoint
KW - Cooperative strategy
KW - I/O contention
KW - Workflow
UR - http://www.scopus.com/inward/record.url?scp=85052224235&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2018.00127
DO - 10.1109/IPDPSW.2018.00127
M3 - Conference contribution
AN - SCOPUS:85052224235
SN - 9781538655559
T3 - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
SP - 803
EP - 812
BT - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Y2 - 21 May 2018 through 25 May 2018
ER -