Abstract
As failure frequency increases with the component count in modern and future supercomputers, resilience is becoming critical for extreme-scale systems. Combining failure prediction with proactive checkpointing seeks to reduce the effect of failures on the execution time of parallel applications. Unfortunately, proactive checkpointing does not systematically avoid restarting from scratch. To mitigate this issue, failure prediction and proactive checkpointing can be coupled with periodic checkpointing. However, blind use of these techniques does not always improve system efficiency, because each of them comes with a mix of overheads and benefits. In order to study and understand the combination of these techniques and the improvement they bring to the system's efficiency, we developed: (i) a prototype combining state-of-the-art failure prediction, fast proactive checkpointing and preventive checkpointing, (ii) a mathematical model that reflects the expected computing efficiency of the combination and computes the optimal checkpointing interval in this context, and (iii) a discrete event simulator to evaluate the computing efficiency of the combination for system parameters corresponding to current and projected large-scale HPC systems. We evaluate our proposed technique on a large supercomputer (TSUBAME2) with production-level HPC applications, and we show that failure prediction, proactive and preventive checkpointing can be coupled successfully, imposing only about 2% to 6% of overhead in comparison with preventive checkpointing alone. Moreover, our model-based simulations show that the optimal solution improves the computing efficiency by up to 30% in comparison with classic periodic checkpointing. We show that the prediction recall has a much higher impact on execution efficiency than the prediction precision. This result suggests that researchers working on failure prediction algorithms should focus on improving the recall.
We also show that the combination of these techniques can significantly improve (by a factor of 2, for a particular configuration) the mean time between failures (MTBF) perceived by the application.
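To make the link between recall, perceived MTBF and checkpointing interval concrete, the following is a minimal sketch and not the model developed in the paper. It assumes that every correctly predicted failure is fully absorbed by a proactive checkpoint, so only unpredicted failures reach the application, and it uses Young's classic first-order approximation for the periodic interval. The function names and the example values (1-hour MTBF, 60-second checkpoint cost) are hypothetical, and the cost of false alarms (precision) is ignored.

```python
import math


def perceived_mtbf(system_mtbf: float, recall: float) -> float:
    """MTBF seen by the application if a fraction `recall` of failures is
    predicted and handled proactively (simplifying assumption)."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    # Only the unpredicted fraction (1 - recall) still interrupts the run.
    return system_mtbf / (1.0 - recall)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the periodic checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    system_mtbf = 3600.0      # seconds, hypothetical value
    checkpoint_cost = 60.0    # seconds, hypothetical value
    for recall in (0.0, 0.5, 0.8):
        mtbf = perceived_mtbf(system_mtbf, recall)
        interval = young_interval(checkpoint_cost, mtbf)
        print(f"recall={recall:.1f}  perceived MTBF={mtbf:.0f}s  "
              f"interval={interval:.0f}s")
```

Under these assumptions, a recall of 0.5 doubles the perceived MTBF, which lengthens the periodic interval and reduces the time spent on preventive checkpoints.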
Original language | English |
---|---|
Pages | 501-512 |
Number of pages | 12 |
State | Published - 2013 |
Externally published | Yes |
Event | 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013 - Boston, MA, United States; Duration: May 20, 2013 → May 24, 2013 |
Conference
Conference | 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013 |
---|---|
Country/Territory | United States |
City | Boston, MA |
Period | 05/20/13 → 05/24/13 |
Keywords
- Failure prediction
- large scale HPC systems
- multilevel checkpointing
- resilience