TY - GEN
T1 - Design for a Soft Error Resilient Dynamic Task-Based Runtime
AU - Cao, Chongxiao
AU - Herault, Thomas
AU - Bosilca, George
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/7/17
Y1 - 2015/7/17
N2 - As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
AB - As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
KW - Fault tolerance
KW - runtime
KW - soft error resilience
UR - http://www.scopus.com/inward/record.url?scp=84971406186&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2015.81
DO - 10.1109/IPDPS.2015.81
M3 - Conference contribution
AN - SCOPUS:84971406186
T3 - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
SP - 765
EP - 774
BT - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Y2 - 25 May 2015 through 29 May 2015
ER -