Design for a Soft Error Resilient Dynamic Task-Based Runtime

Chongxiao Cao, Thomas Herault, George Bosilca, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

25 Scopus citations

Abstract

As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exactable a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.

Original languageEnglish
Title of host publicationProceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages765-774
Number of pages10
ISBN (Electronic)9781479986484
DOIs
StatePublished - Jul 17 2015
Event29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015 - Hyderabad, India
Duration: May 25 2015May 29 2015

Publication series

NameProceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Conference

Conference29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Country/TerritoryIndia
CityHyderabad
Period05/25/1505/29/15

Funding

FundersFunder number
National Science Foundation1339820

    Keywords

    • Fault tolerance
    • runtime
    • soft error resilience

    Fingerprint

    Dive into the research topics of 'Design for a Soft Error Resilient Dynamic Task-Based Runtime'. Together they form a unique fingerprint.

    Cite this