Self-healing of workflow activity incidents on distributed computing infrastructures

Rafael Ferreira Da Silva, Tristan Glatard, Frédéric Desprez

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.

Original language: English
Pages (from-to): 2284-2294
Number of pages: 11
Journal: Future Generation Computer Systems
Volume: 29
Issue number: 8
State: Published - 2013
Externally published: Yes

Funding

This work is funded by the French National Agency for Research under grant ANR-09-COSI-03 “VIP”. The research leading to this publication has also received funding from the EC FP7 Programme under grant agreement 312579 ER-flow = Building an European Research Community through Interoperable Workflows and Data, and the framework LABEX ANR-11-LABX-0063 of Université de Lyon, within the program “Investissements d’Avenir” (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR). We thank the European Grid Initiative and National Grid Initiatives, in particular France-Grilles, for providing the infrastructure and technical support. We also thank Ting Li and Olivier Bernard for providing optimization use-cases to the Virtual Imaging Platform.

Funders                                Funder number
EC FP7                                 ANR-11-LABX-0063
French National Agency for Research    ANR-09-COSI-03
Seventh Framework Programme            312579, 261323
Agence Nationale de la Recherche
Université de Lyon                     ANR-11-IDEX-0007

    Keywords

    • Error detection and handling
    • Production distributed systems
    • Workflow execution
