Self-healing of operational workflow incidents on distributed computing infrastructures

Rafael Ferreira Da Silva, Tristan Glatard, Frédéric Desprez

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

15 Scopus citations

Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
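
The pipeline described in the abstract (metric, then incident degree, then incident level, then healing action) can be illustrated with a minimal sketch. This is not the paper's method: the degree formula, the thresholds, and the action names below are all hypothetical placeholders chosen for illustration, and the association-rule step that the paper uses to select actions is replaced here by a plain lookup table.

```python
from statistics import median

# Hypothetical thresholds separating incident levels. The paper parametrizes
# its levels on production traces; those values are not reproduced here.
LEVEL_THRESHOLDS = [0.3, 0.6, 0.9]

# Hypothetical mapping from incident level to a healing action. The paper
# selects actions through association rules, which this lookup does not model.
ACTIONS = {0: "none", 1: "monitor", 2: "replicate task", 3: "stop activity"}


def long_tail_degree(elapsed, completed_durations):
    """Illustrative incident degree in [0, 1): how far a running task's
    elapsed time exceeds the median duration of already completed tasks."""
    if not completed_durations:
        return 0.0  # nothing to compare against yet
    typical = median(completed_durations)
    if elapsed <= typical:
        return 0.0  # the task is not lagging behind
    return 1.0 - typical / elapsed


def incident_level(degree):
    """Count how many thresholds the degree meets or exceeds."""
    return sum(degree >= t for t in LEVEL_THRESHOLDS)


def select_action(degree):
    """Map an incident degree to the (hypothetical) healing action."""
    return ACTIONS[incident_level(degree)]
```

For example, a task that has been running for 40 s while completed tasks typically took 10 s gets a degree of 0.75 under this toy formula, which falls in level 2. Because the degree only needs the elapsed time and the durations seen so far, it can be recomputed online as tasks finish, in the spirit of the abstract's requirement that metrics be cheap and assumption-light.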

Original language: English
Title of host publication: Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
Pages: 318-325
Number of pages: 8
State: Published - 2012
Externally published: Yes
Event: 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012 - Ottawa, ON, Canada
Duration: May 13 2012 to May 16 2012

Publication series

Name: Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012

Conference

Conference: 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
Country/Territory: Canada
City: Ottawa, ON
Period: 05/13/12 to 05/16/12

Funding

Funder: Seventh Framework Programme
Funder number: 261323

    Keywords

    • Error detection and handling
    • Production distributed systems
    • Workflow execution
