TY - GEN
T1 - Self-healing of operational workflow incidents on distributed computing infrastructures
AU - Ferreira Da Silva, Rafael
AU - Glatard, Tristan
AU - Desprez, Frédéric
PY - 2012
Y1 - 2012
N2 - Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
AB - Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
KW - Error detection and handling
KW - Production distributed systems
KW - Workflow execution
UR - http://www.scopus.com/inward/record.url?scp=84863694572&partnerID=8YFLogxK
U2 - 10.1109/CCGrid.2012.24
DO - 10.1109/CCGrid.2012.24
M3 - Conference contribution
AN - SCOPUS:84863694572
SN - 9780769546919
T3 - Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
SP - 318
EP - 325
BT - Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
T2 - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
Y2 - 13 May 2012 through 16 May 2012
ER -