Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Rafael Ferreira da Silva, Rosa Filgueira, Ewa Deelman, Erola Pairo-Castineira, Ian M. Overton, Malcolm P. Atkinson

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen, and mitigate them when possible. The proposed approach is inspired by the proportional–integral–derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, where the controller reacts by adjusting its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large-scale data-intensive workflows: data storage overload and memory overflow. We developed a simulator, which implements and evaluates simple standalone PID-inspired controllers to autonomously manage data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data, and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence.
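The control loop the abstract refers to can be illustrated with a textbook discrete PID controller. The sketch below is not the authors' simulator or controller: the class name `PIDController`, the function `simulate`, the gains, the 80% setpoint, and the toy rule that usage changes by exactly the controller output at each step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller; gains are illustrative assumptions."""
    kp: float                 # proportional gain
    ki: float                 # integral gain
    kd: float                 # derivative gain
    setpoint: float           # desired resource usage (same unit as the measurement)
    integral: float = 0.0     # accumulated error
    prev_error: float = 0.0   # error seen at the previous step

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the control output for one time step."""
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(capacity: float = 100.0, steps: int = 50) -> list:
    """Toy model (assumption): usage changes by the controller output each step."""
    controller = PIDController(kp=0.5, ki=0.05, kd=0.1, setpoint=0.8 * capacity)
    usage = 0.0
    history = []
    for _ in range(steps):
        output = controller.update(usage)
        # Positive output: there is headroom, so more work may be admitted.
        # Negative output: usage is above the setpoint, so work is throttled.
        usage = min(capacity, max(0.0, usage + output))
        history.append(usage)
    return history


if __name__ == "__main__":
    trace = simulate()
    print(f"peak usage: {max(trace):.1f}, final usage: {trace[-1]:.1f}")
```

In this toy setting the setpoint sits below the hard capacity, so the controller starts throttling before the limit is reached. That mirrors the idea of acting ahead of a fault, although the actual controllers, inputs, and actions are those described in the paper itself.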

Original language: English
Pages (from-to): 615-628
Number of pages: 14
Journal: Future Generation Computer Systems
Volume: 95
State: Published - Jun 2019
Externally published: Yes

Funding

This work was funded by DOE, USA, contract number #DESC0012636, “Panorama: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”. This work was carried out while Rosa Filgueira worked at the University of Edinburgh, supported by the Postdoctoral and Early Career Researcher Exchanges (PECE) fellowship of the Scottish Informatics and Computer Science Alliance (SICSA), UK, in 2016. Erola Pairo-Castineira was supported by the Wellcome Trust-University of Edinburgh Institutional Strategic Support Fund, UK (to Ian M. Overton).

Rosa Filgueira, Ph.D., recently joined the British Geological Survey (BGS) as a Senior Data Scientist. Previously, she was a Research Associate in the Data Intensive Research Group of the University of Edinburgh and a Research and Teaching Assistant in the Computer Architecture Group of University Carlos III Madrid. Her research expertise is in improving the scalability and performance of HPC applications, and she has contributed to several European and national projects in hazard forecasting and parallel processing. During the VERCE project she contributed to the design and optimization of dispel4py and pioneered several dispel4py applications. Currently, she is leading requirements capture for the ENVRIplus project (funded by EU Horizon2020), which delivers common data functionality for 22 pan-European Research Infrastructures.

Funders (funder number, where given)

EU Horizon2020
Scottish Informatics and Computer Science Alliance
U.S. Department of Energy: #DESC0012636, 0012636
Wellcome Trust
Engineering and Physical Sciences Research Council: EP/F057695/1
Natural Environment Research Council: bgs05014
University of Edinburgh

Keywords

• Autonomic computing
• Fault detection and handling
• Resilient Big Data workflows
• Scientific workflows
