Abstract
Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen, and mitigate them when possible. The proposed approach is inspired by the proportional–integral–derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, where the controller reacts by adjusting its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large-scale data-intensive workflows: data storage overload and memory overflow. We developed a simulator, which implements and evaluates simple standalone PID-inspired controllers to autonomously manage data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data, and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence.
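As a rough illustration of the control loop the abstract refers to, the sketch below implements a textbook discrete PID controller and uses its output to decide how many tasks to release when storage usage is the monitored variable. It is not the paper's simulator: the class and function names, the gains, the setpoint, and the per-task output size are all hypothetical values chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (gains and setpoint are assumptions)."""
    kp: float          # proportional gain
    ki: float          # integral gain
    kd: float          # derivative gain
    setpoint: float    # desired value of the monitored variable
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the control signal for one new measurement."""
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def tasks_to_release(signal: float, per_task_output: float, max_tasks: int) -> int:
    """Map a control signal (in storage units) to a number of tasks to start.

    A non-positive signal means usage is at or above the setpoint, so no
    new tasks are released. This mapping is a hypothetical policy.
    """
    if signal <= 0:
        return 0
    return min(max_tasks, int(signal // per_task_output))


if __name__ == "__main__":
    # Hypothetical scenario: keep storage usage near 80 units; each task
    # writes 3 units per step and cleanup frees 10 units per step.
    ctrl = PIDController(kp=0.5, ki=0.0, kd=0.0, setpoint=80.0)
    usage = 0.0
    for step in range(5):
        signal = ctrl.update(usage)
        n = tasks_to_release(signal, per_task_output=3.0, max_tasks=10)
        usage = max(0.0, usage + 3.0 * n - 10.0)
        print(step, n, usage)
```

In this sketch a positive signal means there is headroom below the setpoint, and a negative signal means the workflow should be throttled. A memory controller could follow the same pattern with memory usage as the measured variable.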
Original language | English |
---|---|
Pages (from-to) | 615-628 |
Number of pages | 14 |
Journal | Future Generation Computer Systems |
Volume | 95 |
DOIs | |
State | Published - Jun 2019 |
Externally published | Yes |
Funding
This work was funded by DOE, USA, contract number #DESC0012636, “Panorama—Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”. This work was carried out when Rosa Filgueira worked for the University of Edinburgh, and was funded by the Postdoctoral and Early Career Researcher Exchanges (PECE) fellowship of the Scottish Informatics and Computer Science Alliance (SICSA), UK, in 2016. Erola Pairo-Castineira was supported by the Wellcome Trust-University of Edinburgh Institutional Strategic Support Fund, UK (to Ian M. Overton).

Rosa Filgueira, Ph.D., recently joined the British Geological Survey (BGS) as a Senior Data Scientist. Previously, she worked as a Research Associate in the Data Intensive Research Group of the University of Edinburgh and as a Research and Teaching Assistant in the Computer Architecture Group of University Carlos III Madrid. Her research expertise is in improving the scalability and performance of HPC applications, and she has contributed to several European and national projects in hazard forecasting and parallel processing. During the VERCE project she contributed to the design and optimization of dispel4py and pioneered several dispel4py applications. Currently, she is leading requirements capture for the ENVRIplus project (funded by EU Horizon2020), which delivers common data functionality for 22 pan-European Research Infrastructures.
Funders | Funder number |
---|---|
EU Horizon2020 | |
Scottish Informatics and Computer Science Alliance | |
U.S. Department of Energy | #DESC0012636, 0012636 |
Wellcome Trust | |
Engineering and Physical Sciences Research Council | EP/F057695/1 |
Natural Environment Research Council | bgs05014 |
University of Edinburgh | |
Keywords
- Autonomic computing
- Fault detection and handling
- Resilient Big Data workflows
- Scientific workflows