End-to-end online performance data capture and analysis for scientific workflows

George Papadimitriou, Cong Wang, Karan Vahi, Rafael Ferreira da Silva, Anirban Mandal, Zhengchun Liu, Rajiv Mayani, Mats Rynge, Mariam Kiran, Vickie E. Lynch, Rajkumar Kettimuthu, Ewa Deelman, Jeffrey S. Vetter, Ian Foster

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

With workflows increasingly used for scientific computing and a push towards exascale computing, it has become paramount to analyze the characteristics of scientific applications in order to understand their impact on the underlying infrastructure and vice versa. Such analysis can help drive the design, development, and optimization of next-generation systems and solutions. In this paper, we present an architecture, integrated with existing well-established and newly developed tools, that collects online performance statistics of workflow executions from various heterogeneous sources and publishes them in a distributed database (Elasticsearch). Using this architecture, we are able to correlate online workflow performance data with data from the underlying infrastructure and present them in a useful and intuitive way via an online dashboard. We have validated our approach by executing two classes of real-world workflows under both normal and anomalous conditions. The first is an I/O-intensive genome analysis workflow; the second is a CPU- and memory-intensive material science workflow. Based on the data collected in Elasticsearch, we demonstrate that we can correctly identify the anomalies that we injected. The resulting end-to-end collection of workflow performance data is an important source of training data for automated machine learning analysis.

Original language: English
Pages (from-to): 387-400
Number of pages: 14
Journal: Future Generation Computer Systems
Volume: 117
State: Published - Apr 2021

Funding

This work was funded by DOE, USA contract #DESC0012636, “Panorama: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”. Additionally, this work was funded by the U.S. Department of Energy, Office of Science under contract DE-AC02-06CH11357, “RAMSES: The Robust Analytic Models for Science at Extreme Scales”. We thank G. Juve and D. Król for their contributions on extending pegasus-kickstart and enabling online monitoring. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract #DE-AC02-05CH11231.

Keywords

  • Extreme scale
  • Online performance monitoring
  • Scientific workflows
