End-to-end online performance data capture and analysis for scientific workflows

  • George Papadimitriou
  • Cong Wang
  • Karan Vahi
  • Rafael Ferreira da Silva
  • Anirban Mandal
  • Zhengchun Liu
  • Rajiv Mayani
  • Mats Rynge
  • Mariam Kiran
  • Vickie E. Lynch
  • Rajkumar Kettimuthu
  • Ewa Deelman
  • Jeffrey S. Vetter
  • Ian Foster

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

With the increased prevalence of workflows in scientific computing and the push towards exascale computing, it has become paramount that we are able to analyze the characteristics of scientific applications to better understand their impact on the underlying infrastructure and vice versa. Such analysis can help drive the design, development, and optimization of these next-generation systems and solutions. In this paper, we present an architecture, integrated with existing well-established and newly developed tools, that collects online performance statistics of workflow executions from various heterogeneous sources and publishes them in a distributed database (Elasticsearch). Using this architecture, we are able to correlate online workflow performance data with data from the underlying infrastructure and present them in a useful and intuitive way via an online dashboard. We have validated our approach by executing two classes of real-world workflows, both under normal and anomalous conditions. The first is an I/O-intensive genome analysis workflow; the second is a CPU- and memory-intensive material science workflow. Based on the data collected in Elasticsearch, we demonstrate that we can correctly identify the anomalies that we injected. The resulting end-to-end collection of workflow performance data is an important source of training data for automated machine learning analysis.

Original language: English
Pages (from-to): 387-400
Number of pages: 14
Journal: Future Generation Computer Systems
Volume: 117
State: Published - Apr 2021

Funding

This work was funded by DOE, USA, under contract #DESC0012636, “Panorama: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”. Additionally, this work was funded by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357, “RAMSES: The Robust Analytic Models for Science at Extreme Scales”. We thank G. Juve and D. Król for their contributions to extending pegasus-kickstart and enabling online monitoring. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract #DE-AC02-05CH11231.

Keywords

  • Extreme scale
  • Online performance monitoring
  • Scientific workflows
