Graph neural networks for detecting anomalies in scientific workflows

Hongwei Jin, Krishnan Raghavan, George Papadimitriou, Cong Wang, Anirban Mandal, Mariam Kiran, Ewa Deelman, Prasanna Balaprakash

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Identifying and addressing anomalies in complex, distributed systems is a challenge for the reliable execution of scientific workflows. We model these workflows as directed acyclic graphs (DAGs), where the nodes and edges of the DAGs represent jobs and their dependencies, respectively. We develop graph neural networks (GNNs) to learn patterns in the DAGs and to detect anomalies at the node (job) and graph (workflow) levels. We investigate workflow-specific GNN models that are trained on a particular workflow and workflow-agnostic GNN models that are trained across the workflows. Our GNN models, which incorporate both individual job features and topological information from the workflow, show improved accuracy and efficiency compared to conventional learning methods for detecting anomalies. When jointly trained on multiple scientific workflows, our GNN models reached an accuracy of more than 80% for workflow-level and 75% for job-level anomalies. In addition, we illustrate the importance of hyperparameter tuning, which can significantly improve the metrics used to evaluate the GNN models. Finally, we integrate explainable GNN methods to provide insights into the job features in the workflow that cause an anomaly.
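
The DAG representation described in the abstract can be illustrated with a small sketch. The job names, the two feature columns, and the unweighted mean aggregation below are all assumptions made for illustration; they are not the authors' model, which uses trained GNN layers rather than a fixed average. The sketch only shows how a job's own features can be combined with topological information from the jobs it depends on.

```python
from collections import defaultdict

# Hypothetical toy workflow. Each job (node) carries a feature vector,
# assumed here to be [runtime_seconds, bytes_read].
job_features = {
    "stage_in": [2.0, 10.0],
    "process_a": [5.0, 4.0],
    "process_b": [50.0, 4.0],  # unusually long runtime
    "merge": [3.0, 8.0],
}

# Directed edges (parent, child): the child job depends on the parent job.
dependencies = [
    ("stage_in", "process_a"),
    ("stage_in", "process_b"),
    ("process_a", "merge"),
    ("process_b", "merge"),
]


def mean_aggregate(features, edges):
    """Run one round of unweighted mean aggregation over parent jobs.

    For each job, return its own features concatenated with the mean of
    its parents' features (zeros when the job has no parents).
    """
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)

    dim = len(next(iter(features.values())))
    combined = {}
    for job, own in features.items():
        preds = parents[job]
        if preds:
            agg = [sum(features[p][i] for p in preds) / len(preds)
                   for i in range(dim)]
        else:
            agg = [0.0] * dim
        combined[job] = own + agg
    return combined


embeddings = mean_aggregate(job_features, dependencies)
```

In this toy example the vector for `merge` reflects the long runtime of `process_b` through the averaged parent features, which is the kind of neighborhood signal a job-level classifier could use. A learned GNN would replace the fixed mean with trainable transformations and stack several such rounds.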

Original language: English
Pages (from-to): 394-411
Number of pages: 18
Journal: International Journal of High Performance Computing Applications
Volume: 37
Issue number: 3-4
State: Published - Jul 2023

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by the Department of Energy under the Integrated Computational and Data Infrastructure (ICDI) for Scientific Discovery, grant #DE-SC0022328. Experimental data was collected on the ExoGENI testbed supported by NSF. This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357.

Keywords

  • Anomaly detection
  • explainable predictions
  • graph neural networks
  • hyperparameter tuning
  • machine learning
  • scientific workflows
