Error Analysis of Globally Distributed Workflow Management System

  • Sankha Dutta
  • Ozgur Kilic
  • Tatiana Korchuganova
  • Paul Nilsson
  • Sairam Sri Vatsavai
  • Kuan Chieh Hsu
  • David K. Park
  • Joseph Boudreau
  • Tasnuva Chowdhury
  • Feng Shengyu
  • Raees Khan
  • Jaehyung Kim
  • Scott Klasky
  • Tadashi Maeno
  • Verena Ingrid Martinez Outschoorn
  • Norbert Podhorszki
  • Yihui Ren
  • Frédéric Suter
  • Wei Yang
  • Yiming Yang
  • Shinjae Yoo
  • Alexei Klimentov
  • Adolfy Hoisie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

The management of data-intensive workflows in globally distributed computing systems, such as those used in high-energy physics, presents significant challenges in scalability, resource allocation, and fault tolerance. Workflow Management Systems (WMS) provide a critical framework for addressing these challenges by automating, monitoring, and optimizing the execution of complex computational tasks across heterogeneous resources. This paper introduces the Production and Distributed Analysis (PanDA) system, a sophisticated WMS developed for the ATLAS experiment at the Large Hadron Collider (LHC). PanDA is engineered to handle the immense data processing and analysis demands of ATLAS, operating on the Worldwide LHC Computing Grid (WLCG), one of the largest distributed computing infrastructures in the world. However, faults and errors occur frequently when workloads are distributed and managed on such a globally distributed computing grid, and they take various forms across different sites. Analysis is the first step toward understanding and mitigating these errors. In this work, we analyze the errors that occur across the globally distributed grid, as a stepping stone toward designing effective mitigation strategies.

Original language: English
Title of host publication: Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
Publisher: Association for Computing Machinery, Inc
Pages: 968-976
Number of pages: 9
ISBN (Electronic): 9798400718717
State: Published - Nov 15 2025
Event: 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops - St. Louis, United States
Duration: Nov 16 2025 to Nov 21 2025

Publication series

Name: Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops

Conference

Conference: 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops
Country/Territory: United States
City: St. Louis
Period: 11/16/25 to 11/21/25

Funding

This material is based on work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number DE-SC-0012704 (REDWOOD project). This work was done in collaboration with the distributed computing project within the ATLAS Collaboration. We thank our ATLAS colleagues for their support, particularly the ATLAS Distributed Computing team’s contributions. We would also like to express our deepest gratitude to Prof. Kaushik De at the University of Texas at Arlington.

Keywords

  • Workflow management system
  • distributed computing
  • error analysis
