Abstract
The management of data-intensive workflows in globally distributed computing systems, such as those used in high-energy physics, presents significant challenges in scalability, resource allocation, and fault tolerance. Workflow Management Systems (WMS) provide a critical framework for addressing these challenges by automating, monitoring, and optimizing the execution of complex computational tasks across heterogeneous resources. This paper introduces the Production and Distributed Analysis (PanDA) system, a sophisticated WMS developed for the ATLAS experiment at the Large Hadron Collider (LHC). PanDA is engineered to handle the immense data processing and analysis demands of ATLAS, operating on the Worldwide LHC Computing Grid (WLCG), one of the largest distributed computing infrastructures in the world. However, faults and errors occur frequently when workloads are distributed and managed on such a grid, and they take various forms across different sites. Analysis is the first step toward understanding and mitigating these errors. In this work, we analyze the errors that occur across the globally distributed grid, as a stepping stone toward designing effective mitigation strategies.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 968-976 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798400718717 |
| State | Published - Nov 15 2025 |
| Event | 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops, St. Louis, United States. Duration: Nov 16 2025 → Nov 21 2025 |
Publication series
| Name | Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
|---|---|
Conference
| Conference | 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
|---|---|
| Country/Territory | United States |
| City | St. Louis |
| Period | 11/16/25 → 11/21/25 |
Funding
This material is based on work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number DE-SC-0012704 (REDWOOD project). This work was done in collaboration with the distributed computing project within the ATLAS Collaboration. We thank our ATLAS colleagues for their support, particularly the ATLAS Distributed Computing team for its contributions. We would also like to express our deepest gratitude to Prof. Kaushik De at the University of Texas at Arlington.
Keywords
- Workflow management system
- distributed computing
- error analysis