TY - JOUR
T1 - Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures
AU - Kilic, Ozgur O.
AU - Park, David K.
AU - Ren, Yihui
AU - Korchuganova, Tatiana
AU - Vatsavai, Sairam Sri
AU - Boudreau, Joseph
AU - Chowdhury, Tasnuva
AU - Feng, Shengyu
AU - Khan, Raees
AU - Kim, Jaehyung
AU - Klasky, Scott
AU - Maeno, Tadashi
AU - Nilsson, Paul
AU - Outschoorn, Verena Ingrid Martinez
AU - Podhorszki, Norbert
AU - Suter, Frédéric
AU - Yang, Wei
AU - Yang, Yiming
AU - Yoo, Shinjae
AU - Klimentov, Alexei
AU - Hoisie, Adolfy
N1 - Publisher Copyright: © The Authors, published by EDP Sciences.
PY - 2025/10/7
Y1 - 2025/10/7
N2 - Large-scale scientific collaborations like ATLAS, Belle II, CMS, DUNE, and others involve hundreds of research institutes and thousands of researchers spread across the globe. These experiments generate petabytes of data, with volumes soon expected to reach exabytes. Consequently, there is a growing need for computation, including structured data processing from raw data to consumer-ready derived data, extensive Monte Carlo simulation campaigns, and a wide range of end-user analysis. To manage these computational and storage demands, centralized workflow and data management systems are implemented. However, decisions regarding data placement and payload allocation are often made disjointly and via heuristic means. A significant obstacle in adopting more effective heuristic or AI-driven solutions is the absence of a quick and reliable introspective dynamic model to evaluate and refine alternative approaches. In this study, we aim to develop such an interactive system using real-world data. By examining job execution records from the PanDA workflow management system, we have pinpointed key performance indicators such as queuing time, error rate, and the extent of remote data access. The dataset includes five months of activity. Additionally, we are creating a generative AI model to simulate time series of payloads, which incorporate visible features like category, event count, and submitting group, as well as hidden features like the total computational load, derived from existing PanDA records and computing site capabilities. These hidden features, which are not visible to job allocators, whether heuristic or AI-driven, influence factors such as queuing times and data movement.
AB - Large-scale scientific collaborations like ATLAS, Belle II, CMS, DUNE, and others involve hundreds of research institutes and thousands of researchers spread across the globe. These experiments generate petabytes of data, with volumes soon expected to reach exabytes. Consequently, there is a growing need for computation, including structured data processing from raw data to consumer-ready derived data, extensive Monte Carlo simulation campaigns, and a wide range of end-user analysis. To manage these computational and storage demands, centralized workflow and data management systems are implemented. However, decisions regarding data placement and payload allocation are often made disjointly and via heuristic means. A significant obstacle in adopting more effective heuristic or AI-driven solutions is the absence of a quick and reliable introspective dynamic model to evaluate and refine alternative approaches. In this study, we aim to develop such an interactive system using real-world data. By examining job execution records from the PanDA workflow management system, we have pinpointed key performance indicators such as queuing time, error rate, and the extent of remote data access. The dataset includes five months of activity. Additionally, we are creating a generative AI model to simulate time series of payloads, which incorporate visible features like category, event count, and submitting group, as well as hidden features like the total computational load, derived from existing PanDA records and computing site capabilities. These hidden features, which are not visible to job allocators, whether heuristic or AI-driven, influence factors such as queuing times and data movement.
UR - https://www.scopus.com/pages/publications/105019705677
U2 - 10.1051/epjconf/202533701082
DO - 10.1051/epjconf/202533701082
M3 - Conference article
AN - SCOPUS:105019705677
SN - 2101-6275
VL - 337
JO - EPJ Web of Conferences
JF - EPJ Web of Conferences
M1 - 01082
T2 - 27th International Conference on Computing in High Energy and Nuclear Physics, CHEP 2024
Y2 - 19 October 2024 through 25 October 2024
ER -