Abstract
The need for real-time processing to enable automated decision making and experimental steering has driven a shift from high-performance computing workflows on a centralized system to a distributed approach that integrates remote data sources, edge devices, and diverse compute facilities. Under this paradigm, data can be processed close to where it is generated, which reduces latency and bandwidth usage. System resilience becomes a key challenge in this setting: distributed workflows must survive component failures and meet stringent quality-of-service (QoS) requirements, which in turn requires mitigating anomalies such as congestion and low resource availability. To address these challenges, we propose Diaspora, a unified resilience framework inspired by the event-driven communication patterns used in public clouds. Specifically, we propose an event fabric that extends across sites, facilities, and computations to provide timely, reliable, and accurate information about data, application, and resource status. On top of the event fabric, we build resilience-enabling services that combine QoS-aware data streaming, resilient data views, resilient compute and data resources, and anomaly detection and prediction, which collectively enhance the resilience of real-time distributed scientific workflows.
Original language | English |
---|---|
Title of host publication | Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9798350365610 |
State | Published - 2024 |
Event | 20th IEEE International Conference on e-Science, e-Science 2024, Osaka, Japan. Duration: Sep 16 2024 → Sep 20 2024 |
Publication series
Name | Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024 |
---|---|
Conference
Conference | 20th IEEE International Conference on e-Science, e-Science 2024 |
---|---|
Country/Territory | Japan |
City | Osaka |
Period | 09/16/24 → 09/20/24 |
Funding
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contracts DE-AC02-06CH11357 and DE-AC02-05CH11231.
Keywords
- anomaly detection and prediction
- data streaming
- elasticity
- high-availability
- real-time distributed HPC workflows
- resilience