TY - GEN
T1 - Diaspora
T2 - 20th IEEE International Conference on e-Science, e-Science 2024
AU - Nicolae, Bogdan
AU - Wozniak, Justin M.
AU - Bicer, Tekin
AU - Nguyen, Hai
AU - Patel, Parth
AU - Pan, Haochen
AU - Gueroudji, Amal
AU - Gonthier, Maxime
AU - Hayot-Sasson, Valerie
AU - Huerta, Eliu
AU - Chard, Kyle
AU - Chard, Ryan
AU - Dorier, Matthieu
AU - Rao, Nageswara S.V.
AU - Al-Najjar, Anees
AU - Corsi, Alessandra
AU - Foster, Ian
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - The need for real-time processing to enable automated decision making and experimental steering has driven a shift from high-performance computing workflows on a centralized system to a distributed approach that integrates remote data sources, edge devices, and diverse compute facilities. Under this paradigm, data can be processed close to the source where it is generated, thus reducing latency and bandwidth usage. System resilience is thus a key challenge, requiring distributed workflows to survive component failures and to meet stringent quality-of-service requirements, which results in the need to mitigate anomalies such as congestion and low availability of resources. To address these challenges, we propose Diaspora, a unified resilience framework that is inspired by event-driven communication patterns used in public clouds. Specifically, we propose an event fabric that extends across sites, facilities, and computations to provide timely, reliable, and accurate information about data, application, and resource status. On top of the event fabric, we build resilience-enabling services that combine QoS-aware data streaming, resilient data views, resilient compute and data resources, and anomaly detection and prediction, all of which collectively enhance workflow resilience for these scientific cases.
KW - anomaly detection and prediction
KW - data streaming
KW - elasticity
KW - high-availability
KW - real-time distributed HPC workflows
KW - resilience
UR - http://www.scopus.com/inward/record.url?scp=85205958844&partnerID=8YFLogxK
U2 - 10.1109/e-Science62913.2024.10678669
DO - 10.1109/e-Science62913.2024.10678669
M3 - Conference contribution
AN - SCOPUS:85205958844
T3 - Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024
BT - Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 September 2024 through 20 September 2024
ER -