Diaspora: Resilience-Enabling Services for Real-Time Distributed Workflows

Bogdan Nicolae, Justin M. Wozniak, Tekin Bicer, Hai Nguyen, Parth Patel, Haochen Pan, Amal Gueroudji, Maxime Gonthier, Valerie Hayot-Sasson, Eliu Huerta, Kyle Chard, Ryan Chard, Matthieu Dorier, Nageswara S.V. Rao, Anees Al-Najjar, Alessandra Corsi, Ian Foster

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

The need for real-time processing to enable automated decision making and experimental steering has driven a shift from high-performance computing workflows on a centralized system to a distributed approach that integrates remote data sources, edge devices, and diverse compute facilities. Under this paradigm, data can be processed close to the source where it is generated, thus reducing latency and bandwidth usage. System resilience is thus a key challenge, requiring distributed workflows to survive component failures and to meet stringent quality-of-service requirements, which results in the need to mitigate anomalies such as congestion and low availability of resources. To address these challenges, we propose Diaspora, a unified resilience framework that is inspired by event-driven communication patterns used in public clouds. Specifically, we propose an event fabric that extends across sites, facilities, and computations to provide timely, reliable, and accurate information about data, application, and resource status. On top of the event fabric, we build resilience-enabling services that combine QoS-aware data streaming, resilient data views, resilient compute and data resources, and anomaly detection and prediction, all of which collectively enhance workflow resilience for these scientific cases.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350365610
DOIs
State: Published - 2024
Event: 20th IEEE International Conference on e-Science, e-Science 2024 - Osaka, Japan
Duration: Sep 16 2024 to Sep 20 2024

Publication series

Name: Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024

Conference

Conference: 20th IEEE International Conference on e-Science, e-Science 2024
Country/Territory: Japan
City: Osaka
Period: 09/16/24 to 09/20/24

Keywords

  • anomaly detection and prediction
  • data streaming
  • elasticity
  • high-availability
  • real-time distributed HPC workflows
  • resilience
