Abstract
Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that run across various HPC schedulers. The PSI/J API is a minimal interface for submitting jobs and monitoring their execution state across multiple commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
Original language | English |
---|---|
Journal | International Journal of High Performance Computing Applications |
State | Accepted/In press - 2025 |
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-826133), Argonne National Laboratory under Contract DE-AC02-06CH11357, and Brookhaven National Laboratory under Contract DE-SC0012704. This research used resources of the OLCF at ORNL, supported by the Office of Science of the U.S. DOE under Contract No. DE-AC05-00OR22725.
Keywords
- ECP
- Exascale
- HPC workflows
- SDK
- middleware building blocks
- workflow applications
- workflow community initiative
- workflow interoperability