TY - GEN
T1 - Running Ensemble Workflows at Extreme Scale
T2 - 18th IEEE International Conference on e-Science, eScience 2022
AU - Mehta, Kshitij
AU - Cliff, Ashley
AU - Suter, Frederic
AU - Walker, Angelica M.
AU - Wolf, Matthew
AU - Jacobson, Daniel
AU - Klasky, Scott
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The ever-increasing volumes of scientific data, combined with sophisticated techniques for extracting information from them, have led to the increasing popularity of ensemble workflows, which are collections of runs of individual workflows. A traditional approach followed by scientists to run ensembles is to rely on simple scripts to execute different runs and manage resources. This approach is not scalable and is error-prone, thereby motivating the development of workflow management systems that specialize in executing ensembles on HPC clusters. However, when the sizes of both the ensemble and the target system reach extreme scales, existing workflow management systems face new challenges that hamper their efficient execution. In this paper, we describe our experience scaling an ensemble workflow from the computational biology domain, from the early design stages to execution at extreme scale on Summit, a leadership-class supercomputer at the Oak Ridge National Laboratory. We discuss challenges that arise when scaling ensembles to several million runs on thousands of HPC nodes. We identify challenges with composition of the ensemble itself, its execution at large scale, post-processing of the generated data, and scalability of the file system. Based on the experience acquired, we develop a generic vision of the capabilities and abstractions to add to existing workflow management systems to enable the execution of ensemble workflows at extreme scales. We believe that an understanding of these fundamental challenges will help application teams and workflow system developers design the next generation of infrastructure for composing and executing extreme-scale ensemble workflows.
KW - HPC
KW - ensemble
KW - extreme scale
KW - workflows
UR - http://www.scopus.com/inward/record.url?scp=85145435557&partnerID=8YFLogxK
U2 - 10.1109/eScience55777.2022.00042
DO - 10.1109/eScience55777.2022.00042
M3 - Conference contribution
AN - SCOPUS:85145435557
T3 - Proceedings - 2022 IEEE 18th International Conference on e-Science, eScience 2022
SP - 284
EP - 294
BT - Proceedings - 2022 IEEE 18th International Conference on e-Science, eScience 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 10 October 2022 through 14 October 2022
ER -