TY - GEN
T1 - Generalizable Coordination of Large Multiscale Workflows
T2 - 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021
AU - Bhatia, Harsh
AU - Di Natale, Francesco
AU - Moon, Joseph Y.
AU - Zhang, Xiaohua
AU - Chavez, Joseph R.
AU - Aydin, Fikret
AU - Stanley, Chris
AU - Oppelstrup, Tomas
AU - Neale, Chris
AU - Schumacher, Sara Kokkila
AU - Ahn, Dong H.
AU - Herbein, Stephen
AU - Carpenter, Timothy S.
AU - Gnanakaran, Sandrasegaram
AU - Bremer, Peer-Timo
AU - Glosli, James N.
AU - Lightstone, Felice C.
AU - Ingolfsson, Helgi I.
N1 - Publisher Copyright: © 2021 IEEE Computer Society. All rights reserved.
PY - 2021/11/14
Y1 - 2021/11/14
N2 - The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.
KW - adaptive simulations
KW - cancer research
KW - heterogeneous architecture
KW - machine learning
KW - massively parallel
KW - multiscale simulations
UR - http://www.scopus.com/inward/record.url?scp=85119976678&partnerID=8YFLogxK
U2 - 10.1145/3458817.3476210
DO - 10.1145/3458817.3476210
M3 - Conference contribution
AN - SCOPUS:85119976678
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2021
PB - IEEE Computer Society
Y2 - 14 November 2021 through 19 November 2021
ER -