Generalizable Coordination of Large MultiscaleWorkflows: Challenges and Learnings at Scale

Harsh Bhatia, Francesco Di Natale, Joseph Y. Moon, Xiaohua Zhang, Joseph R. Chavez, Fikret Aydin, Chris Stanley, Tomas Oppelstrup, Chris Neale, Sara Kokkila Schumacher, Dong H. Ahn, Stephen Herbein, Timothy S. Carpenter, Sandrasegaram Gnanakaran, Peer Timo Bremer, James N. Glosli, Felice C. Lightstone, Helgi I. Ingolfsson

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

15 Scopus citations

Abstract

The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.

Original languageEnglish
Title of host publicationProceedings of SC 2021
Subtitle of host publicationThe International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond
PublisherIEEE Computer Society
ISBN (Electronic)9781450384421
DOIs
StatePublished - Nov 14 2021
Externally publishedYes
Event33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021 - Virtual, Online, United States
Duration: Nov 14 2021Nov 19 2021

Publication series

NameInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print)2167-4329
ISSN (Electronic)2167-4337

Conference

Conference33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021
Country/TerritoryUnited States
CityVirtual, Online
Period11/14/2111/19/21

Funding

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health (NIH). We thank the entire JDACS4C Pilot 2 team, particularly the Pilot 2 leads Fred Streitz and Dwight V. Nissley, for their support and helpful discussion. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. For computing time, we thank the Advanced Scientific Computing Research Leadership Computing Challenge (ALCC) for time on Summit and Livermore Computing (LC) and Livermore Institutional Grand Challenge for time on Lassen. For computing support, we thank OLCF and LC staff. For data management support, we thank Bruce D’Amora, Lars Schneidenbach, Claudia Misale, and Carlos Costa.

FundersFunder number
National Institutes of Health
U.S. Department of Energy
National Cancer Institute
Office of Science
Lawrence Livermore National LaboratoryDE-AC52-07NA27344
Oak Ridge National LaboratoryDE-AC05-00OR22725
Los Alamos National LaboratoryDE-AC5206NA25396

    Keywords

    • adaptive simulations
    • cancer research
    • heterogenous architecture
    • machine learning
    • massively parallel
    • multiscale simulations

    Fingerprint

    Dive into the research topics of 'Generalizable Coordination of Large MultiscaleWorkflows: Challenges and Learnings at Scale'. Together they form a unique fingerprint.

    Cite this