MPI jobs within MPI jobs: A practical way of enabling task-level fault-tolerance in HPC workflows

Justin M. Wozniak, Matthieu Dorier, Robert Ross, Tong Shu, Tahsin Kurc, Li Tang, Norbert Podhorszki, Matthew Wolf

Research output: Contribution to journal › Article › peer-review

Abstract

While the use of workflows for HPC is growing, MPI interoperability remains a challenge for workflow management systems. The MPI standard and/or its implementations provide a number of ways to build multiple-program, multiple-data (MPMD) applications, but these methods have limitations related to fault tolerance and are not easy to use. In this paper, we advocate for a novel MPI_Comm_launch function acting as the parallel counterpart of a system(3) call. MPI_Comm_launch allows a child MPI application to be launched inside the resources originally held by processes of a parent MPI application. Two important aspects of MPI_Comm_launch are that it pauses the calling process and runs the child processes on the parent's CPU cores, while isolating the child from the parent's memory. This function makes it easier to build MPMD applications with well-decoupled subtasks. We show how this feature provides better flexibility and better fault tolerance in ensemble simulations and HPC workflows. We report results showing a 2× throughput improvement for application workflows with faults, and scaling results for challenging workloads on up to 256 nodes.
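
To illustrate the proposed mechanism, the sketch below shows how a parent MPI program might invoke MPI_Comm_launch to run a child MPI application on its own cores and then react to the child's exit status. This is a minimal sketch, not the paper's reference implementation: the MPI_Comm_launch prototype, the child executable name ./ensemble_member, and its arguments are assumptions made for this example; only the standard calls (MPI_Init, MPI_Comm_rank, MPI_Finalize) come from the MPI standard.

    /* Illustrative sketch only. MPI_Comm_launch is the function proposed in
     * this paper; the prototype below is an assumption for this example and
     * requires the paper's prototype library to actually link and run. */
    #include <mpi.h>
    #include <stdio.h>

    /* Assumed prototype: run `cmd argv...` as a child MPI job on the CPU
     * cores of the processes in `comm`, pausing the callers until the child
     * exits, and report the child's exit code at `root`. */
    int MPI_Comm_launch(const char *cmd, char **argv, MPI_Info info, int root,
                        MPI_Comm comm, int *exit_code);

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical child simulation: a failure of the child does not
         * bring down the parent workflow, which can inspect exit_code and
         * reschedule the task. */
        char *child_argv[] = { "--steps", "1000", NULL };
        int exit_code = 0;
        MPI_Comm_launch("./ensemble_member", child_argv, MPI_INFO_NULL,
                        0 /* root */, MPI_COMM_WORLD, &exit_code);

        if (rank == 0 && exit_code != 0)
            fprintf(stderr, "child task failed (exit code %d); rescheduling\n",
                    exit_code);

        MPI_Finalize();
        return 0;
    }

Because the parent is only paused while the child occupies its cores, the parent can treat each child job as a retriable task, which is the basis of the task-level fault tolerance discussed in the paper.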

Original language: English
Pages (from-to): 576-589
Number of pages: 14
Journal: Future Generation Computer Systems
Volume: 101
DOIs
State: Published - Dec 2019

Funding

We thank Pavan Balaji and Rajeev Thakur for their insights about the MPI standard and its implementations, Misbah Mubarak for her help setting up CODES simulations, and the ROSS developers for quickly reacting to our feature requests. This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).

Funders and funder numbers
Laboratory Computing Resource Center
U.S. Department of Energy
Office of Science
National Nuclear Security Administration
Advanced Scientific Computing Research: DE-AC02-06CH11357
Argonne National Laboratory

Keywords

• Cram
• Ensemble simulations
• MPI
• MPI_Comm_launch
• MPMD
• Swift/T
• Workflows
