Job Management with mpi_jm

Evan Berkowitz, Gustav Jansen, Kenneth McElvain, André Walker-Loud

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Access to Leadership computing is required both for HPC applications that need a large fraction of the compute nodes for a single computation and for use cases where the volume of smaller tasks can only be completed in a competitive or reasonable time frame at these Leadership computing facilities. In the latter case, a robust and lightweight manager is ideal so that all these tasks can be computed in a machine-friendly way, notably with minimal use of mpirun or equivalent to launch the executables (simple bundling of tasks can over-tax the service nodes and crash the entire scheduler). Our library, mpi_jm, can manage such allocations, provided the requisite MPI functionality is available. mpi_jm is fault-tolerant against a modest number of down or non-communicative nodes, can begin executing work on smaller portions of a larger allocation before all nodes become available, can manage GPU-intensive and CPU-only work independently and can overlay them peacefully on shared nodes. It is easily incorporated into existing MPI-capable executables, which can then run both independently and under mpi_jm management. It provides a flexible Python interface, unlocking many high-level libraries, while also tightly binding users' executables to hardware.
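The abstract's central scheduling idea, overlaying GPU-intensive and CPU-only tasks on shared nodes so both resource types stay busy, can be illustrated with a toy sketch. This is not mpi_jm's actual API; all class and function names below are hypothetical, and the greedy placement shown is only a simplified stand-in for whatever policy mpi_jm really uses.

```python
# Toy illustration (NOT mpi_jm's API): co-scheduling GPU-intensive and
# CPU-only tasks on shared nodes. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    gpus_free: int
    cpus_free: int
    tasks: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    gpus: int  # 0 marks CPU-only work
    cpus: int

def schedule(tasks, nodes):
    """Greedily place each task on the first node with enough free
    resources; GPU and CPU-only tasks may share a node."""
    waiting = []
    for task in tasks:
        for node in nodes:
            if node.gpus_free >= task.gpus and node.cpus_free >= task.cpus:
                node.gpus_free -= task.gpus
                node.cpus_free -= task.cpus
                node.tasks.append(task.name)
                break
        else:
            waiting.append(task.name)  # held until resources free up
    return waiting

nodes = [Node("node0", gpus_free=6, cpus_free=42),
         Node("node1", gpus_free=6, cpus_free=42)]
tasks = [Task("solve-A", gpus=6, cpus=6),      # GPU-intensive task
         Task("analyze-A", gpus=0, cpus=36),   # CPU-only task
         Task("solve-B", gpus=6, cpus=6)]
waiting = schedule(tasks, nodes)
print(nodes[0].tasks)  # GPU and CPU-only work overlaid on node0
print(waiting)
```

In this run "solve-A" and "analyze-A" land together on node0, using its GPUs and leftover CPU cores respectively, while "solve-B" falls through to node1; the same overlay principle lets mpi_jm keep shared nodes fully utilized.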

Original language: English
Title of host publication: High Performance Computing - ISC High Performance 2018 International Workshops, Revised Selected Papers
Editors: Rio Yokota, John Shalf, Sadaf Alam, Michèle Weiland
Publisher: Springer Verlag
Pages: 432-439
Number of pages: 8
ISBN (Print): 9783030024642
DOIs
State: Published - 2018
Event: International Conference on High Performance Computing, ISC High Performance 2018 - Frankfurt, Germany
Duration: Jun 28 2018 → Jun 28 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11203 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Conference on High Performance Computing, ISC High Performance 2018
Country/Territory: Germany
City: Frankfurt
Period: 06/28/18 → 06/28/18

Funding

An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program to CalLat (2016), as well as by the Lawrence Livermore National Laboratory (LLNL) Multiprogrammatic and Institutional Computing program through a Tier 1 Grand Challenge award. This research used the NVIDIA GPU-accelerated Titan and Summit supercomputers at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, and the NVIDIA GPU-accelerated Surface, Ray, and Sierra supercomputers at LLNL. This work was performed under the auspices of the U.S. Department of Energy by LLNL under Contract No. DE-AC52-07NA27344, and under Contract No. DE-AC02-05CH11231, under which the Regents of the University of California manage and operate Lawrence Berkeley National Laboratory and the National Energy Research Scientific Computing Center.

Funders / Funder number
Lawrence Berkeley National Laboratory
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science
Lawrence Livermore National Laboratory: DE-AC02-05CH11231, DE-AC52-07NA27344
National Energy Research Scientific Computing Center

Keywords

• CORAL
• Job management
• Pilot systems
