A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different HPC architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The xSim toolkit strives to limit simulation overheads in order to maintain performance and productivity criteria. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management cost. These enhancements resulted in significant performance improvements. The simulation overhead for running the NASA Advanced Supercomputing Parallel Benchmark suite dropped from 1,020% to 238% for the conjugate gradient benchmark and from 102% to 0% for the embarrassingly parallel benchmark. The improvements also reduced overheads in the highly accurate simulation mode of xSim, which is useful in resilience studies for tracking intentional MPI process failures. In the highly accurate mode, the simulation overhead was reduced from 37,511% to 13,808% for conjugate gradient and from 3,332% to 204% for embarrassingly parallel.
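To put the reported overheads in perspective, the following minimal sketch converts each percentage into a run-time multiplier and into the fraction of simulated run time saved. It assumes that overhead is defined as (simulated run time - native run time) / native run time × 100; the abstract does not state the definition, so this is an assumption. The helper names `slowdown_factor` and `relative_reduction` are hypothetical and are not part of xSim.

```python
# Assumption: overhead % = (T_simulated - T_native) / T_native * 100.
# The abstract does not define overhead; this is an illustrative reading.


def slowdown_factor(overhead_percent: float) -> float:
    """Run-time multiplier relative to native execution under the assumed definition."""
    return 1.0 + overhead_percent / 100.0


def relative_reduction(before_percent: float, after_percent: float) -> float:
    """Fraction of simulated run time saved when overhead drops from before to after."""
    return 1.0 - slowdown_factor(after_percent) / slowdown_factor(before_percent)


# Overheads quoted in the abstract: (benchmark, mode) -> (before %, after %).
REPORTED = {
    ("conjugate gradient", "default"): (1020.0, 238.0),
    ("embarrassingly parallel", "default"): (102.0, 0.0),
    ("conjugate gradient", "highly accurate"): (37511.0, 13808.0),
    ("embarrassingly parallel", "highly accurate"): (3332.0, 204.0),
}

if __name__ == "__main__":
    for (benchmark, mode), (before, after) in REPORTED.items():
        print(
            f"{benchmark} ({mode}): "
            f"{slowdown_factor(before):.2f}x -> {slowdown_factor(after):.2f}x, "
            f"{relative_reduction(before, after):.1%} less simulated run time"
        )
```

Under this reading, an overhead of 0% means the simulated run takes as long as the native run, and an overhead of 1,020% corresponds to a run roughly 11 times longer than native.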

Original language: English
Pages (from-to): 3369-3389
Number of pages: 21
Journal: Concurrency and Computation: Practice and Experience
Volume: 28
Issue number: 12
DOIs
State: Published - Aug 25 2016

Funding

This research is sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the US Department of Energy under contract no. DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under contract no. DE-AC05-00OR22725 with the US Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number, where listed)
DOE Public Access Plan
US Department of Energy
UT-Battelle
United States Government
U.S. Department of Energy: DE-AC05-00OR22725
Oak Ridge National Laboratory

    Keywords

    • high-performance computing
    • message passing interface
    • parallel discrete event simulation
    • performance prediction
