Abstract
Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures, and the performance impact of different HPC architecture choices, is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The xSim toolkit strives to limit simulation overheads in order to meet performance and productivity criteria. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management cost. These enhancements resulted in significant performance improvements. The simulation overhead for running the NASA Advanced Supercomputing Parallel Benchmark suite dropped from 1,020% to 238% for the conjugate gradient benchmark and from 102% to 0% for the embarrassingly parallel benchmark. The improvements also reduced overheads in the highly accurate simulation mode of xSim, which is useful in resilience studies for tracking intentional MPI process failures. In the highly accurate mode, the simulation overhead was reduced from 37,511% to 13,808% for conjugate gradient and from 3,332% to 204% for embarrassingly parallel.
Original language | English |
---|---|
Pages (from-to) | 3369-3389 |
Number of pages | 21 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 28 |
Issue number | 12 |
State | Published - Aug 25 2016 |
Funding
This research is sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the US Department of Energy under contract no. DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under contract no. DE-AC05-00OR22725 with the US Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Funders | Funder number |
---|---|
DOE Public Access Plan | |
US Department of Energy | |
UT-Battelle | |
United States Government | |
U.S. Department of Energy | DE-AC05-00OR22725 |
Oak Ridge National Laboratory | |
Keywords
- high-performance computing
- message passing interface
- parallel discrete event simulation
- performance prediction