Supporting the development of resilient message passing applications using simulation

Research output: Contribution to conference, Paper, peer-review

10 Scopus citations

Abstract

An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. This paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of Message Passing Interface (MPI) applications on future HPC architectures, with the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating both the performance under failure and the failure handling of ABFT solutions. The enhanced xSim is the first performance tool that supports ULFM and ABFT.

Original language: English
Pages: 271-278
Number of pages: 8
State: Published - 2014
Event: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2014 - Turin, Italy
Duration: Feb 12 2014 to Feb 14 2014

Conference

Conference: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2014
Country/Territory: Italy
City: Turin
Period: 02/12/14 to 02/14/14

Funding

Funders: Oak Ridge National Laboratory

Keywords

• Algorithm-based Fault Tolerance
• High-performance Computing
• Message Passing Interface
• Parallel Discrete Event Simulation
• Performance Prediction
