Redundant execution of HPC applications with MR-MPI

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

40 Scopus citations

Abstract

This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI, for transparently executing high-performance computing (HPC) applications in a redundant fashion. The presented work addresses the deficiencies of recovery-oriented HPC, i.e., checkpoint/restart to/from a parallel file system, at extreme scale by adding the redundancy approach to the HPC resilience portfolio. It utilizes the MPI performance tool interface, PMPI, to transparently intercept MPI calls from an application and to hide all redundancy-related mechanisms. A redundantly executed application runs with r × m native MPI processes, where r is the number of MPI ranks visible to the application and m is the replication degree. Messages between redundant nodes are replicated. Partial replication for tunable resilience is supported. The performance results clearly show the negative impact of the O(m²) messages between replicas. For low-level, point-to-point benchmarks, the impact can be as high as the replication degree. For applications, performance highly depends on the actual communication types and counts. On single-core systems, the overhead can be 0% for embarrassingly parallel applications independent of the employed redundancy configuration or up to 70-90% for communication-intensive applications in a dual-redundant configuration. On multi-core systems, the overhead can be significantly higher due to the additional communication contention.
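The process and message counts stated in the abstract can be illustrated with a small sketch. The block rank layout (replica k of virtual rank v is native rank v + k·r), the all-to-all delivery between sender and receiver replicas, and every function name below are assumptions made for illustration; the abstract does not specify how MR-MPI actually assigns ranks or routes replicated messages.

```python
# Illustrative arithmetic only: the rank layout and delivery pattern are
# assumptions, not the mapping MR-MPI is documented to use.

def native_process_count(r, m):
    """Native MPI processes needed for r application-visible ranks at replication degree m."""
    return r * m

def to_virtual(native_rank, r):
    """Map a native rank to (virtual_rank, replica_index) under the assumed block layout."""
    return native_rank % r, native_rank // r

def replicas_of(virtual_rank, r, m):
    """List the native ranks that act as replicas of one virtual rank (assumed layout)."""
    return [virtual_rank + k * r for k in range(m)]

def native_messages_per_send(m):
    """Native messages for one application-level send, assuming each of the
    m sender replicas sends to each of the m receiver replicas."""
    return m * m
```

Under these assumptions, a dual-redundant run (m = 2) of an application that sees r = 4 ranks uses 8 native processes, and each application-level point-to-point send turns into 4 native messages, which is consistent with the O(m²) growth the abstract reports.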

Original language: English
Title of host publication: Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2011
Pages: 31-38
Number of pages: 8
State: Published - 2011
Event: 10th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2011 - Innsbruck, Austria
Duration: Feb 15 2011 to Feb 17 2011

Publication series

Name: Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2011

Conference

Conference: 10th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2011
Country/Territory: Austria
City: Innsbruck
Period: 02/15/11 to 02/17/11

Keywords

  • Fault tolerance
  • High-performance computing
  • Message Passing Interface
  • Redundancy
  • Resilience
