Toward a performance/resilience tool for hardware/software co-design of high-performance computing systems

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

18 Scopus citations

Abstract

xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
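
To illustrate the fault-handling technique named in the abstract, the following is a minimal sketch of application-level checkpoint/restart under injected failures. It is a toy model and not xSim's API or the authors' implementation: the function name `run_with_checkpoint_restart` and its parameters are hypothetical, a failure is assumed to be detected immediately, and checkpointing is assumed to cost nothing.

```python
def run_with_checkpoint_restart(total_steps, checkpoint_interval, failure_steps):
    """Toy model of application-level checkpoint/restart (hypothetical helper).

    Runs `total_steps` iterations, saves a checkpoint every
    `checkpoint_interval` completed steps, and rolls back to the last
    checkpoint whenever an injected failure fires.
    Returns (iterations executed including rework, number of restarts).
    """
    pending_failures = set(failure_steps)  # each injected failure fires once
    checkpoint = 0   # last step count saved to (simulated) stable storage
    step = 0         # number of completed steps
    executed = 0     # iterations actually executed, including repeated work
    restarts = 0

    while step < total_steps:
        if step in pending_failures:
            # Injected failure: discard all progress made since the checkpoint.
            pending_failures.discard(step)
            step = checkpoint
            restarts += 1
            continue
        step += 1
        executed += 1
        if step % checkpoint_interval == 0:
            checkpoint = step  # record progress that survives a failure

    return executed, restarts


if __name__ == "__main__":
    # One failure after 6 completed steps, checkpoints every 4 steps:
    # steps 5 and 6 are lost and must be repeated.
    print(run_with_checkpoint_restart(10, 4, {6}))
```

In this model the cost of a failure is the work done since the last checkpoint, which is the kind of application behavior under failure that the abstract says the new capabilities make observable.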

Original language: English
Title of host publication: Proceedings
Subtitle of host publication: International Conference on Parallel Processing - The 42nd Annual Conference, ICPP 2013
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 960-969
Number of pages: 10
ISBN (Print): 9780769551173
State: Published - 2013
Event: 42nd Annual International Conference on Parallel Processing, ICPP 2013 - Lyon, France
Duration: Oct 1 2013 to Oct 4 2013

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 42nd Annual International Conference on Parallel Processing, ICPP 2013
Country/Territory: France
City: Lyon
Period: 10/1/13 to 10/4/13

Keywords

  • Fault injection
  • High-performance computing
  • Message passing interface
  • Parallel discrete event simulation
