The case for modular redundancy in large-scale high performance computing systems

Christian Engelmann, Hong Ong, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

56 Scopus citations

Abstract

Recent investigations into the resilience of large-scale high-performance computing (HPC) systems have shown a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean time to failure (MTTF) and a higher mean time to recover (MTTR) than their predecessors. Modular redundancy is used in many mission-critical systems today to provide resilience, for example in aerospace and command and control systems. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and the respective return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability as it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available, if their MTTR/MTTF ratio is maintained.
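
The availability argument in the abstract can be illustrated with a small sketch. It assumes the standard steady-state model, availability = MTTF / (MTTF + MTTR), and independent node failures, so that a group of redundant nodes is unavailable only when all replicas are down at once. These modelling assumptions, the function names, and the MTTF/MTTR figures are illustrative and are not taken from the paper.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one node (assumed textbook model)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(node_availability: float, replicas: int) -> float:
    """Availability of a group of replicas, assuming independent failures.

    The group is unavailable only if every replica is down simultaneously.
    """
    return 1.0 - (1.0 - node_availability) ** replicas


# Hypothetical figures: a reliable node and a node that fails ten times as
# often but also recovers ten times faster, so MTTR/MTTF is the same (0.01).
reliable_node = availability(mttf_hours=5000.0, mttr_hours=50.0)
cheap_node = availability(mttf_hours=500.0, mttr_hours=5.0)

# Same ratio, same single-node availability.
same_availability = math.isclose(reliable_node, cheap_node)

# Dual and triple redundancy built from the less reliable node.
dual = redundant_availability(cheap_node, replicas=2)
triple = redundant_availability(cheap_node, replicas=3)

if __name__ == "__main__":
    print(f"single node: {cheap_node:.6f}")
    print(f"dual:        {dual:.6f}")
    print(f"triple:      {triple:.6f}")
```

Under these assumptions, single-node availability depends only on the MTTR/MTTF ratio, and each added replica multiplies the unavailability by the single-node unavailability. Correlated failures, which this sketch ignores, would reduce the benefit.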

Original language: English
Title of host publication: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009
Pages: 189-194
Number of pages: 6
State: Published - 2009
Event: IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009 - Innsbruck, Austria
Duration: Feb 16 2009 to Feb 18 2009

Publication series

Name: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009

Conference

Conference: IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009
Country/Territory: Austria
City: Innsbruck
Period: 02/16/09 to 02/18/09

Keywords

  • Fault tolerance
  • High availability
  • High-performance computing
  • Modular redundancy
  • Reliability
