A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

Sparsh Mittal, Jeffrey S. Vetter

Research output: Contribution to journal › Article › peer-review

63 Scopus citations

Abstract

Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems. This has made 'reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, caches and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. We believe that this survey will help researchers, system architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.

Original language: English
Article number: 7094277
Pages (from-to): 1226-1238
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 4
State: Published - Apr 1 2016

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.

Keywords

  • Review
  • architectural techniques
  • architectural vulnerability factor
  • classification
  • fault-tolerance
  • reliability
  • resilience
  • soft/transient error
  • vulnerability
