FIdelity: Efficient resilience analysis framework for deep learning accelerators

Yi He, Prasanna Balaprakash, Yanjing Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

48 Scopus citations

Abstract

We present a resilience analysis framework, called FIdelity, to accurately and quickly analyze the behavior of hardware errors in deep learning accelerators. Our framework enables resilience analysis starting from the very beginning of the design process to ensure that reliability requirements are met, so that these accelerators can be safely deployed for a wide range of applications, including safety-critical applications such as self-driving cars. Existing resilience analysis techniques suffer from the following limitations: (1) general-purpose hardware techniques can achieve accurate results, but they require access to RTL to perform time-consuming RTL simulations, which is not feasible for early design exploration; (2) general-purpose software techniques can produce results quickly, but they are highly inaccurate; and (3) techniques targeting deep learning accelerators focus only on memory errors. Our FIdelity framework overcomes these limitations. FIdelity requires only a minimal amount of high-level design information that can be obtained from architectural descriptions or block diagrams, or estimated and varied for sensitivity analysis. By leveraging unique architectural properties of deep learning accelerators, we are able to systematically model a major class of hardware errors, namely transient errors in logic components, in software with high fidelity. Therefore, FIdelity is both quick and accurate, and does not require access to RTL. We thoroughly validate the FIdelity framework using NVDLA, Nvidia's open-source accelerator: across 60K fault injection experiments, the software fault models derived using FIdelity closely match the behaviors observed in RTL simulations. Using the validated FIdelity framework, we perform a large-scale resilience study on NVDLA, consisting of 46M fault injection experiments running various representative deep neural network applications. We report the key findings and architectural insights, which can be used to guide the design of future accelerators.
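To give a sense of what software-level fault injection looks like in practice, the sketch below flips a single bit in an intermediate tensor value to emulate a transient error in an accelerator datapath. This is only a minimal, generic illustration of the idea, not FIdelity's actual fault model; the function and variable names (inject_bit_flip, layer_output) are hypothetical, and FIdelity additionally maps hardware error sites to software-visible effects using architectural information.

import numpy as np

def inject_bit_flip(tensor, index, bit):
    # Flip one bit of a single float32 element to emulate a transient error.
    # Generic fault-injection pattern for illustration only.
    assert tensor.dtype == np.float32 and 0 <= bit < 32
    flat = tensor.ravel().copy()
    # Reinterpret the chosen element's bytes as uint32, flip one bit,
    # then return the corrupted tensor in its original shape.
    flat[index:index + 1].view(np.uint32)[0] ^= np.uint32(1 << bit)
    return flat.reshape(tensor.shape)

# Example: corrupt one activation value in a hypothetical layer output.
layer_output = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
faulty_output = inject_bit_flip(layer_output, index=5, bit=30)
print(layer_output.ravel()[5], "->", faulty_output.ravel()[5])

Frameworks of this kind then classify each injection outcome (for example masked, tolerable output deviation, or misclassification) across many runs to estimate resilience.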

Original language: English
Title of host publication: Proceedings - 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
Publisher: IEEE Computer Society
Pages: 270-281
Number of pages: 12
ISBN (Electronic): 9781728173832
DOIs
State: Published - Oct 2020
Externally published: Yes
Event: 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020 - Virtual, Athens, Greece
Duration: Oct 17 2020 - Oct 21 2020

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume: 2020-October
ISSN (Print): 1072-4451

Conference

Conference: 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
Country/Territory: Greece
City: Virtual, Athens
Period: 10/17/20 - 10/21/20

Bibliographical note

Publisher Copyright:
© 2020 IEEE Computer Society. All rights reserved.
