Abstract
We present a resilience analysis framework, called FIdelity, to accurately and quickly analyze the behavior of hardware errors in deep learning accelerators. Our framework enables resilience analysis from the earliest stages of the design process, helping ensure that reliability requirements are met so that these accelerators can be safely deployed in a wide range of applications, including safety-critical ones such as self-driving cars. Existing resilience analysis techniques suffer from the following limitations: (1) general-purpose hardware techniques can achieve accurate results, but they require access to RTL to perform time-consuming RTL simulations, which is not feasible for early design exploration; (2) general-purpose software techniques can produce results quickly, but they are highly inaccurate; (3) techniques targeting deep learning accelerators focus only on memory errors. Our FIdelity framework overcomes these limitations. FIdelity requires only a minimal amount of high-level design information, which can be obtained from architectural descriptions or block diagrams, or estimated and varied for sensitivity analysis. By leveraging unique architectural properties of deep learning accelerators, we are able to systematically model a major class of hardware errors, namely transient errors in logic components, in software with high fidelity. As a result, FIdelity is both quick and accurate, and does not require access to RTL. We thoroughly validate FIdelity using NVIDIA's open-source accelerator, NVDLA, and the results are highly accurate: across 60K fault injection experiments, the software fault models derived using FIdelity closely match the behaviors observed in RTL simulations. Using the validated FIdelity framework, we perform a large-scale resilience study on NVDLA, consisting of 46M fault injection experiments running various representative deep neural network applications.
We report the key findings and architectural insights, which can be used to guide the design of future accelerators.
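The abstract describes modeling transient logic errors in software and then running large fault injection campaigns. Purely as a rough illustration of what a software-level fault injection loop can look like, the Python sketch below flips a single bit in one float32 output neuron and checks whether the top-1 prediction differs from a fault-free (golden) run. This is not FIdelity's fault model: the function names, the single-neuron error pattern, the float32 data type, and the top-1 mismatch metric are all hypothetical assumptions. The framework's actual models, which the abstract says are derived from architectural properties of the accelerator, are not reproduced here.

```python
import math
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE-754 float32, with one bit inverted.

    Bit 0 is the least-significant mantissa bit; bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit  # transient error: invert exactly one bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


def inject_single_neuron_fault(outputs, rng):
    """Hypothetical fault model: one randomly chosen output neuron receives
    one random bit flip, as if a register holding that value were upset."""
    faulty = list(outputs)
    idx = rng.randrange(len(faulty))
    bit = rng.randrange(32)
    faulty[idx] = flip_bit_float32(faulty[idx], bit)
    return faulty, idx, bit


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def campaign(golden_outputs, trials, seed=0):
    """Run `trials` injections and return the fraction that corrupt the
    top-1 prediction (a NaN output is also counted as corrupted)."""
    rng = random.Random(seed)
    golden_top1 = argmax(golden_outputs)
    mismatches = 0
    for _ in range(trials):
        faulty, _, _ = inject_single_neuron_fault(golden_outputs, rng)
        if any(math.isnan(v) for v in faulty) or argmax(faulty) != golden_top1:
            mismatches += 1
    return mismatches / trials
```

For example, `campaign([0.1, 0.7, 0.2], 1000)` estimates how often a single random bit flip in one of three output scores changes the predicted class. A real campaign would inject into intermediate layers and use error patterns matched to the hardware, which is the gap the abstract says FIdelity addresses.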
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020 |
Publisher | IEEE Computer Society |
Pages | 270-281 |
Number of pages | 12 |
ISBN (Electronic) | 9781728173832 |
State | Published - Oct 2020 |
Externally published | Yes |
Event | 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020, Virtual, Athens, Greece; Duration: Oct 17 2020 → Oct 21 2020 |
Publication series
Name | Proceedings of the Annual International Symposium on Microarchitecture, MICRO |
---|---|
Volume | 2020-October |
ISSN (Print) | 1072-4451 |
Conference
Conference | 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020 |
---|---|
Country/Territory | Greece |
City | Virtual, Athens |
Period | 10/17/20 → 10/21/20 |
Bibliographical note
Publisher Copyright: © 2020 IEEE Computer Society. All rights reserved.