Correcting soft errors online in fast Fourier transform

Xin Liang, Jieyang Chen, Dingwen Tao, Sihuan Li, Panruo Wu, Hongbo Li, Kaiming Ouyang, Yuanlai Liu, Fengguang Song, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.
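To make the checksum idea behind ABFT for FFT concrete, here is a minimal sketch in pure Python. It is not the scheme from the paper and does not use FFTW. It relies only on the fact that the DFT matrix F is symmetric, so for any weight vector r the identity r·(Fx) = (Fr)·x holds and can be tested after a transform. The weights r_k = k + 1, the tolerance, and the names `passes_checksum`, `fft_with_early_check`, and `corrupt` are all assumptions chosen for illustration. The second function hints at what "online" could mean: the half-size sub-FFTs are verified before the final combine step, so a corrupted run can stop early.

```python
import cmath

def combine(even, odd):
    """Radix-2 butterfly: merge two half-size DFTs into one full-size DFT."""
    half = len(even)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out

def fft(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    if len(x) == 1:
        return [complex(x[0])]
    return combine(fft(x[0::2]), fft(x[1::2]))

def passes_checksum(x, X, tol=1e-9):
    """Return True if X is consistent with being the DFT of x.

    Uses r.(F x) == (F r).x, valid because the DFT matrix F is symmetric.
    The weights r_k = k + 1 are an illustrative assumption, not the paper's.
    """
    n = len(x)
    r = [complex(k + 1) for k in range(n)]
    c = fft(r)  # in practice this would be precomputed once per size
    lhs = sum(rk * Xk for rk, Xk in zip(r, X))
    rhs = sum(cj * xj for cj, xj in zip(c, x))
    # Scale the tolerance by the magnitude of the terms being summed.
    scale = 1.0 + sum(abs(rk) * abs(Xk) for rk, Xk in zip(r, X))
    return abs(lhs - rhs) <= tol * scale

def fft_with_early_check(x, corrupt=None):
    """Verify both half-size sub-FFTs before the final combine step.

    `corrupt` is a hypothetical hook that mutates the sub-results in place
    to simulate a soft error; it exists only for demonstration.
    """
    halves = [x[0::2], x[1::2]]
    subs = [fft(h) for h in halves]
    if corrupt is not None:
        corrupt(subs)
    for h, s in zip(halves, subs):
        if not passes_checksum(h, s):
            raise ArithmeticError("soft error detected before final combine")
    return combine(subs[0], subs[1])

if __name__ == "__main__":
    x = [complex(v) for v in range(1, 9)]
    X = fft(x)
    print(passes_checksum(x, X))      # clean result
    X[3] += 0.5                       # simulate a corrupted output element
    print(passes_checksum(x, X))      # corrupted result
```

A single linear checksum like this can only detect an error, and it misses any corruption that happens to be orthogonal to the weights. Tolerating memory errors, correcting values, and keeping the comparison numerically stable for large inputs all need more than this sketch provides.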

Original language: English
Title of host publication: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450351140
State: Published - Nov 12 2017
Externally published: Yes
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017 - Denver, United States
Duration: Nov 12 2017 to Nov 17 2017

Publication series

Name: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Conference

Conference: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
Country/Territory: United States
City: Denver
Period: 11/12/17 to 11/17/17

Funding

This work is partially supported by the NSF grants OAC-1305624, CCF-1513201, the SZSTI basic research program JCYJ2015063011494-2313, and the MOST key project 2017YFB0202100.

Funder: Funder number
SZSTI: JCYJ2015063011494-2313
National Science Foundation: OAC-1305624, CCF-1513201
Ministry of Science and Technology: 2017YFB0202100

    Keywords

    • Algorithm-Based fault tolerance
    • DFT
    • FFT
    • FFTW
    • Soft errors
