FT-iSort: Efficient fault tolerance for introsort

Sihuan Li, Hongbo Li, Xin Liang, Jieyang Chen, Elisabeth Giem, Kaiming Ouyang, Kai Zhao, Sheng Di, Franck Cappello, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

8 Scopus citations

Abstract

Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than nave TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.

Original languageEnglish
Title of host publicationProceedings of SC 2019
Subtitle of host publicationThe International Conference for High Performance Computing, Networking, Storage and Analysis
PublisherIEEE Computer Society
ISBN (Electronic)9781450362290
DOIs
StatePublished - Nov 17 2019
Event2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States
Duration: Nov 17 2019Nov 22 2019

Publication series

NameInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print)2167-4329
ISSN (Electronic)2167-4337

Conference

Conference2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
Country/TerritoryUnited States
CityDenver
Period11/17/1911/22/19

Funding

The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 1619253. This work was also supported by National Science Foundation CCF 1513201 and National Key Research and Development Programs No. 2017YFB0202100. We acknowledge the computing resources provided on Bebop, which is operated by the Laboratory Computing Resource Center at Argonne National Laboratory. We thank the anonymous reviewers for their insightful suggestions.

FundersFunder number
National Key Research and Development Programs2017YFB0202100
National Science Foundation CCF 1513201 and National Key ResearchCCF 1513201
National Science Foundation1619253
U.S. Department of Energy
Office of ScienceDE-AC02-06CH11357

    Keywords

    • Algorithm based fault tolerance
    • Comparison errors
    • Fault tolerant sorting
    • Introsort
    • Soft errors

    Fingerprint

    Dive into the research topics of 'FT-iSort: Efficient fault tolerance for introsort'. Together they form a unique fingerprint.

    Cite this