Parallel reduction to Hessenberg form with algorithm-based fault tolerance

Yulu Jia, George Bosilca, Piotr Luszczek, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

6 Scopus citations

Abstract

This paper studies the resilience of two-sided factorizations and presents a generic algorithm-based approach for making them resilient. We prove the correctness and numerical stability of the approach in the context of Hessenberg Reduction (HR), and we present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead decreases as the size of the matrix or the size of the process grid increases.
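The checksum idea behind ABFT can be illustrated with a small sketch. This is not the paper's algorithm or the ScaLAPACK PDGEHRD routine: it is a serial, pure-Python toy that assumes a single row-checksum column and a single Householder reflector applied from the left, and every function name in it is hypothetical. It shows only that the checksum relation survives the transformation, so an entry lost afterwards can be rebuilt from the surviving data.

```python
# Toy ABFT illustration (assumption: serial, one checksum column, one
# left-applied Householder reflector; not the authors' implementation).

def householder_vector(x):
    """Return a unit vector v such that (I - 2 v v^T) x is a multiple of e_1."""
    norm = sum(t * t for t in x) ** 0.5
    v = list(x)
    v[0] += norm if x[0] >= 0 else -norm  # sign choice avoids cancellation
    vnorm = sum(t * t for t in v) ** 0.5
    return [t / vnorm for t in v]

def apply_reflector_left(v, M):
    """Compute (I - 2 v v^T) M for a row-major list-of-lists M."""
    nrows, ncols = len(M), len(M[0])
    w = [sum(v[i] * M[i][j] for i in range(nrows)) for j in range(ncols)]  # v^T M
    return [[M[i][j] - 2.0 * v[i] * w[j] for j in range(ncols)]
            for i in range(nrows)]

def append_row_checksums(A):
    """Append one column holding the sum of each row."""
    return [row + [sum(row)] for row in A]

def recover_entry(M, i, j):
    """Rebuild data entry (i, j) from the checksum stored in the last column."""
    row = M[i]
    others = sum(row[k] for k in range(len(row) - 1) if k != j)
    return row[-1] - others

A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 1.0],
     [1.0, 2.0, 5.0]]

Ac = append_row_checksums(A)                    # [A, A e]
v = householder_vector([row[0] for row in A])   # reflector built from column 0
Ac = apply_reflector_left(v, Ac)                # Q [A, A e] = [Q A, (Q A) e]

truth = Ac[1][2]
Ac[1][2] = float("nan")                         # simulate a lost entry
recovered = recover_entry(Ac, 1, 2)
Ac[1][2] = recovered                            # restore it from the checksum
```

The checksum column stays valid because a left multiplication acts on it exactly as it acts on the data columns: (QA)e = Q(Ae). A two-sided reduction also multiplies from the right, which is where the additional protection described in the abstract comes in; this sketch does not attempt to model that part.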

Original language: English
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
State: Published - 2013
Externally published: Yes
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 to Nov 22 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 11/17/13 to 11/22/13

Funding

Funder: Japan Science and Technology Agency (funder numbers: 0904952, 1063019)
Funder: Oak Ridge National Laboratory

    Keywords

    • Algorithm-based fault tolerance
    • Dense linear algebra
    • Hessenberg reduction
    • Parallel numerical libraries
    • ScaLAPACK
