3D coded SUMMA: Communication-efficient and robust parallel matrix multiplication

Haewon Jeong, Yaoqing Yang, Vipul Gupta, Christian Engelmann, Tze Meng Low, Viveck Cadambe, Kannan Ramchandran, Pulkit Grover

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.

Original languageEnglish
Title of host publicationEuro-Par 2020
Subtitle of host publicationParallel Processing - 26th International Conference on Parallel and Distributed Computing, Proceedings
EditorsMaciej Malawski, Krzysztof Rzadca
PublisherSpringer
Pages392-407
Number of pages16
ISBN (Print)9783030576745
DOIs
StatePublished - 2020
Event26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 - Warsaw, Poland
Duration: Aug 24 2020Aug 28 2020

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume12247 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
Country/TerritoryPoland
CityWarsaw
Period08/24/2008/28/20

Funding

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/ downloads/doe-public-access-plan).. Acknowledgements and Data Availability Statement. This project was supported partially by NSF grant CCF 1763657 and supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12560 330 [19].

FundersFunder number
DOE Public Access Plan
United States Government
National Science FoundationCCF 1763657
U.S. Department of Energy
Office of Science
Advanced Scientific Computing ResearchDE-AC05-00OR22725

    Keywords

    • Algorithm-based fault tolerance
    • Coded computing
    • Communication-efficient algorithms
    • Error detection and correction
    • Fault-tolerant algorithms
    • Parallel matrix multiplication

    Fingerprint

    Dive into the research topics of '3D coded SUMMA: Communication-efficient and robust parallel matrix multiplication'. Together they form a unique fingerprint.

    Cite this