Abstract
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
Original language | English |
---|---|
Title of host publication | Euro-Par 2020 |
Subtitle of host publication | Parallel Processing - 26th International Conference on Parallel and Distributed Computing, Proceedings |
Editors | Maciej Malawski, Krzysztof Rzadca |
Publisher | Springer |
Pages | 392-407 |
Number of pages | 16 |
ISBN (Print) | 9783030576745 |
DOIs | |
State | Published - 2020 |
Event | 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 - Warsaw, Poland Duration: Aug 24 2020 → Aug 28 2020 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12247 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 |
---|---|
Country/Territory | Poland |
City | Warsaw |
Period | 08/24/20 → 08/28/20 |
Funding
This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/ downloads/doe-public-access-plan).. Acknowledgements and Data Availability Statement. This project was supported partially by NSF grant CCF 1763657 and supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12560 330 [19].
Keywords
- Algorithm-based fault tolerance
- Coded computing
- Communication-efficient algorithms
- Error detection and correction
- Fault-tolerant algorithms
- Parallel matrix multiplication