TY - GEN
T1 - 3D coded SUMMA
T2 - 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
AU - Jeong, Haewon
AU - Yang, Yaoqing
AU - Gupta, Vipul
AU - Engelmann, Christian
AU - Low, Tze Meng
AU - Cadambe, Viveck
AU - Ramchandran, Kannan
AU - Grover, Pulkit
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
AB - In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
KW - Algorithm-based fault tolerance
KW - Coded computing
KW - Communication-efficient algorithms
KW - Error detection and correction
KW - Fault-tolerant algorithms
KW - Parallel matrix multiplication
UR - http://www.scopus.com/inward/record.url?scp=85090094091&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-57675-2_25
DO - 10.1007/978-3-030-57675-2_25
M3 - Conference contribution
AN - SCOPUS:85090094091
SN - 9783030576745
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 392
EP - 407
BT - Euro-Par 2020
A2 - Malawski, Maciej
A2 - Rzadca, Krzysztof
PB - Springer
Y2 - 24 August 2020 through 28 August 2020
ER -