TY - GEN
T1 - A communication-Avoiding 3D LU factorization algorithm for sparse matrices
AU - Sao, Piyush
AU - Li, Xiaoye Sherry
AU - Vuduc, Richard
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/8/3
Y1 - 2018/8/3
N2 - We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O{log n} and latency by a factor of O{log n}. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O{n1/3} times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in superLU. Our new 3D code achieves speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D superLU when run on 24,000 cores of a Cray XC30.
AB - We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O{log n} and latency by a factor of O{log n}. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O{n1/3} times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in superLU. Our new 3D code achieves speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D superLU when run on 24,000 cores of a Cray XC30.
KW - Communication avoiding algorithm
KW - Nested dissection
KW - Sparse Gaussian elimination
KW - Sparse direct solver
UR - http://www.scopus.com/inward/record.url?scp=85052240344&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2018.00100
DO - 10.1109/IPDPS.2018.00100
M3 - Conference contribution
AN - SCOPUS:85052240344
SN - 9781538643686
T3 - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
SP - 908
EP - 919
BT - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018
Y2 - 21 May 2018 through 25 May 2018
ER -