Abstract
We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O(√(log n)) and latency by a factor of O(log n). For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of O(n^(1/3)). In all cases, the memory needed to achieve these gains is a small constant factor over that of the 2D algorithm. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm to heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (HALO) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
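To make the process-grid idea concrete, the following is a minimal sketch in C with MPI. It is not the SuperLU_DIST 3D interface; it only assumes a standalone program and illustrative names. It forms a balanced 3D Cartesian process grid and splits it into 2D layers along the Z dimension, the kind of structure on which independent elimination-tree subtrees could be assigned to different layers while a 2D-style factorization runs within each layer.

```c
/* Minimal sketch (not the SuperLU_DIST API): forming a 3D MPI process grid
 * and carving it into 2D layers. Dimensions and names are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI choose a balanced 3D factorization of nprocs (Pz x Py x Px). */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);

    int periods[3] = {0, 0, 0};
    MPI_Comm grid3d;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &grid3d);

    /* Each Z-layer is a 2D process grid. A layered scheme could map
     * independent elimination-tree subtrees to different layers and keep the
     * shared ancestor levels on one layer (hypothetical mapping, for
     * illustration only). */
    int coords[3];
    MPI_Cart_coords(grid3d, rank, 3, coords);

    MPI_Comm layer2d; /* processes sharing the same Z coordinate */
    MPI_Comm_split(grid3d, /*color=*/coords[0], /*key=*/rank, &layer2d);

    if (rank == 0)
        printf("3D grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&layer2d);
    MPI_Comm_free(&grid3d);
    MPI_Finalize();
    return 0;
}
```

Splitting by the Z coordinate yields one communicator per 2D layer, so communication within a layer involves only that layer's processes; this is the memory-for-communication trade-off hinted at above, sketched under the stated assumptions.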
| Original language | English |
| --- | --- |
| Pages (from-to) | 218-234 |
| Number of pages | 17 |
| Journal | Journal of Parallel and Distributed Computing |
| Volume | 131 |
| DOIs | |
| State | Published - Sep 2019 |
Funding
This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy (DOE) Office of Science and the National Nuclear Security Administration (NNSA). It used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231, as well as those of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Lastly, portions of this material are based upon work supported by the U.S. National Science Foundation (NSF) Award Number 1710371. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NNSA, NERSC, or the NSF.