A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Piyush Sao, Xiaoye S. Li, Richard Vuduc

Research output: Contribution to journalArticlepeer-review

14 Scopus citations

Abstract

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of Ologn and latency by a factor of Ologn. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by On1/3 times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SUPERLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SUPERLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (HALO) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SUPERLU_DIST with co-processor acceleration.

Original languageEnglish
Pages (from-to)218-234
Number of pages17
JournalJournal of Parallel and Distributed Computing
Volume131
DOIs
StatePublished - Sep 2019

Funding

This research was supported in part by the Exascale Computing Project ( 17-SC-20-SC ), a collaborative effort of the U.S. Department of Energy (DOE) Office of Science and the National Nuclear Security Administration (NNSA) . It used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231, as well as those of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725 . Lastly, portions of this material are based upon work supported by the U.S. National Science Foundation (NSF) Award Numbers 1710371 . Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DOE, NNSA, NERSC, or the NSF.

Fingerprint

Dive into the research topics of 'A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems'. Together they form a unique fingerprint.

Cite this