TY - GEN
T1 - Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures
AU - Xie, Chenhao
AU - Chen, Jieyang
AU - Firoz, Jesun
AU - Li, Jiajia
AU - Song, Shuaiwen Leon
AU - Barker, Kevin
AU - Raugas, Mark
AU - Li, Ang
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/8/9
Y1 - 2021/8/9
N2 - Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a challenging task due to significant irregular memory references and workload imbalance across GPUs. These challenges are particularly compounded in the case of the Sparse Triangular Solver (SpTRSV), which introduces the additional complexity of two-dimensional computation dependencies among subsequent computation steps. Dependency information may need to be exchanged and shared among GPUs, warranting efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we focus on designing an algorithm for SpTRSV in a single-node, multi-GPU setting. We demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even with fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that further enhances GPU utilization. Applying these techniques, our experiments on the NVIDIA multi-GPU V100-based DGX-1 and DGX-2 systems demonstrate that our design achieves an average speedup of 3.53× (up to 9.86×) on a DGX-1 system and 3.66× (up to 9.64×) on a DGX-2 system with four GPUs over the Unified-Memory design. Comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV fully utilizes the computing and communication resources of multi-GPU systems.
KW - Multi-GPU Systems
KW - Sparse Linear Algebra Kernels
KW - Task Model
KW - Triangular Solver
UR - http://www.scopus.com/inward/record.url?scp=85117237881&partnerID=8YFLogxK
U2 - 10.1145/3472456.3472478
DO - 10.1145/3472456.3472478
M3 - Conference contribution
AN - SCOPUS:85117237881
T3 - ACM International Conference Proceeding Series
BT - 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 50th International Conference on Parallel Processing, ICPP 2021
Y2 - 9 August 2021 through 12 August 2021
ER -