Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Chenhao Xie, Jieyang Chen, Jesun Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a challenging task due to significant irregular memory references and workload imbalance across GPUs. These challenges are compounded in the case of the Sparse Triangular Solver (SpTRSV), which introduces the additional complexity of two-dimensional computation dependencies among subsequent computation steps. Dependency information may need to be exchanged and shared among GPUs, which calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we focus on designing an algorithm for SpTRSV in a single-node, multi-GPU setting. We demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked via fast interconnects such as NVLink and NVSwitch. As an alternative, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space (PGAS) programming model, to enable efficient fine-grained communication and a drastic reduction in synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve an average speedup of 3.53× (up to 9.86×) on a DGX-1 system and 3.66× (up to 9.64×) on a DGX-2 system with four GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU systems.
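
To make the dependency structure mentioned in the abstract concrete, the following is a minimal serial sketch of a lower-triangular solve in CSR format. It is not the authors' multi-GPU design and uses no NVSHMEM or task pool; the function name `sptrsv_lower_csr`, the CSR argument names, and the `level` bookkeeping are assumptions introduced purely for illustration. It shows why row i cannot be finished until every x[j] it references is known, which is the information that would have to be shared among GPUs in a distributed setting.

```python
def sptrsv_lower_csr(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution, with L lower triangular in CSR.

    Also returns level[i], the length of the longest dependency chain
    ending at row i. Rows with equal level do not depend on each other.
    (Illustrative serial reference only, not the paper's algorithm.)
    """
    n = len(b)
    x = [0.0] * n
    level = [0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]          # diagonal entry of row i
            else:
                acc -= vals[k] * x[j]   # requires x[j] from an earlier row
                level[i] = max(level[i], level[j] + 1)
        if diag is None or diag == 0:
            raise ValueError(f"row {i} has no nonzero diagonal entry")
        x[i] = acc / diag
    return x, level


# L = [[2, 0, 0],
#      [0, 1, 0],
#      [1, 3, 4]],  b = [2, 3, 15]
x, level = sptrsv_lower_csr(
    [0, 1, 2, 5], [0, 1, 0, 1, 2], [2.0, 1.0, 1.0, 3.0, 4.0], [2.0, 3.0, 15.0]
)
print(x)      # [1.0, 3.0, 1.25]
print(level)  # [0, 0, 1]
```

In this small example, rows 0 and 1 are independent and could be solved concurrently, while row 2 must wait for both. When such rows live on different GPUs, each solved component has to be communicated before dependent rows can proceed.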

Original language: English
Title of host publication: 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450390682
State: Published - Aug 9 2021
Event: 50th International Conference on Parallel Processing, ICPP 2021 - Virtual, Online, United States
Duration: Aug 9 2021 to Aug 12 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 50th International Conference on Parallel Processing, ICPP 2021
Country/Territory: United States
City: Virtual, Online
Period: 08/9/21 to 08/12/21

Keywords

  • Multi-GPU Systems
  • Sparse Linear Algebra Kernels
  • Task Model
  • Triangular Solver
