cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs

Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, Antonio J. Peña

Research output: Contribution to journalArticlepeer-review

24 Scopus citations

Abstract

The solving of tridiagonal systems is one of the most computationally expensive parts in many applications, so that multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on using parallel algorithms to compute such systems, which can efficiently exploit the shared memory and are able to saturate the GPUs capacity with a low number of systems, presenting a poor scalability when dealing with a relatively high number of systems. The gtsvStridedBatch routine in the cuSPARSE NVIDIA package is one of these examples, which is used as reference in this article. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. Unlike other algorithms, the Thomas algorithm is sequential, and so a coarse-grained approach is implemented where one CUDA thread solves a complete tridiagonal system instead of one CUDA block as in gtsvStridedBatch. To achieve a good scalability using this approach, it is necessary to carry out a transformation in the way that the inputs are stored in memory to exploit coalescence (contiguous threads access to contiguous memory locations). Different variants regarding the transformation of the data are explored in detail. We also explore some variants for the case of variable batch, when the size of the systems of the batch has different size (cuThomasVBatch). The results given in this study prove that the implementations carried out in this work are able to beat the reference code, being up to 5× (in double precision) and 6× (in single precision) faster using the latest NVIDIA GPU architecture, the Pascal P100.

Original languageEnglish
Article numbere4909
JournalConcurrency and Computation: Practice and Experience
Volume30
Issue number24
DOIs
StatePublished - Dec 25 2018
Externally publishedYes

Funding

This project was funded from the European Union's Horizon 2020 research and innovation programme under grant agreement 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence and the valuable feedback provided by Lung Sheng Chien and Alex Fit-Florea. Antonio J. Peña was cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. European Union's Horizon 2020 research and innovation programme, Grant/Award Number: 720270 (HBP SGA1); Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII, Grant/Award Number: TIN2015-65316-P; Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR; Models de Programació i Entorns d'Execució Paral-lels, Grant/Award Number: 2014-SGR-1051

FundersFunder number
Computación de Altas Prestaciones VIITIN2015-65316-P
Lung Sheng Chien and Alex Fit-FloreaIJCI-2015-23266
NVIDIA
Generalitat de Catalunya2014-SGR-1051
Ministerio de Economía y Competitividad
Barcelona Supercomputing Center
Horizon 2020720270
Universitat Politècnica de Catalunya

    Keywords

    • CR
    • CUDA
    • PCR
    • Thomas algorithm
    • cuSPARSE
    • parallel processing
    • scalability
    • tridiagonal linear systems

    Fingerprint

    Dive into the research topics of 'cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs'. Together they form a unique fingerprint.

    Cite this