High-performance Cholesky factorization for GPU-only execution

Azzam Haidar, Ahmad Abdelfatah, Stanimire Tomov, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

We present our performance analysis, algorithm designs, and the optimizations needed to develop high-performance GPU-only algorithms, in particular for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are fine-grained tasks and edges are the dependencies between them, our designs explicitly target manycore architectures like GPUs and feature coarse-grained tasks (that can be hierarchically split into fine-grained data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code for execution entirely on the GPU. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to a slow CPU and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs at up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
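To make the idea of coarse-grained tasks concrete, the following is a minimal sketch of a textbook right-looking blocked Cholesky factorization in pure Python. It is not the authors' GPU implementation: the function names (`cholesky_unblocked`, `panel_solve`, `trailing_update`, `blocked_cholesky`) and the block size parameter `nb` are assumptions chosen for illustration, and the loops run sequentially on the CPU. Each of the three steps per block column is the kind of coarse task that could, in principle, be split into fine-grained data-parallel work.

```python
import math


def cholesky_unblocked(a, start, end):
    """Factor the diagonal block a[start:end][start:end] in place (lower part)."""
    for j in range(start, end):
        # Subtract the squares of the entries already computed in this row of the block.
        d = a[j][j] - sum(a[j][k] ** 2 for k in range(start, j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, end):
            s = sum(a[i][k] * a[j][k] for k in range(start, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def panel_solve(a, start, end, n):
    """Solve L21 * L11^T = A21 for the panel below the diagonal block (TRSM-like)."""
    for i in range(end, n):
        for j in range(start, end):
            s = sum(a[i][k] * a[j][k] for k in range(start, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def trailing_update(a, start, end, n):
    """Update the trailing lower triangle: A22 -= L21 * L21^T (SYRK-like)."""
    for i in range(end, n):
        for j in range(end, i + 1):
            a[i][j] -= sum(a[i][k] * a[j][k] for k in range(start, end))


def blocked_cholesky(matrix, nb=2):
    """Return lower-triangular L with L * L^T == matrix, using block size nb."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy, leave the input untouched
    for start in range(0, n, nb):
        end = min(start + nb, n)
        cholesky_unblocked(a, start, end)   # coarse task 1: diagonal block
        panel_solve(a, start, end, n)       # coarse task 2: panel below it
        trailing_update(a, start, end, n)   # coarse task 3: trailing matrix
    # Zero the strictly upper triangle so the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = 0.0
    return a


if __name__ == "__main__":
    spd = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in blocked_cholesky(spd, nb=2):
        print(row)
```

In this sketch the trailing update dominates the arithmetic for large matrices. That step is also the one with the most independent work per call, which is the general reason blocked formulations suit manycore hardware.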

Original language: English
Title of host publication: Proceedings of the General Purpose GPUs, GPGPU-10 2017
Publisher: Association for Computing Machinery, Inc
Pages: 42-52
Number of pages: 11
ISBN (Electronic): 9781450349154
State: Published - Feb 4 2017
Event: 10th Workshop on General Purpose GPUs, GPGPU 2017 - Austin, United States
Duration: Feb 4 2017 → Feb 8 2017

Publication series

Name: Proceedings of the General Purpose GPUs, GPGPU-10 2017

Conference

Conference: 10th Workshop on General Purpose GPUs, GPGPU 2017
Country/Territory: United States
City: Austin
Period: 02/4/17 → 02/8/17

Funding

This material is based upon work supported by the National Science Foundation under Grant No. ACI-1339822, the Department of Energy, and NVIDIA.

Funders (funder number)
National Science Foundation (ACI-1339822)
U.S. Department of Energy
NVIDIA

    Keywords

    • Factorization
    • Hardware accelerators
    • Numerical linear algebra
    • Numerical software libraries
    • One-sided factorization algorithms
