A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations

Azzam Haidar, Tingxing Tim Dong, Stanimire Tomov, Piotr Luszczek, Jack Dongarra

Research output: Contribution to journal › Conference article › peer-review

22 Scopus citations

Abstract

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to hybrid CPU-GPU algorithms, which rely heavily on the multicore CPU for specific parts of the workload. For a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the multicore CPU must be a primary design goal, so that the system can rely more heavily on the more efficient GPU. Avoiding the CPU also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
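The batched mode of operation described in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' GPU implementation: it uses the plain (unblocked) Householder QR rather than the block variant, runs in pure Python, and the function name `batched_householder_qr` is hypothetical. It only shows the loop structure, in which each step of the factorization is applied to every matrix in the batch before the next step begins.

```python
import math


def batched_householder_qr(batch):
    """Factor every m-by-n matrix (m >= n) in `batch` as A = Q R.

    Illustrative sketch only: unblocked Householder QR on the CPU.
    All matrices are assumed to share the same shape.
    Returns a list of (Q, R) pairs; the inputs are not modified.
    """
    m, n = len(batch[0]), len(batch[0][0])
    Rs = [[row[:] for row in A] for A in batch]  # working copies, become R
    Qs = [[[float(i == j) for j in range(m)] for i in range(m)] for _ in batch]

    for k in range(n):                # one factorization step ...
        for R, Q in zip(Rs, Qs):      # ... applied across the whole batch
            # Householder vector v for column k, rows k..m-1
            x = [R[i][k] for i in range(k, m)]
            norm_x = math.sqrt(sum(t * t for t in x))
            if norm_x == 0.0:
                continue              # nothing to eliminate in this column
            alpha = -math.copysign(norm_x, x[0])
            v = x[:]
            v[0] -= alpha             # v = x - alpha * e1
            vtv = sum(t * t for t in v)

            # R <- H R with H = I - 2 v v^T / (v^T v), rows k..m-1
            for j in range(k, n):
                s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k)) / vtv
                for i in range(m - k):
                    R[k + i][j] -= s * v[i]

            # Q <- Q H, columns k..m-1, so that Q R stays equal to A
            for r in range(m):
                s = 2.0 * sum(Q[r][k + i] * v[i] for i in range(m - k)) / vtv
                for i in range(m - k):
                    Q[r][k + i] -= s * v[i]

    return list(zip(Qs, Rs))


if __name__ == "__main__":
    # Two small 3-by-2 problems factored together as one batch
    batch = [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ]
    for Q, R in batched_householder_qr(batch):
        print(R)
```

On a GPU, the inner loop over the batch would correspond to a single batched kernel launch per step, which is the sense in which the algorithm becomes a sequence of batched routines.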

Original language: English
Article number: A3
Pages (from-to): 31-47
Number of pages: 17
Journal: Lecture Notes in Computer Science
Volume: 9137 LNCS
State: Published - 2015
Event: 30th International Conference on High Performance Computing, ISC 2015 - Frankfurt, Germany
Duration: Jul 12 2015 → Jul 16 2015

Funding

This material is based upon work supported by the National Science Foundation under Grant No. ACI-1339822, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund, Agreement N14-11-00190.

Funders                          Funder number
Department of Energy, and Intel
Russian Scientific Fund          N14-11-00190
National Science Foundation      ACI-1339822
