TY - GEN
T1 - Optimizing memory-bound SYMV kernel on GPU hardware accelerators
AU - Abdelfattah, Ahmad
AU - Dongarra, Jack
AU - Keyes, David
AU - Ltaief, Hatem
PY - 2013
Y1 - 2013
AB - Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product (SYMV) on NVIDIA Fermi GPUs. Due to its inherently memory-bound nature, this kernel is critical to the tridiagonalization of a symmetric dense matrix, a preprocessing step in computing the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the equivalent CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
UR - http://www.scopus.com/inward/record.url?scp=84883307572&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-38718-0_10
DO - 10.1007/978-3-642-38718-0_10
M3 - Conference contribution
AN - SCOPUS:84883307572
SN - 9783642387173
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 72
EP - 79
BT - High Performance Computing for Computational Science, VECPAR 2012 - 10th International Conference, Revised Selected Papers
T2 - 10th International Conference on High Performance Computing for Computational Science, VECPAR 2012
Y2 - 17 July 2012 through 20 July 2012
ER -