TY - GEN
T1 - clMAGMA
T2 - International Workshop on OpenCL 2013 and 2014, IWOCL 2014
AU - Cao, Chongxiao
AU - Gates, Mark
AU - Dongarra, Jack
AU - Luszczek, Piotr
AU - Du, Peng
AU - Tomov, Stanimire
N1 - Publisher Copyright: Copyright 2014 ACM.
PY - 2014/5/12
Y1 - 2014/5/12
AB - This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates various optimizations, and in general provides the DLA functionality of the popular LAPACK library on heterogeneous architectures. The LAPACK compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portable performance. High performance is obtained through the use of the high-performance OpenCL BLAS, hardware- and OpenCL-specific tuning, and a hybridization methodology, where we split the algorithm into computational tasks of various granularities. Execution of those tasks is efficiently scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
UR - http://www.scopus.com/inward/record.url?scp=84985025955&partnerID=8YFLogxK
U2 - 10.1145/2664666.2664667
DO - 10.1145/2664666.2664667
M3 - Conference contribution
AN - SCOPUS:84985025955
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the International Workshop on OpenCL 2013 and 2014, IWOCL 2014
PB - Association for Computing Machinery
Y2 - 11 May 2014 through 12 May 2014
ER -