TY - GEN
T1 - Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads across Accelerators, Coprocessors, and Multicore Processors
AU - Cao, Chongxiao
AU - Gates, Mark
AU - Haidar, Azzam
AU - Luszczek, Piotr
AU - Tomov, Stanimire
AU - Yamazaki, Ichitaro
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014
Y1 - 2014
N2 - Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both performance and ease of use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication, and that many new and surprising usage scenarios are possible that rival those available after decades of software development on CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process, and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
UR - http://www.scopus.com/inward/record.url?scp=84988214474&partnerID=8YFLogxK
U2 - 10.1109/ScalA.2014.8
DO - 10.1109/ScalA.2014.8
M3 - Conference contribution
AN - SCOPUS:84988214474
T3 - Proceedings of ScalA 2014: 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 61
EP - 68
BT - Proceedings of ScalA 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2014
Y2 - 17 November 2014
ER -