TY - GEN
T1 - Towards achieving performance portability using directives for accelerators
AU - Lopez, M. Graham
AU - Larrea, Veronica Vergara
AU - Joubert, Wayne
AU - Hernandez, Oscar
AU - Haidar, Azzam
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/30
Y1 - 2017/1/30
AB - In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both X86-64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, as well as an X86-64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target to multiple platforms.
UR - http://www.scopus.com/inward/record.url?scp=85015203631&partnerID=8YFLogxK
U2 - 10.1109/WACCPD.2016.006
DO - 10.1109/WACCPD.2016.006
M3 - Conference contribution
AN - SCOPUS:85015203631
T3 - Proceedings of WACCPD 2016: 3rd Workshop on Accelerator Programming using Directives - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 13
EP - 24
BT - Proceedings of WACCPD 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd Workshop on Accelerator Programming using Directives, WACCPD 2016
Y2 - 14 November 2016
ER -