TY - GEN
T1 - Evaluating OpenMP affinity on the POWER8 architecture
AU - Pophale, Swaroop
AU - Hernandez, Oscar
N1 - Publisher Copyright: © Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - As we move toward pre-Exascale systems, two of the DOE leadership class systems will consist of very powerful OpenPOWER compute nodes, which will be more complex to program. These systems will have massive amounts of parallelism, where threads may be running on POWER9 cores as well as on accelerators. Advances in memory interconnects, such as NVLINK, will provide a unified shared memory address space for different types of memory (HBM, DRAM, etc.). In preparation for such systems, we need to improve our understanding of how OpenMP supports the concept of affinity as well as memory placement on POWER8 systems. Data locality and affinity are key program optimizations for exploiting the compute and memory capabilities and achieving good performance by minimizing data motion across NUMA domains and accessing the cache efficiently. This paper is the first step toward evaluating the current features of OpenMP 4.0 on POWER8 processors and measuring their effects on a system with two POWER8 sockets. We experiment with the different affinity settings provided by OpenMP 4.0 to quantify the cost of having good data locality versus not, and measure their effects via hardware counters. We also determine which affinity settings benefit more from data locality. Based on this study, we describe the current state of the art, the challenges we faced in quantifying the effects of affinity, and ideas on how OpenMP 5.0 should be improved to address affinity in the context of NUMA domains and accelerators.
AB - As we move toward pre-Exascale systems, two of the DOE leadership class systems will consist of very powerful OpenPOWER compute nodes, which will be more complex to program. These systems will have massive amounts of parallelism, where threads may be running on POWER9 cores as well as on accelerators. Advances in memory interconnects, such as NVLINK, will provide a unified shared memory address space for different types of memory (HBM, DRAM, etc.). In preparation for such systems, we need to improve our understanding of how OpenMP supports the concept of affinity as well as memory placement on POWER8 systems. Data locality and affinity are key program optimizations for exploiting the compute and memory capabilities and achieving good performance by minimizing data motion across NUMA domains and accessing the cache efficiently. This paper is the first step toward evaluating the current features of OpenMP 4.0 on POWER8 processors and measuring their effects on a system with two POWER8 sockets. We experiment with the different affinity settings provided by OpenMP 4.0 to quantify the cost of having good data locality versus not, and measure their effects via hardware counters. We also determine which affinity settings benefit more from data locality. Based on this study, we describe the current state of the art, the challenges we faced in quantifying the effects of affinity, and ideas on how OpenMP 5.0 should be improved to address affinity in the context of NUMA domains and accelerators.
UR - http://www.scopus.com/inward/record.url?scp=84992538358&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-45550-1_3
DO - 10.1007/978-3-319-45550-1_3
M3 - Conference contribution
AN - SCOPUS:84992538358
SN - 9783319455495
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 35
EP - 46
BT - OpenMP
A2 - Maruyama, Naoya
A2 - Wahib, Mohamed
A2 - de Supinski, Bronis R.
PB - Springer Verlag
T2 - 12th International Workshop on OpenMP, IWOMP 2016
Y2 - 5 October 2016 through 7 October 2016
ER -