TY - GEN
T1 - Toward supporting multi-gpu targets via taskloop and user-defined schedules
AU - Kale, Vivek
AU - Lu, Wenbin
AU - Curtis, Anthony
AU - Malik, Abid M.
AU - Chapman, Barbara
AU - Hernandez, Oscar
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - Many modern supercomputers such as ORNL’s Summit, LLNL’s Sierra, and LBL’s upcoming Perlmutter offer or will offer multiple, e.g., 4 to 8, GPUs per node for running computational science and engineering applications. One should expect an application to achieve speedup using multiple GPUs on a node of a supercomputer over a single GPU of the node, in particular an application that is embarrassingly parallel and load imbalanced, such as AutoDock, QMCPACK and DMRG++. OpenMP is a popular model used to run applications on heterogeneous devices of a node and OpenMP 5.x provides rich features for tasking and GPU offloading. However, OpenMP doesn’t provide significant support for running application code on multiple GPUs efficiently, in particular for the aforementioned applications. We provide different OpenMP task-to-GPU scheduling strategies that help distribute an application’s work across GPUs on a node for efficient parallel GPU execution. Our solution involves using OpenMP’s construct taskloop to generate OpenMP tasks containing target regions for OpenMP threads, and then having OpenMP threads assign those tasks to GPUs on a node through a schedule specified by the application programmer. We analyze the performance of our solution using a small benchmark code representative of the aforementioned applications. Our solution improves performance over a standard baseline assignment of tasks to GPUs by up to 57.2%. Further, based on our results, we suggest OpenMP extensions that could help an application programmer have his or her application run on multiple GPUs per node efficiently.
AB - Many modern supercomputers such as ORNL’s Summit, LLNL’s Sierra, and LBL’s upcoming Perlmutter offer or will offer multiple, e.g., 4 to 8, GPUs per node for running computational science and engineering applications. One should expect an application to achieve speedup using multiple GPUs on a node of a supercomputer over a single GPU of the node, in particular an application that is embarrassingly parallel and load imbalanced, such as AutoDock, QMCPACK and DMRG++. OpenMP is a popular model used to run applications on heterogeneous devices of a node and OpenMP 5.x provides rich features for tasking and GPU offloading. However, OpenMP doesn’t provide significant support for running application code on multiple GPUs efficiently, in particular for the aforementioned applications. We provide different OpenMP task-to-GPU scheduling strategies that help distribute an application’s work across GPUs on a node for efficient parallel GPU execution. Our solution involves using OpenMP’s construct taskloop to generate OpenMP tasks containing target regions for OpenMP threads, and then having OpenMP threads assign those tasks to GPUs on a node through a schedule specified by the application programmer. We analyze the performance of our solution using a small benchmark code representative of the aforementioned applications. Our solution improves performance over a standard baseline assignment of tasks to GPUs by up to 57.2%. Further, based on our results, we suggest OpenMP extensions that could help an application programmer have his or her application run on multiple GPUs per node efficiently.
KW - AutoDock
KW - High-performance
KW - Load balancing
KW - Multi GPUs
KW - Offload
KW - OpenMP
KW - Parallel
KW - Tasks
KW - User-defined scheduling
UR - http://www.scopus.com/inward/record.url?scp=85091322183&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58144-2_19
DO - 10.1007/978-3-030-58144-2_19
M3 - Conference contribution
AN - SCOPUS:85091322183
SN - 9783030581435
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 295
EP - 309
BT - OpenMP
A2 - Milfeld, Kent
A2 - Koesterke, Lars
A2 - de Supinski, Bronis R.
A2 - Klinkenberg, Jannis
PB - Springer Science and Business Media Deutschland GmbH
T2 - 16th International Workshop on OpenMP, IWOMP 2020
Y2 - 22 September 2020 through 24 September 2020
ER -