Toward supporting multi-GPU targets via taskloop and user-defined schedules

Vivek Kale, Wenbin Lu, Anthony Curtis, Abid M. Malik, Barbara Chapman, Oscar Hernandez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

Many modern supercomputers, such as ORNL’s Summit, LLNL’s Sierra, and LBL’s upcoming Perlmutter, offer or will offer multiple GPUs per node, e.g., 4 to 8, for running computational science and engineering applications. One should expect an application to achieve speedup using multiple GPUs on a node of a supercomputer over a single GPU of the node, in particular an application that is embarrassingly parallel and load imbalanced, such as AutoDock, QMCPACK, and DMRG++. OpenMP is a popular model for running applications on the heterogeneous devices of a node, and OpenMP 5.x provides rich features for tasking and GPU offloading. However, OpenMP does not provide significant support for running application code efficiently on multiple GPUs, in particular for the aforementioned applications. We provide different OpenMP task-to-GPU scheduling strategies that help distribute an application’s work across the GPUs on a node for efficient parallel GPU execution. Our solution uses OpenMP’s taskloop construct to generate OpenMP tasks containing target regions for OpenMP threads, and then has the OpenMP threads assign those tasks to GPUs on the node through a schedule specified by the application programmer. We analyze the performance of our solution using a small benchmark code representative of the aforementioned applications. Our solution improves performance over a standard baseline assignment of tasks to GPUs by up to 57.2%. Further, based on our results, we suggest OpenMP extensions that could help an application programmer run his or her application efficiently on multiple GPUs per node.
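
For illustration, the following is a minimal sketch of the pattern the abstract describes, not the authors' implementation: a taskloop generates one task per chunk of work, and each task offloads its chunk to a GPU chosen by a user-specified schedule, here a simple round-robin over the available devices. The array name data, the chunk counts and sizes, and the placeholder kernel are assumptions made only for this example.

/*
 * Minimal sketch (NOT the authors' code) of taskloop-generated tasks
 * containing target regions, with a user-defined task-to-GPU schedule.
 * The array `data`, chunk sizes, and kernel body are placeholders.
 */
#include <omp.h>
#include <stdio.h>

#define N_CHUNKS   64
#define CHUNK_SIZE 4096

static double data[N_CHUNKS][CHUNK_SIZE];

int main(void)
{
    int num_gpus = omp_get_num_devices();

    #pragma omp parallel
    #pragma omp single
    {
        /* grainsize(1) generates one task per iteration, so each task
         * corresponds to exactly one chunk and one target region. */
        #pragma omp taskloop grainsize(1)
        for (int c = 0; c < N_CHUNKS; ++c) {
            /* User-defined task-to-GPU schedule: round-robin over the
             * available devices; a load-aware schedule would choose
             * the device differently. */
            int dev = (num_gpus > 0) ? (c % num_gpus)
                                     : omp_get_initial_device();

            /* Offload this chunk to the chosen device; falls back to
             * the host device if no GPU is available. */
            #pragma omp target teams distribute parallel for \
                device(dev) map(tofrom: data[c][0:CHUNK_SIZE])
            for (int i = 0; i < CHUNK_SIZE; ++i)
                data[c][i] = 2.0 * data[c][i] + 1.0;  /* placeholder kernel */
        }
    }

    printf("Processed %d chunks across %d device(s).\n", N_CHUNKS, num_gpus);
    return 0;
}

In this sketch, the baseline the abstract compares against would correspond to a fixed task-to-GPU assignment, while a user-defined schedule amounts to replacing the round-robin expression above with a different device-selection rule.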

Original language: English
Title of host publication: OpenMP
Subtitle of host publication: Portable Multi-Level Parallelism on Modern Systems - 16th International Workshop on OpenMP, IWOMP 2020, Proceedings
Editors: Kent Milfeld, Lars Koesterke, Bronis R. de Supinski, Jannis Klinkenberg
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 295-309
Number of pages: 15
ISBN (Print): 9783030581435
DOIs
State: Published - 2020
Event: 16th International Workshop on OpenMP, IWOMP 2020 - Austin, United States
Duration: Sep 22 2020 - Sep 24 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12295 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th International Workshop on OpenMP, IWOMP 2020
Country/Territory: United States
City: Austin
Period: 09/22/20 - 09/24/20

Funding

Acknowledgements. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, in particular its subproject on Scaling OpenMP with LLVM for Exascale performance and portability (SOLLVE). It is also supported in part by NSF project 1409946, “Compute on Data Path”. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University, for access to the high-performance SeaWulf computing system, which was made possible by a $1.4M National Science Foundation grant (#1531492). We thank Jeremy Smith and Ada Sedova of Oak Ridge National Laboratory for providing a small sample of input sets for the AutoDock-GPU experiments to help us study the application workload. We also acknowledge the QMCPACK team at ORNL for discussing their code with respect to application load imbalances.
Notice of Copyright: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders and funder numbers:
• DOE Public Access Plan: 17-SC-20-SC
• U.S. Department of Energy Office of Science
• United States Government
• National Science Foundation: 1409946, 1531492
• U.S. Department of Energy
• Office of Science
• National Nuclear Security Administration
• Advanced Scientific Computing Research: DE-AC05-00OR22725
• Oak Ridge National Laboratory

Keywords

• AutoDock
• High-performance
• Load balancing
• Multi GPUs
• Offload
• OpenMP
• Parallel
• Tasks
• User-defined scheduling
