Optimizing Data Movement for GPU-Based In-Situ Workflow Using GPUDirect RDMA

  • Bo Zhang
  • Philip E. Davis
  • Nicolas Morales
  • Zhao Zhang
  • Keita Teranishi
  • Manish Parashar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

The extreme-scale computing landscape is increasingly dominated by GPU-accelerated systems. At the same time, in-situ workflows that employ memory-to-memory inter-application data exchanges have emerged as an effective approach for leveraging these extreme-scale systems. In the case of GPUs, GPUDirect RDMA enables third-party devices, such as network interface cards, to access GPU memory directly and has been adopted for intra-application communications across GPUs. In this paper, we present an interoperable framework for GPU-based in-situ workflows that optimizes data movement using GPUDirect RDMA. Specifically, we analyze the characteristics of the possible data movement pathways between GPUs from an in-situ workflow perspective, and design a strategy that maximizes throughput. Furthermore, we implement this approach as an extension of the DataSpaces data staging service, and experimentally evaluate its performance and scalability on a current leadership GPU cluster. The performance results show that the proposed design reduces data-movement time by up to 53% and 40% for the sender and receiver, respectively, and maintains excellent scalability for up to 256 GPUs.

Original language: English
Title of host publication: Euro-Par 2023
Subtitle of host publication: Parallel Processing - 29th International Conference on Parallel and Distributed Computing, Proceedings
Editors: José Cano, Marios D. Dikaiakos, George A. Papadopoulos, Miquel Pericàs, Rizos Sakellariou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 323-338
Number of pages: 16
ISBN (Print): 9783031396977
DOIs
State: Published - 2023
Event: 29th International European Conference on Parallel and Distributed Computing, Euro-Par 2023 - Limassol, Cyprus
Duration: Aug 28 2023 - Sep 1 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14100 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th International European Conference on Parallel and Distributed Computing, Euro-Par 2023
Country/Territory: Cyprus
City: Limassol
Period: 08/28/23 - 09/1/23

Funding

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (NNSA) under contract DE-NA0003525. This work was funded by NNSA’s Advanced Simulation and Computing (ASC) Program. This manuscript has been authored by UT-Battelle LLC under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). This work is also based upon work by the RAPIDS2 Institute supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research through the Scientific Discovery through Advanced Computing (SciDAC) program under Award Number DE-SC0023130. The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.23535855 [4]. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Keywords

  • Extreme-Scale Data Management
  • GPU
  • GPUDirect RDMA
  • In-Situ
  • Workflow
