MDLoader: A Hybrid Model-Driven Data Loader for Distributed Graph Neural Network Training

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Scalable data management is essential for processing large scientific datasets on HPC platforms for distributed deep learning. In-memory distributed storage is preferred for its speed, enabling the rapid, random, and frequent data access required by stochastic optimizers. Processes use one-sided or collective communication to fetch remote data, with optimal performance depending on (i) dataset characteristics, (ii) training scale, and (iii) interconnection network. Empirical analysis shows collective communication excels with larger mini-batch sizes and/or fewer processes, whereas one-sided communication outperforms at larger scales. We propose MDLoader, a hybrid in-memory data loader for distributed graph neural network training. MDLoader features a model-driven performance estimator that dynamically selects between one-sided and collective communication at the beginning of training using Tree of Parzen Estimators (TPE). Evaluations on NERSC Perlmutter and OLCF Summit show MDLoader outperforms single-backend loaders by up to 2.83× and predicts the suitable communication method with a success rate of 96.3% (Perlmutter) and 94.3% (Summit).
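
The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not MDLoader's estimator and does not use TPE; it is a minimal illustration assuming a simple latency-plus-bandwidth model in which collective overhead grows with the number of processes and one-sided overhead grows with the number of samples fetched. All names (`CostParams`, `choose_backend`) and all constants are hypothetical and would need to be measured on a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    # Hypothetical constants; real values would be measured per machine.
    sync_latency_s: float = 5e-6   # assumed per-process cost of joining a collective
    get_latency_s: float = 2e-6    # assumed per-sample cost of a one-sided get
    bandwidth_bytes_per_s: float = 1e10


def estimate_collective(num_procs, batch_size, sample_bytes, params):
    """Assumed model: one collective per mini-batch, overhead scales with process count."""
    transfer = batch_size * sample_bytes / params.bandwidth_bytes_per_s
    return params.sync_latency_s * num_procs + transfer


def estimate_one_sided(batch_size, sample_bytes, params):
    """Assumed model: one get per sample, overhead scales with mini-batch size."""
    transfer = batch_size * sample_bytes / params.bandwidth_bytes_per_s
    return params.get_latency_s * batch_size + transfer


def choose_backend(num_procs, batch_size, sample_bytes, params=CostParams()):
    """Pick the backend with the lower estimated per-batch fetch time."""
    collective = estimate_collective(num_procs, batch_size, sample_bytes, params)
    one_sided = estimate_one_sided(batch_size, sample_bytes, params)
    return "collective" if collective <= one_sided else "one-sided"


if __name__ == "__main__":
    # Few processes, large mini-batch.
    print(choose_backend(num_procs=8, batch_size=1024, sample_bytes=4096))
    # Many processes, small mini-batch.
    print(choose_backend(num_procs=4096, batch_size=32, sample_bytes=4096))
```

Under these assumed constants, the first call favors collective communication and the second favors one-sided communication, matching the qualitative trend reported in the abstract. A model-driven loader would replace the fixed constants with parameters fitted to measurements.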

Original language: English
Title of host publication: Proceedings of SC 2024-W
Subtitle of host publication: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1046-1057
Number of pages: 12
ISBN (Electronic): 9798350355543
State: Published - 2024
Event: 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 - Atlanta, United States
Duration: Nov 17 2024 → Nov 22 2024

Publication series

Name: Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024
Country/Territory: United States
City: Atlanta
Period: 11/17/24 → 11/22/24

Funding

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Science, Scientific Discovery through Advanced Computing (SciDAC) program. This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under awards ERCAP0025216 and ERCAP0027259. This work also used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under INCITE award CPH161. This research is partially supported by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • Graph Neural Network
  • MPI communication
  • Performance estimator
