Distributed Training for High Resolution Images: A Domain and Spatial Decomposition Approach

Aristeidis Tsaris, Jacob Hinkle, Dalton Lunga, Philipe Ambrozio Dias

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

1 Scopus citation

Abstract

In this work we developed two PyTorch libraries, built on the PyTorch RPC interface, for distributed deep learning on high-resolution images. The spatial decomposition library enables distributed training on very large images that would otherwise be too large to train on with a single GPU. The domain parallelism library enables distributed training across unlabeled data from multiple domains by leveraging the domain separation architecture. Both libraries were tested on the Summit supercomputer at Oak Ridge National Laboratory at a moderate scale.
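
As a rough illustration of the spatial decomposition idea, the sketch below computes how a large image could be cut into a grid of tiles, one per worker. It is not the authors' library and does not use PyTorch RPC; the function name `tile_bounds` and the `halo` overlap parameter are hypothetical, and the halo (extra border pixels so that convolutions near tile edges see their neighbors) is an assumption not stated in the abstract.

```python
def tile_bounds(height, width, rows, cols, halo=0):
    """Return (row_start, row_stop, col_start, col_stop) for each tile.

    The image is cut into a rows x cols grid. Each tile is widened by
    `halo` pixels on every side (an assumed overlap, hypothetical here)
    and clipped at the image border. Stops are exclusive.
    """
    if rows < 1 or cols < 1 or rows > height or cols > width:
        raise ValueError("grid must have between 1 and height/width tiles per axis")
    if halo < 0:
        raise ValueError("halo must be non-negative")

    bounds = []
    for r in range(rows):
        # Integer arithmetic spreads any remainder across the tiles.
        r0 = r * height // rows
        r1 = (r + 1) * height // rows
        for c in range(cols):
            c0 = c * width // cols
            c1 = (c + 1) * width // cols
            bounds.append((
                max(r0 - halo, 0), min(r1 + halo, height),
                max(c0 - halo, 0), min(c1 + halo, width),
            ))
    return bounds


# Example: an 8 x 8 image split across a 2 x 2 grid of workers,
# with a one-pixel halo around each tile.
tiles = tile_bounds(8, 8, rows=2, cols=2, halo=1)
```

Each tuple could then be used to slice the image (for instance `image[r0:r1, c0:c1]`) before sending the tile to its worker; how the libraries described here actually partition and exchange data is not specified in the abstract.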

Original language: English
Title of host publication: Proceedings of RSDHA 2021
Subtitle of host publication: Redefining Scalability for Diversely Heterogeneous Architectures, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 27-33
Number of pages: 7
ISBN (Electronic): 9781665458771
State: Published - 2021
Event: 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures, RSDHA 2021 - St. Louis, United States
Duration: Nov 19 2021 → …

Publication series

Name: Proceedings of RSDHA 2021: Redefining Scalability for Diversely Heterogeneous Architectures, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures, RSDHA 2021
Country/Territory: United States
City: St. Louis
Period: 11/19/21 → …

Funding

This research is sponsored by the AI Initiative as part of the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. This research used resources at the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility operated by the Oak Ridge National Laboratory. This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • distributed deep learning
  • high resolution images
  • scalability
