Abstract
In this work we developed two PyTorch libraries, built on the PyTorch RPC interface, for distributed deep learning on high-resolution images. The spatial decomposition library enables distributed training on very large images, for which training would not otherwise be possible on a single GPU. The domain parallelism library enables distributed training across unlabeled data from multiple domains by leveraging the domain separation architecture. Both libraries were tested at moderate scale on the Summit supercomputer at Oak Ridge National Laboratory.
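To make the idea of spatial decomposition concrete, the following is a minimal sketch, not the libraries' actual API, of how a large image might be divided into a grid of tiles that separate workers could process. The function name `tile_bounds`, its parameters, and the use of a `halo` overlap region are all illustrative assumptions; the abstract does not describe how the libraries partition images or exchange boundary data.

```python
from typing import List, Tuple

# (row_start, row_end, col_start, col_end), end-exclusive
Tile = Tuple[int, int, int, int]


def tile_bounds(height: int, width: int, rows: int, cols: int,
                halo: int = 0) -> List[Tile]:
    """Split a height x width image into a rows x cols grid of tiles.

    Hypothetical helper for illustration only. Each tile is extended by
    `halo` pixels on every side (clipped at the image border) so that an
    operation such as a convolution can see pixels from neighboring tiles.
    """
    if rows < 1 or cols < 1 or rows > height or cols > width:
        raise ValueError("grid must fit inside the image")
    if halo < 0:
        raise ValueError("halo must be non-negative")

    tiles: List[Tile] = []
    for r in range(rows):
        # Integer arithmetic spreads any remainder across the tiles.
        r0 = r * height // rows
        r1 = (r + 1) * height // rows
        for c in range(cols):
            c0 = c * width // cols
            c1 = (c + 1) * width // cols
            tiles.append((
                max(r0 - halo, 0), min(r1 + halo, height),
                max(c0 - halo, 0), min(c1 + halo, width),
            ))
    return tiles
```

Under these assumptions, each tuple could be used to slice one sub-image per worker, for example `image[..., r0:r1, c0:c1]` on a tensor whose last two dimensions are height and width.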
Original language | English |
---|---|
Title of host publication | Proceedings of RSDHA 2021 |
Subtitle of host publication | Redefining Scalability for Diversely Heterogeneous Architectures, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 27-33 |
Number of pages | 7 |
ISBN (Electronic) | 9781665458771 |
DOIs | |
State | Published - 2021 |
Event | 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures, RSDHA 2021 - St. Louis, United States Duration: Nov 19 2021 → … |
Publication series
Name | Proceedings of RSDHA 2021: Redefining Scalability for Diversely Heterogeneous Architectures, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures, RSDHA 2021 |
---|---|
Country/Territory | United States |
City | St. Louis |
Period | 11/19/21 → … |
Funding
This research is sponsored by the AI Initiative as part of the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. This research used resources at the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility operated by the Oak Ridge National Laboratory. This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- distributed deep learning
- high resolution images
- scalability