Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications

Awais Khan, Arnab K. Paul, Christopher Zimmer, Sarp Oral, Sajal Dash, Scott Atchley, Feiyi Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

Scientific communities are increasingly adopting deep learning (DL) models in their applications to accelerate scientific discovery. However, with the rapid growth in the computing capabilities of HPC supercomputers, large-scale DL applications must spend a significant portion of training time performing I/O to a parallel storage system. Previous research has investigated optimization techniques such as prefetching and caching. Unfortunately, adopting the existing solutions on HPC supercomputers for large-scale DL training poses non-trivial challenges, including poor performance and/or failures at extreme scale, lack of portability and generality in design, complex deployment methodology, and applicability limited to a specific application or dataset. To address these challenges, we propose High-Velocity AI Cache (HVAC), a distributed read-cache layer that targets and fully exploits node-local or near node-local storage technology. HVAC seamlessly accelerates read I/O by aggregating node-local or near node-local storage, avoiding metadata lookups and file locking while preserving portability in the application code. We deploy and evaluate HVAC on 1,024 nodes (with over 6,000 NVIDIA V100 GPUs) of the Summit supercomputer. In particular, we evaluate the scalability, efficiency, accuracy, and load distribution of HVAC compared to GPFS and XFS-on-NVMe. Across four different DL applications, we observe an average 25% performance improvement over GPFS and a 9% drop against XFS-on-NVMe, which scales linearly and is considered the performance upper bound. We envision HVAC as an important caching library for upcoming HPC supercomputers such as Frontier.
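The abstract describes a distributed read cache that aggregates node-local storage while avoiding metadata lookups and file locking. A minimal sketch of that general idea, assuming a hash-based file-to-node mapping (this is illustrative only, not the actual HVAC implementation; `cache_owner` and `read_through` are hypothetical names):

```python
import hashlib

# Illustrative sketch (not the actual HVAC code): a distributed read
# cache can route each file to a fixed peer's node-local storage by
# hashing its path. Because every rank computes the same owner, reads
# need no central metadata server and no file locking.

def cache_owner(path: str, num_nodes: int) -> int:
    """Deterministically map a file path to the rank that caches it."""
    digest = hashlib.sha256(path.encode()).hexdigest()
    return int(digest, 16) % num_nodes

def read_through(path: str, my_rank: int, num_nodes: int) -> str:
    """Serve a read from the owning node's local cache."""
    owner = cache_owner(path, num_nodes)
    if owner == my_rank:
        # Hit on this node: read straight from local NVMe.
        return f"local NVMe read of {path}"
    # Otherwise forward the read to the owning rank over RPC.
    return f"RPC read of {path} from rank {owner}"
```

Since the owner is a pure function of the path, any training process can locate cached data without consulting the parallel file system's metadata servers, which is the bottleneck the abstract targets.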

Original language: English
Title of host publication: Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 324-335
Number of pages: 12
ISBN (Electronic): 9781665498562
DOIs
State: Published - 2022
Event: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 - Heidelberg, Germany
Duration: Sep 6, 2022 - Sep 9, 2022

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2022-September
ISSN (Print): 1552-5244

Conference

Conference: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Country/Territory: Germany
City: Heidelberg
Period: 09/6/22 - 09/9/22

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • Caching and I/O Optimizations
  • Deep Learning
  • High-Performance Computing (HPC)
