Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

12 Scopus citations

Abstract

High performance computing (HPC) is no longer limited to traditional workloads such as simulation and modeling. As machine learning (ML) and deep learning (DL) technologies grow in popularity, an increasing number of HPC users are incorporating ML methods into their workflows and scientific discovery processes across a wide spectrum of science domains such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The I/O characteristics of such ML workloads have not been studied extensively for large-scale leadership HPC systems. This paper aims to fill that gap with an in-depth analysis of the I/O behavior of ML workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs collected over one year on Summit, the second-fastest supercomputer in the world. The paper provides a systematic I/O characterization of ML jobs running on a leadership-scale supercomputer to understand how I/O behavior differs across science domains and workload scales, and to analyze how ML workloads use the parallel file system and the burst buffer.

Original language: English
Title of host publication: Proceedings - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665458382
State: Published - 2021
Event: 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021 - Houston, United States
Duration: Nov 3 2021 to Nov 5 2021

Publication series

Name: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
ISSN (Print): 1526-7539

Conference

Conference: 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Country/Territory: United States
City: Houston
Period: 11/3/21 to 11/5/21

Funding

We would like to thank Hyogi Sim for his suggestions and input on the paper. This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725.

Keywords

  • Burst Buffer
  • Darshan
  • HPC Storage
  • High Performance Computing
  • I/O Characterization
  • IBM Spectrum Scale
  • Machine Learning
  • Parallel File System
