Applying machine learning to understand write performance of large-scale parallel filesystems

Bing Xie, Zilong Tan, Philip Carns, Jeff Chase, Kevin Harms, Jay Lofstead, Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

11 Scopus citations

Abstract

In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture and system configurations, and (3) employing Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms-Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility-to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on these platforms, respectively.

Original languageEnglish
Title of host publicationProceedings of PDSW 2019
Subtitle of host publicationIEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages30-39
Number of pages10
ISBN (Electronic)9781728160054
DOIs
StatePublished - Nov 2019
Event4th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2019 - Denver, United States
Duration: Nov 18 2019 → …

Publication series

NameProceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference4th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2019
Country/TerritoryUnited States
CityDenver
Period11/18/19 → …

Funding

This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This work used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. This work used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This work used resources of Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Keywords

  • Large-scale parallel filesystem
  • Machine learning
  • Production supercomputer
  • Write performance

Fingerprint

Dive into the research topics of 'Applying machine learning to understand write performance of large-scale parallel filesystems'. Together they form a unique fingerprint.

Cite this