Interpreting write performance of supercomputer I/O systems with regression models

Bing Xie, Zilong Tan, Philip Carns, Jeff Chase, Kevin Harms, Jay Lofstead, Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

9 Scopus citations

Abstract

This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to: (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configurations; (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both of the target systems. We also explore use of the models to guide adaptation in I/O middleware systems, and show potential for improvements of at least 15% from model-guided adaptation on 70% of samples, and improvements up to 10 × on some samples for both of the target systems.

Original languageEnglish
Title of host publicationProceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages557-566
Number of pages10
ISBN (Electronic)9781665440660
DOIs
StatePublished - May 2021
Event35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021 - Virtual, Online
Duration: May 17 2021May 21 2021

Publication series

NameProceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

Conference

Conference35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
CityVirtual, Online
Period05/17/2105/21/21

Funding

ACKNOWLEDGMENTS This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Keywords

  • High performance computing
  • I/O performance
  • Machine learning

Fingerprint

Dive into the research topics of 'Interpreting write performance of supercomputer I/O systems with regression models'. Together they form a unique fingerprint.

Cite this