Abstract
In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms, Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility, to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on these platforms, respectively.
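The accuracy figure quoted in the abstract (the share of test samples with at most 30% relative error) can be illustrated with a minimal Python sketch. It assumes relative error is defined as |predicted − measured| / measured, which the abstract does not state explicitly. The function names and sample values are hypothetical and are not taken from the paper. The sketch covers only the evaluation metric, not the Lasso model itself.

```python
def relative_errors(measured, predicted):
    """Per-sample relative error, assumed here to be |predicted - measured| / measured.

    Measured values must be positive (e.g. write times or bandwidths).
    """
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return [abs(p - m) / m for m, p in zip(measured, predicted)]


def fraction_within(measured, predicted, threshold=0.30):
    """Share of samples whose relative error does not exceed the threshold."""
    errors = relative_errors(measured, predicted)
    return sum(e <= threshold for e in errors) / len(errors)


# Made-up values for illustration only (not data from the paper).
measured = [10.0, 20.0, 40.0, 80.0]
predicted = [11.0, 27.0, 38.0, 50.0]

share = fraction_within(measured, predicted)
print(f"{share:.2%} of samples are within 30% relative error")
```

Under this assumed definition, a reported value such as 92.79% corresponds to `fraction_within(...)` returning roughly 0.9279 on the test set.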
Original language | English |
---|---|
Title of host publication | Proceedings of PDSW 2019 |
Subtitle of host publication | IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 30-39 |
Number of pages | 10 |
ISBN (Electronic) | 9781728160054 |
State | Published - Nov 2019 |
Event | 4th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2019 - Denver, United States; Duration: Nov 18 2019 → … |
Publication series
Name | Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 4th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2019 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/18/19 → … |
Funding
This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This work used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. This work used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This work used resources of Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
Keywords
- Large-scale parallel filesystem
- Machine learning
- Production supercomputer
- Write performance