Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations

Abstract

Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activity per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system. They can be used to develop predictive methods for making decisions, adjusting scheduling policies, or guiding the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts raises two concerns: (i) sharing the traces is expensive because of their large size, and (ii) the traces raise privacy concerns. We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. Combining the two generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.

Original language: English
Title of host publication: HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 199-212
Number of pages: 14
ISBN (Electronic): 9781450391993
DOIs
State: Published - Jun 27 2022
Event: 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022 - Virtual, Online, United States
Duration: Jun 27 2022 to Jun 30 2022

Publication series

Name: HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022
Country/Territory: United States
City: Virtual, Online
Period: 06/27/22 to 06/30/22

Funding

This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We thank Sarah Neuwirth for Figure 3. We are also extremely thankful to the reviewers and our shepherd, Peter Dinda, for their valuable feedback.

Funders and funder numbers
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science

    Keywords

    • clustering
    • darshan
    • feature selection
    • generative modeling
    • parallel file system
