Machine learning based parallel I/O predictive modeling: A case study on Lustre file systems

Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert Latham, Robert Ross, Shane Snyder, Stefan M. Wild

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

15 Scopus citations

Abstract

Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of application parameters on performance to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.

Original language: English
Title of host publication: High Performance Computing - 33rd International Conference, ISC High Performance 2018, Proceedings
Editors: Michele Weiland, David Keyes, Carsten Trinitis, Rio Yokota
Publisher: Springer Verlag
Pages: 184-204
Number of pages: 21
ISBN (Print): 9783319920399
State: Published - 2018
Externally published: Yes
Event: 33rd International Conference on ISC High Performance, 2018 - Frankfurt, Germany
Duration: Jun 24 2018 to Jun 28 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10876 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 33rd International Conference on ISC High Performance, 2018
Country/Territory: Germany
City: Frankfurt
Period: 06/24/18 to 06/28/18

Funding

Acknowledgment. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Funders                                   Funder number
U.S. Department of Energy
Office of Science                         DE-AC02-05CH11231
Advanced Scientific Computing Research    DE-AC02-06CH11357

Keywords

• I/O performance variability
• Machine learning
• Parallel file systems
• Robust Gaussian process regression
