TY - GEN
T1 - Machine learning based parallel I/O predictive modeling
T2 - 33rd International Conference on ISC High Performance, 2018
AU - Madireddy, Sandeep
AU - Balaprakash, Prasanna
AU - Carns, Philip
AU - Latham, Robert
AU - Ross, Robert
AU - Snyder, Shane
AU - Wild, Stefan M.
N1 - Publisher Copyright: © 2018, Springer International Publishing AG, part of Springer Nature.
PY - 2018
Y1 - 2018
AB - Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of application parameters on performance to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
KW - I/O performance variability
KW - Machine learning
KW - Parallel file systems
KW - Robust Gaussian process regression
UR - http://www.scopus.com/inward/record.url?scp=85048568864&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-92040-5_10
DO - 10.1007/978-3-319-92040-5_10
M3 - Conference contribution
AN - SCOPUS:85048568864
SN - 9783319920399
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 184
EP - 204
BT - High Performance Computing - 33rd International Conference, ISC High Performance 2018, Proceedings
A2 - Weiland, Michele
A2 - Keyes, David
A2 - Trinitis, Carsten
A2 - Yokota, Rio
PB - Springer Verlag
Y2 - 24 June 2018 through 28 June 2018
ER -