TY - GEN
T1 - Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems
AU - Paul, Arnab K.
AU - Choi, Jong Youl
AU - Karimi, Ahmad Maroof
AU - Wang, Feiyi
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
N2 - Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important in maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide high-level information and fine-grained activities per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system. It can be used to develop predictive methods for making predictive decisions, adjusting scheduling policies, or providing decisions for the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts leaves a few concerns; i) the cost of sharing the large traces is expensive due to this large size, and ii) privacy concern is an issue. We address such issues by building an end-to-end machine learn- ing (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML based feature selection and gener- ative models for I/O trace generation. The generative models are trained on I/O traces collected by the darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. The combination of two-step generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
AB - Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important in maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide high-level information and fine-grained activities per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system. It can be used to develop predictive methods for making predictive decisions, adjusting scheduling policies, or providing decisions for the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts leaves a few concerns; i) the cost of sharing the large traces is expensive due to this large size, and ii) privacy concern is an issue. We address such issues by building an end-to-end machine learn- ing (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML based feature selection and gener- ative models for I/O trace generation. The generative models are trained on I/O traces collected by the darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. The combination of two-step generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
KW - clustering
KW - darshan
KW - feature selection
KW - generative modeling
KW - parallel file system
UR - http://www.scopus.com/inward/record.url?scp=85134156494&partnerID=8YFLogxK
U2 - 10.1145/3502181.3531457
DO - 10.1145/3502181.3531457
M3 - Conference contribution
AN - SCOPUS:85134156494
T3 - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
SP - 199
EP - 212
BT - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022
Y2 - 27 June 2022 through 30 June 2022
ER -