Abstract
Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step in order to better utilize an HPC system. While expert insight is indispensable, I/O experts are in rare supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful to both application developers who can use them to improve their codes and to facility operators who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand the reasons why common methods underperform on this specific problem domain, and how to build models that better generalize on unseen data. We show that commonly used cross-validation testing yields sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next, we explore limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Then we show that by using our new test sets, we can better discriminate different architectures of ML models in terms of how well they generalize.
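As an illustration of the idea of estimating contention noise from repeated runs, the following is a minimal sketch and not the procedure used in the paper. It assumes that each job record carries a hypothetical application identifier and a measured throughput, and it uses the pooled within-application standard deviation of log10 throughput as the noise estimate. The function name `contention_noise` and the record layout are assumptions made for this example only.

```python
import math
from collections import defaultdict


def contention_noise(runs):
    """Pooled within-application std. dev. of log10 throughput.

    `runs` is an iterable of (app_id, throughput) pairs; both the field
    names and the choice of log10 are assumptions for this sketch.
    Runs of the same application are treated as repeats whose spread
    reflects system noise rather than application behavior.
    """
    groups = defaultdict(list)
    for app_id, throughput in runs:
        groups[app_id].append(math.log10(throughput))

    squared_dev = 0.0  # sum of squared deviations from each group mean
    dof = 0            # pooled degrees of freedom
    for values in groups.values():
        if len(values) < 2:
            continue  # a single run says nothing about run-to-run spread
        group_mean = sum(values) / len(values)
        squared_dev += sum((v - group_mean) ** 2 for v in values)
        dof += len(values) - 1

    if dof == 0:
        raise ValueError("need at least one application with repeated runs")
    return math.sqrt(squared_dev / dof)


if __name__ == "__main__":
    # Made-up throughputs, for illustration only.
    sample = [("app_a", 120.0), ("app_a", 95.0), ("app_b", 40.0), ("app_b", 44.0)]
    print(contention_noise(sample))
```

Under these assumptions, the returned value can be read as a rough floor on the error that any throughput model could reach on the same data, since repeated runs of one application differ only in conditions the model cannot observe.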
| Original language | English |
|---|---|
| Title of host publication | Proceedings of ROSS 2020 |
| Subtitle of host publication | 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 41-49 |
| Number of pages | 9 |
| ISBN (Electronic) | 9781665422680 |
| State | Published - Nov 2020 |
| Externally published | Yes |
| Event | 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020 - Virtual, Atlanta, United States; Duration: Nov 13 2020 → … |
Publication series
| Name | Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis |
|---|---|
Conference
| Conference | 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020 |
|---|---|
| Country/Territory | United States |
| City | Virtual, Atlanta |
| Period | 11/13/20 → … |
Funding
This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.
Keywords
- High-Performance Computing
- I/O Analysis
- Machine Learning
- Optimization