TY - GEN
T1 - HPC I/O throughput bottleneck analysis with explainable local models
AU - Isakov, Mihailo
AU - Rosario, Eliakin Del
AU - Madireddy, Sandeep
AU - Balaprakash, Prasanna
AU - Carns, Philip
AU - Ross, Robert B.
AU - Kinsy, Michel A.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years' worth of Darshan logs from the Argonne Leadership Computing Facility's Theta supercomputer in order to understand causes of poor I/O throughput. We present Gauge: A data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding behaviors of clusters of jobs, and interpreting I/O bottlenecks. We find groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and analyze these groups instead of individual jobs, allowing us to reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study where a system owner using Gauge was able to arrive at several clusters that do not conform to conventional I/O behaviors, as well as find several potential improvements, both on the application level and the system level.
KW - HPC
KW - I/O
KW - clustering
KW - diagnostics
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85099562280&partnerID=8YFLogxK
U2 - 10.1109/SC41405.2020.00037
DO - 10.1109/SC41405.2020.00037
M3 - Conference contribution
AN - SCOPUS:85099562280
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2020
PB - IEEE Computer Society
T2 - 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
Y2 - 9 November 2020 through 19 November 2020
ER -