TY - GEN
T1 - Gauge
T2 - 5th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2020
AU - Del Rosario, Eliakin
AU - Currier, Mikaela
AU - Isakov, Mihailo
AU - Madireddy, Sandeep
AU - Balaprakash, Prasanna
AU - Carns, Philip
AU - Ross, Robert B.
AU - Harms, Kevin
AU - Snyder, Shane
AU - Kinsy, Michel A.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
AB - Understanding and alleviating I/O bottlenecks in HPC system workloads is difficult due to the complex, multilayered nature of HPC I/O subsystems. Even with full visibility into the jobs executed on the system, the lack of tooling makes debugging I/O problems difficult. In this work, we introduce Gauge, an interactive, data-driven, web-based visualization tool for HPC I/O performance analysis. Gauge helps users interactively visualize and analyze large sets of HPC application execution logs. It performs a number of functions meant to significantly reduce the cognitive load of navigating these sets, some of which span many years of HPC logs. For instance, as the first step in many processing chains, it arranges unordered sets of collected HPC logs into a hierarchy of clusters for later analysis. This clustering step allows application developers to quickly navigate logs, see how their jobs compare to those of their peers in terms of I/O utilization, and learn how to improve their future runs. Similarly, facility operators can use Gauge to 'get a pulse' on the workloads running on their HPC systems, find clusters of underperforming applications, and diagnose the reason for poor I/O throughput. In this work, we describe how Gauge arrives at its clustering of HPC jobs, how it presents data about the jobs, and how it can be used to further narrow down and understand the behavior of sets of jobs. We also provide a case study on using Gauge from the perspective of a facility operator.
KW - Clustering
KW - High-Performance Computing
KW - I/O Analysis
KW - Machine Learning
KW - Visualization
UR - http://www.scopus.com/inward/record.url?scp=85099570665&partnerID=8YFLogxK
U2 - 10.1109/PDSW51947.2020.00008
DO - 10.1109/PDSW51947.2020.00008
M3 - Conference contribution
AN - SCOPUS:85099570665
T3 - Proceedings of PDSW 2020: IEEE/ACM 5th International Parallel Data Systems Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 15
EP - 21
BT - Proceedings of PDSW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 12 November 2020
ER -