Gauge: An Interactive Data-Driven Visualization Tool for HPC Application I/O Performance Analysis

Eliakin Del Rosario, Mikaela Currier, Mihailo Isakov, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Kevin Harms, Shane Snyder, Michel A. Kinsy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Understanding and alleviating I/O bottlenecks in HPC system workloads is difficult due to the complex, multilayered nature of HPC I/O subsystems. Even with full visibility into the jobs executed on the system, the lack of tooling makes debugging I/O problems difficult. In this work, we introduce Gauge, an interactive, data-driven, web-based visualization tool for HPC I/O performance analysis. Gauge helps users interactively visualize and analyze large sets of HPC application execution logs. It performs a number of functions meant to significantly reduce the cognitive load of navigating these sets, some of which span many years of HPC logs. For instance, as the first step in many processing chains, it arranges unordered sets of collected HPC logs into a hierarchy of clusters for later analysis. This clustering step allows application developers to quickly navigate logs, see how their jobs compare to those of their peers in terms of I/O utilization, and learn how to improve their future runs. Similarly, facility operators can use Gauge to 'get a pulse' on the workloads running on their HPC systems, find clusters of underperforming applications, and diagnose the reasons for poor I/O throughput. In this work, we describe how Gauge arrives at the HPC job clustering, how it presents data about the jobs, and how it can be used to further narrow down and understand the behavior of sets of jobs. We also provide a case study on using Gauge from the perspective of a facility operator.

Original language: English
Title of host publication: Proceedings of PDSW 2020
Subtitle of host publication: IEEE/ACM 5th International Parallel Data Systems Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 15-21
Number of pages: 7
ISBN (Electronic): 9781665415941
State: Published - Nov 2020
Externally published: Yes
Event: 5th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2020 - Virtual, Atlanta, United States
Duration: Nov 12 2020 → …

Publication series

Name: Proceedings of PDSW 2020: IEEE/ACM 5th International Parallel Data Systems Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 5th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/12/20 → …

Funding

This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.

Funders (funder number, where listed):
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research: DE-AC02-06CH11357
Argonne National Laboratory

Keywords

• Clustering
• High-Performance Computing
• I/O Analysis
• Machine Learning
• Visualization
