Characterization and identification of HPC applications at leadership computing facility

Zhengchun Liu, Ryan Lewis, Rajkumar Kettimuthu, Kevin Harms, Philip Carns, Nageswara Rao, Ian Foster, Michael E. Papka

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

13 Scopus citations

Abstract

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive but essential for running large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation. To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of various logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O. Based on holistic insights into the applications, gained through combined analysis of multiple logs from different perspectives, and on general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
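The idea of combining several logs into per-application features can be illustrated with a minimal sketch. This is not the authors' pipeline: the log layouts, the field names (`job_id`, `runtime_s`, `mpi_time_s`, `io_time_s`, `nodes`), and the choice of features are all hypothetical, and treating a missing log record as zero time is an assumption made only for this example.

```python
from collections import defaultdict

# Hypothetical per-job records from three separate logs, keyed by job ID.
# All field names and values are illustrative, not taken from the paper.
scheduler_log = [
    {"job_id": 1, "runtime_s": 1000.0, "nodes": 512},
    {"job_id": 2, "runtime_s": 400.0, "nodes": 128},
]
mpi_log = [
    {"job_id": 1, "mpi_time_s": 450.0},
    {"job_id": 2, "mpi_time_s": 100.0},
]
io_log = [
    {"job_id": 1, "io_time_s": 20.0},
    # Job 2 has no I/O record in this toy data set.
]


def join_logs(*logs):
    """Merge records from several logs into one dict per job ID."""
    merged = defaultdict(dict)
    for log in logs:
        for record in log:
            merged[record["job_id"]].update(record)
    return dict(merged)


def fingerprint(job):
    """Build a small feature dict for one job.

    Assumption: a job with no MPI or I/O record spent zero time there.
    """
    runtime = job["runtime_s"]
    return {
        "nodes": job["nodes"],
        "mpi_fraction": job.get("mpi_time_s", 0.0) / runtime,
        "io_fraction": job.get("io_time_s", 0.0) / runtime,
    }


jobs = join_logs(scheduler_log, mpi_log, io_log)
# Keep only jobs that have a scheduler record, since runtime is required.
features = {
    job_id: fingerprint(job)
    for job_id, job in jobs.items()
    if "runtime_s" in job
}

for job_id, feats in sorted(features.items()):
    print(job_id, feats)
```

Feature dictionaries of this kind could then be turned into numeric vectors and passed to a dimensionality-reduction or classification library; which features are actually informative is a question the paper itself addresses.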

Original language: English
Title of host publication: Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450379830
DOIs
State: Published - Jun 29 2020
Event: 34th ACM International Conference on Supercomputing, ICS 2020 - Barcelona, Spain
Duration: Jun 29 2020 to Jul 2 2020

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 34th ACM International Conference on Supercomputing, ICS 2020
Country/Territory: Spain
City: Barcelona
Period: 06/29/20 to 07/2/20

Funding

This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357. The datasets used in this research were generated from resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We thank Doug Waldron and Sudheer Chunduri, both from Argonne Leadership Computing Facility, for providing dataset descriptions. We also would like to thank the four anonymous reviewers for their helpful comments.

Funders (Funder number)
DOE Office of Science
U.S. Department of Energy
Office of Science (DE-AC02-06CH11357)

    Keywords

    • application identification
    • characterization
    • high performance computing
    • logs data mining