Scalable performance awareness for in situ scientific applications

Matthew Wolf, Jong Choi, Greg Eisenhauer, Stephane Ethier, Kevin Huck, Scott Klasky, Jeremy Logan, Allen Malony, Chad Wood, Julien Dominski, Gabriele Merlo

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Scopus citations

Abstract

Part of the promise of exascale computing and the next generation of scientific simulation codes is the ability to bring together time and spatial scales that have traditionally been treated separately. This enables creating complex coupled simulations and in situ analysis pipelines, encompassing such things as 'whole device' fusion models or the simulation of cities from sewers to rooftops. Unfortunately, the HPC analysis tools that have been built up over the preceding decades are ill-suited to the debugging and performance analysis of such computational ensembles. In this paper, we present a new vision for performance measurement and understanding of HPC codes, Monitoring Analytics (MONA). MONA is designed to be a flexible, high-performance monitoring infrastructure that can perform monitoring analysis in place or in transit by embedding analytics and characterization directly into the data stream, without relying upon delivering all monitoring information to a central database for post-processing. It addresses the trade-off between the prohibitively expensive capture of all performance characteristics and not capturing enough to detect the features of interest. We demonstrate several uses of MONA: capturing and indexing multi-executable performance profiles to enable later processing, extracting performance primitives to enable the generation of customizable benchmarks and performance skeletons, and extracting communication and application behaviors to enable better control and placement for the current and future runs of the science ensemble. Relevant performance information, based on a system for MONA built from ADIOS and SOSflow technologies, is provided for DOE science applications and leadership machines.

Original language: English
Title of host publication: Proceedings - IEEE 15th International Conference on eScience, eScience 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 266-276
Number of pages: 11
ISBN (Electronic): 9781728124513
DOIs
State: Published - Sep 2019
Event: 15th IEEE International Conference on eScience, eScience 2019 - San Diego, United States
Duration: Sep 24 2019 to Sep 27 2019

Publication series

Name: Proceedings - IEEE 15th International Conference on eScience, eScience 2019

Conference

Conference: 15th IEEE International Conference on eScience, eScience 2019
Country/Territory: United States
City: San Diego
Period: 09/24/19 to 09/27/19

Funding

Acknowledgment: We gratefully recognize the support from the Department of Energy’s Office of Advanced Scientific Computing Research (ASCR Research) for enabling this work. Additionally, this research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center (NERSC), and the Argonne Leadership Computing Facility, which are supported by the Office of Science of the U.S. Department of Energy under Contract Nos. DE-AC05-00OR22725, DE-AC02-05CH11231, and DE-AC02-06CH11357, respectively.

Funders (with funder number where given)
ASCR Research
U.S. Department of Energy: DE-AC05-00OR22725, DE-AC02-05CH11231, DE-AC02-06CH11357
Office of Science
Advanced Scientific Computing Research
National Energy Research Scientific Computing Center

    Keywords

    • Analysis
    • I/O miniapp generation
    • In situ
    • Online
    • Performance variability
    • Process placement
    • Runtime performance monitoring
