TY - GEN
T1 - Scalable performance awareness for in situ scientific applications
AU - Wolf, Matthew
AU - Choi, Jong
AU - Eisenhauer, Greg
AU - Ethier, Stephane
AU - Huck, Kevin
AU - Klasky, Scott
AU - Logan, Jeremy
AU - Malony, Allen
AU - Wood, Chad
AU - Dominski, Julien
AU - Merlo, Gabriele
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
AB - Part of the promise of exascale computing and the next generation of scientific simulation codes is the ability to bring together time and spatial scales that have traditionally been treated separately. This enables creating complex coupled simulations and in situ analysis pipelines, encompassing such things as 'whole device' fusion models or the simulation of cities from sewers to rooftops. Unfortunately, the HPC analysis tools that have been built up over the preceding decades are ill-suited to the debugging and performance analysis of such computational ensembles. In this paper, we present a new vision for performance measurement and understanding of HPC codes, MonitoringAnalytics (MONA). MONA is designed to be a flexible, high-performance monitoring infrastructure that can perform monitoring analysis in place or in transit by embedding analytics and characterization directly into the data stream, without relying upon delivering all monitoring information to a central database for post-processing. It addresses the trade-offs between the prohibitively expensive capture of all performance characteristics and not capturing enough to detect the features of interest. We demonstrate several uses of MONA: capturing and indexing multi-executable performance profiles to enable later processing, extracting performance primitives to enable the generation of customizable benchmarks and performance skeletons, and extracting communication and application behaviors to enable better control and placement for the current and future runs of the science ensemble. Relevant performance information based on a system for MONA built from ADIOS and SOSflow technologies is provided for DOE science applications and leadership machines.
KW - Analysis
KW - I/O miniapp generation
KW - In situ
KW - Online
KW - Performance variability
KW - Process placement
KW - Runtime performance monitoring
UR - http://www.scopus.com/inward/record.url?scp=85083176423&partnerID=8YFLogxK
U2 - 10.1109/eScience.2019.00037
DO - 10.1109/eScience.2019.00037
M3 - Conference contribution
AN - SCOPUS:85083176423
T3 - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
SP - 266
EP - 276
BT - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on eScience, eScience 2019
Y2 - 24 September 2019 through 27 September 2019
ER -