Lightweight Measurement and Analysis of HPC Performance Variability

Jered Dominguez-Trujillo, Keira Haskins, Soheila Jafari Khouzani, Christopher Leap, Sahba Tashakkori, Quincy Wofford, Trilce Estrada, Patrick G. Bridges, Patrick M. Widener

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

5 Scopus citations

Abstract

Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
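
To illustrate the idea of tracking only the maximum length of each workload interval, the following is a minimal simulation sketch and not the authors' implementation. It assumes a hypothetical noise model (a fixed base cost plus exponentially distributed delay per process); the function names and parameters are invented for illustration. In an MPI program, the per-interval maximum would typically be obtained with a max-reduction across ranks rather than a local `max` call.

```python
import random
import statistics


def simulate_interval_maxima(num_procs, num_intervals, base=1.0, noise=0.05, seed=0):
    """Return, for each simulated interval, the longest per-process duration.

    Assumption (not from the paper): each process takes `base` seconds plus
    an exponentially distributed delay with mean `noise` seconds.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(num_intervals):
        # Hypothetical per-process durations for one workload interval.
        durations = [base + rng.expovariate(1.0 / noise) for _ in range(num_procs)]
        # Keep only the maximum: one value per interval instead of one per process.
        maxima.append(max(durations))
    return maxima


def summarize(maxima):
    """Summarize the distribution of interval maxima."""
    return {
        "mean": statistics.mean(maxima),
        "stdev": statistics.stdev(maxima),
        "worst": max(maxima),
    }


small = summarize(simulate_interval_maxima(num_procs=16, num_intervals=200))
large = summarize(simulate_interval_maxima(num_procs=1024, num_intervals=200))
print("16 processes:  ", small)
print("1024 processes:", large)
```

Under this assumed noise model, the mean interval maximum grows with the process count, which is the kind of scale-dependent effect the abstract describes; recording one maximum per interval keeps the stored data independent of the number of processes.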

Original language: English
Title of host publication: Proceedings of PMBS 2020
Subtitle of host publication: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 50-60
Number of pages: 11
ISBN (Electronic): 9781665422659
DOIs
State: Published - Nov 2020
Externally published: Yes
Event: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, PMBS 2020 - Virtual, Atlanta, United States
Duration: Nov 12 2020 → …

Publication series

Name: Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, PMBS 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/12/20 → …

Funding

This paper was supported in part by the National Science Foundation under Grant No. OAC-1807563, and by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the United States Department of Energy. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231, resources at the UNM Center for Advanced Research Computing, and from the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 through allocation ASC190036. This work was funded in part by Los Alamos National Laboratory, supported by the US Department of Energy contract DE-FC02-06ER25750 (Los Alamos Publication Number LAUR-20-28021). Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2020-9636C.

Funder: Funder number
National Science Foundation: ASC190036, ACI-1548562, OAC-1807563
U.S. Department of Energy: DE-AC02-05CH11231, DE-FC02-06ER25750, LAUR-20-28021
Office of Science
National Nuclear Security Administration: SAND2020-9636C, DE-NA0003525
Advanced Scientific Computing Research
Los Alamos National Laboratory

    Keywords

    • OS interference
    • performance modeling
    • performance variation
