Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics

Justin Thaler, Woong Shin, Steven Roberts, James H. Rogers, Todd Rosedahl

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Scopus citations

Abstract

The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering, and reacting to, a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves system up-time. In the case of performance impacts and node outages, RAS policies can direct actions such as job quiescence or migration. Additionally, power consumption, thermal information, and utilization metrics can be used to provide cluster energy and cooling efficiency improvements as well as optimized job placement. This paper describes a highly scalable telemetry architecture that allows event aggregation, application of RAS policies, and provides the ability for cluster control system feedback. The architecture advances existing approaches by including both programmable policies, which are applied as events stream through the hierarchical network to persistence storage, and treatment of sensor telemetry in an extensible framework. This implementation has proven robust and is in use in both cloud and HPC environments including the Summit system of 4,608 nodes at Oak Ridge National Laboratory [5].

Original languageEnglish
Title of host publication2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9781728192192
DOIs
StatePublished - Sep 22 2020
Event2020 IEEE High Performance Extreme Computing Conference, HPEC 2020 - Virtual, Waltham, United States
Duration: Sep 21 2020Sep 25 2020

Publication series

Name2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Conference

Conference2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
Country/TerritoryUnited States
CityVirtual, Waltham
Period09/21/2009/25/20

Funding

ACKNOWLEDGEMENTS The work was supported by, and used the resources of, the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT Battelle, LLC for the U.S. DOE (under the contract No. DE-AC05-00OR22725).

FundersFunder number
U.S. Department of EnergyDE-AC05-00OR22725

    Fingerprint

    Dive into the research topics of 'Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics'. Together they form a unique fingerprint.

    Cite this