TY - GEN
T1 - Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics
AU - Thaler, Justin
AU - Shin, Woong
AU - Roberts, Steven
AU - Rogers, James H.
AU - Rosedahl, Todd
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9/22
Y1 - 2020/9/22
N2 - The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering and reacting to a hardware or environmental issue in a timely manner enables proper fault isolation and improves both quality of service and system uptime. In the case of performance impacts and node outages, reliability, availability, and serviceability (RAS) policies can direct actions such as job quiescence or migration. Additionally, power consumption, thermal information, and utilization metrics can be used to improve cluster energy and cooling efficiency and to optimize job placement. This paper describes a highly scalable telemetry architecture that supports event aggregation, application of RAS policies, and feedback to the cluster control system. The architecture advances existing approaches by including both programmable policies, which are applied as events stream through the hierarchical network to persistent storage, and treatment of sensor telemetry in an extensible framework. This implementation has proven robust and is in use in both cloud and HPC environments, including the 4,608-node Summit system at Oak Ridge National Laboratory [5].
UR - http://www.scopus.com/inward/record.url?scp=85099387755&partnerID=8YFLogxK
U2 - 10.1109/HPEC43674.2020.9286239
DO - 10.1109/HPEC43674.2020.9286239
M3 - Conference contribution
AN - SCOPUS:85099387755
T3 - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
BT - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
Y2 - 21 September 2020 through 25 September 2020
ER -