TY - GEN
T1 - Understanding GPU Memory Corruption at Extreme Scale
T2 - 38th ACM International Conference on Supercomputing, ICS 2024
AU - Oles, Vladyslav
AU - Schmedding, Anna
AU - Ostrouchov, George
AU - Shin, Woong
AU - Smirni, Evgenia
AU - Engelmann, Christian
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/5/30
Y1 - 2024/5/30
N2 - GPU memory corruption, and in particular double-bit errors (DBEs), remains one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can cost thousands of node-hours, either through wasted computation or through the overhead of regular checkpointing needed to minimize such losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. Across the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that the workload type can be a factor in memory's propensity for corruption.
AB - GPU memory corruption, and in particular double-bit errors (DBEs), remains one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can cost thousands of node-hours, either through wasted computation or through the overhead of regular checkpointing needed to minimize such losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. Across the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that the workload type can be a factor in memory's propensity for corruption.
KW - data analysis
KW - GPU memory failures
KW - HPC
UR - http://www.scopus.com/inward/record.url?scp=85196300180&partnerID=8YFLogxK
U2 - 10.1145/3650200.3656615
DO - 10.1145/3650200.3656615
M3 - Conference contribution
AN - SCOPUS:85196300180
T3 - Proceedings of the International Conference on Supercomputing
SP - 188
EP - 200
BT - ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 4 June 2024 through 7 June 2024
ER -