TY - GEN
T1 - Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration
AU - Rojas, Elvis
AU - Perez, Diego
AU - Calhoun, Jon C.
AU - Gomez, Leonardo Bautista
AU - Jones, Terry
AU - Meneses, Esteban
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances in discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of having bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
AB - The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances in discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of having bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
KW - Checkpoint
KW - Deep learning
KW - Fault injection
KW - HDF5
KW - High-performance computing
KW - Neural networks
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=85126022957&partnerID=8YFLogxK
U2 - 10.1109/Cluster48925.2021.00045
DO - 10.1109/Cluster48925.2021.00045
M3 - Conference contribution
AN - SCOPUS:85126022957
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 492
EP - 503
BT - Proceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
Y2 - 7 September 2021 through 10 September 2021
ER -