TY - GEN
T1 - Detection and correction of silent data corruption for large-scale high-performance computing
AU - Fiala, David
AU - Mueller, Frank
AU - Engelmann, Christian
AU - Riesen, Rolf
AU - Ferreira, Kurt
AU - Brightwell, Ron
PY - 2012
Y1 - 2012
N2 - Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI messages between replicas, we study the best suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
AB - Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI messages between replicas, we study the best suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
UR - http://www.scopus.com/inward/record.url?scp=84877705582&partnerID=8YFLogxK
U2 - 10.1109/SC.2012.49
DO - 10.1109/SC.2012.49
M3 - Conference contribution
AN - SCOPUS:84877705582
SN - 9781467308069
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2012
T2 - 2012 24th International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2012
Y2 - 10 November 2012 through 16 November 2012
ER -