TY - GEN
T1 - Selective protection for sparse iterative solvers to reduce the resilience overhead
AU - Sun, Hongyang
AU - Gainaru, Ana
AU - Shantharam, Manu
AU - Raghavan, Padma
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9
Y1 - 2020/9
AB - The increasing scale and complexity of today's high-performance computing (HPC) systems demand a renewed focus on enhancing the resilience of long-running scientific applications in the presence of faults. Many of these applications are iterative in nature as they operate on sparse matrices that concern the simulation of partial differential equations (PDEs) which numerically capture the physical properties on discretized spatial domains. While these applications currently benefit from many application-agnostic resilience techniques at the system level, such as checkpointing and replication, there is significant overhead in deploying these techniques. In this paper, we seek to develop application-aware resilience techniques that leverage an iterative application's intrinsic resiliency to faults and selectively protect certain elements, thereby reducing the resilience overhead. Specifically, we investigate the impact of soft errors on the widely used Preconditioned Conjugate Gradient (PCG) method, whose reliability depends heavily on the error propagation through the sparse matrix-vector multiplication (SpMV) operation. By characterizing the performance of PCG in correlation with a numerical property of the underlying sparse matrix, we propose a selective protection scheme that protects only certain critical elements of the operation based on an analytical model. An experimental evaluation using 20 sparse matrices from the SuiteSparse Matrix Collection shows that our proposed scheme is able to reduce the resilience overhead by as much as 70.2% and an average of 32.6% compared to the baseline techniques with full-protection or zero-protection.
KW - Iterative solvers
KW - Preconditioned conjugate gradient
KW - Resilience
KW - Selective protection
KW - Soft errors
UR - http://www.scopus.com/inward/record.url?scp=85095864688&partnerID=8YFLogxK
U2 - 10.1109/SBAC-PAD49847.2020.00029
DO - 10.1109/SBAC-PAD49847.2020.00029
M3 - Conference contribution
AN - SCOPUS:85095864688
T3 - Proceedings - Symposium on Computer Architecture and High Performance Computing
SP - 141
EP - 148
BT - Proceedings - 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020
PB - IEEE Computer Society
T2 - 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020
Y2 - 8 September 2020 through 11 September 2020
ER -