TY - GEN
T1 - Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
AU - Li, Dong
AU - Chen, Zizhong
AU - Wu, Panruo
AU - Vetter, Jeffrey S.
PY - 2013
Y1 - 2013
N2 - Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of improving performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% for system energy (and up to 40% for dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
AB - Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of improving performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% for system energy (and up to 40% for dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
KW - Adaptive resilience
KW - Algorithm-based fault tolerance
KW - Error-correcting code
UR - http://www.scopus.com/inward/record.url?scp=84899682930&partnerID=8YFLogxK
U2 - 10.1145/2503210.2503226
DO - 10.1145/2503210.2503226
M3 - Conference contribution
AN - SCOPUS:84899682930
SN - 9781450323789
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2013
PB - IEEE Computer Society
T2 - 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Y2 - 17 November 2013 through 22 November 2013
ER -