TY - GEN
T1 - GPU-ABFT
T2 - 11th IEEE International Conference on Networking Architecture and Storage, NAS 2016
AU - Chen, Jieyang
AU - Li, Sihuan
AU - Chen, Zizhong
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/8/23
Y1 - 2016/8/23
N2 - For matrix operations, algorithm-based fault tolerance (ABFT) incurs much lower fault tolerance overhead than the traditional Triple Modular Redundancy or Double Modular Redundancy approaches. Much work has been done to develop and optimize ABFT schemes on general-purpose microprocessors. However, ABFT schemes on heterogeneous systems with GPUs are not fully developed and optimized. Moreover, existing ABFT schemes can correct computing errors caused by the logic parts, but many memory storage errors cannot be detected and corrected by current ABFT schemes. In this work, we design a new ABFT scheme with both computing and memory storage protection. We then apply it to Cholesky decomposition on heterogeneous systems with GPUs. In addition, we develop several fault tolerance overhead reduction techniques specifically for heterogeneous systems with GPU accelerators. Experimental results show that our ABFT scheme is able to correct both computing errors and memory storage errors with low overhead and comparable overall performance.
AB - For matrix operations, algorithm-based fault tolerance (ABFT) incurs much lower fault tolerance overhead than the traditional Triple Modular Redundancy or Double Modular Redundancy approaches. Much work has been done to develop and optimize ABFT schemes on general-purpose microprocessors. However, ABFT schemes on heterogeneous systems with GPUs are not fully developed and optimized. Moreover, existing ABFT schemes can correct computing errors caused by the logic parts, but many memory storage errors cannot be detected and corrected by current ABFT schemes. In this work, we design a new ABFT scheme with both computing and memory storage protection. We then apply it to Cholesky decomposition on heterogeneous systems with GPUs. In addition, we develop several fault tolerance overhead reduction techniques specifically for heterogeneous systems with GPU accelerators. Experimental results show that our ABFT scheme is able to correct both computing errors and memory storage errors with low overhead and comparable overall performance.
UR - https://www.scopus.com/pages/publications/84988423865
U2 - 10.1109/NAS.2016.7549404
DO - 10.1109/NAS.2016.7549404
M3 - Conference contribution
AN - SCOPUS:84988423865
T3 - 2016 IEEE International Conference on Networking Architecture and Storage, NAS 2016 - Proceedings
BT - 2016 IEEE International Conference on Networking Architecture and Storage, NAS 2016 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 8 August 2016 through 10 August 2016
ER -