TY - GEN
T1 - FTRANS: Energy-efficient acceleration of transformers using FPGA
T2 - 2020 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED 2020
AU - Li, Bingbing
AU - Pandey, Santosh
AU - Fang, Haowen
AU - Lyv, Yanjun
AU - Li, Ji
AU - Chen, Jieyang
AU - Xie, Mimi
AU - Wan, Lipeng
AU - Liu, Hang
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/8/10
Y1 - 2020/8/10
N2 - In natural language processing (NLP), the "Transformer" architecture was proposed as the first transduction model relying entirely on self-attention mechanisms, without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements on sequence-to-sequence tasks. The intensive computation and storage these pre-trained language representations introduce have impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms because of its high parallelism and low latency. However, the trained models are still too large to fit onto an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression of large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07× and 81× improvements in performance and energy efficiency, respectively, compared with a CPU, and up to an 8.80× improvement in energy efficiency compared with a GPU.
AB - In natural language processing (NLP), the "Transformer" architecture was proposed as the first transduction model relying entirely on self-attention mechanisms, without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements on sequence-to-sequence tasks. The intensive computation and storage these pre-trained language representations introduce have impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms because of its high parallelism and low latency. However, the trained models are still too large to fit onto an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression of large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07× and 81× improvements in performance and energy efficiency, respectively, compared with a CPU, and up to an 8.80× improvement in energy efficiency compared with a GPU.
UR - http://www.scopus.com/inward/record.url?scp=85098261612&partnerID=8YFLogxK
U2 - 10.1145/3370748.3406567
DO - 10.1145/3370748.3406567
M3 - Conference contribution
AN - SCOPUS:85098261612
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED 2020
PB - Association for Computing Machinery
Y2 - 10 August 2020 through 12 August 2020
ER -