TY - GEN
T1 - TSM2
T2 - 33rd ACM International Conference on Supercomputing, ICS 2019, held in conjunction with the Federated Computing Research Conference, FCRC 2019
AU - Chen, Jieyang
AU - Xiong, Nan
AU - Liang, Xin
AU - Tao, Dingwen
AU - Li, Sihuan
AU - Ouyang, Kaiming
AU - Zhao, Kai
AU - Debardeleben, Nathan
AU - Guan, Qiang
AU - Chen, Zizhong
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/6/26
Y1 - 2019/6/26
N2 - Linear algebra operations have been widely used in big data analytics and scientific computations. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully utilize the available memory bandwidth and computing power, and therefore achieve only sub-optimal performance. In this paper, we propose TSM2, a performant tall-and-skinny matrix-matrix multiplication algorithm on GPUs. It focuses on optimizing linear algebra operations with non-regular-shaped input. We implement the proposed algorithm and test it on three different Nvidia GPU micro-architectures: Kepler, Maxwell, and Pascal. Experiments show that TSM2 speeds up the computation by 1.1x - 3x, improves memory bandwidth utilization by 8% - 47.6%, and improves computing power utilization by 7% - 37.3% compared to the current state-of-the-art works. We replace the original matrix operations in K-means and Algorithm-Based Fault Tolerance (ABFT) with TSM2 and achieve up to 1.89x and 1.90x speedup, respectively.
AB - Linear algebra operations have been widely used in big data analytics and scientific computations. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully utilize the available memory bandwidth and computing power, and therefore achieve only sub-optimal performance. In this paper, we propose TSM2, a performant tall-and-skinny matrix-matrix multiplication algorithm on GPUs. It focuses on optimizing linear algebra operations with non-regular-shaped input. We implement the proposed algorithm and test it on three different Nvidia GPU micro-architectures: Kepler, Maxwell, and Pascal. Experiments show that TSM2 speeds up the computation by 1.1x - 3x, improves memory bandwidth utilization by 8% - 47.6%, and improves computing power utilization by 7% - 37.3% compared to the current state-of-the-art works. We replace the original matrix operations in K-means and Algorithm-Based Fault Tolerance (ABFT) with TSM2 and achieve up to 1.89x and 1.90x speedup, respectively.
KW - GEMM
KW - GPU
KW - Matrix-matrix multiplication
KW - Optimization
KW - Tall-and-skinny
UR - http://www.scopus.com/inward/record.url?scp=85074515967&partnerID=8YFLogxK
U2 - 10.1145/3330345.3330355
DO - 10.1145/3330345.3330355
M3 - Conference contribution
AN - SCOPUS:85074515967
T3 - Proceedings of the International Conference on Supercomputing
SP - 106
EP - 116
BT - ICS 2019 - International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 26 June 2019
ER -