TY - GEN
T1 - Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters
AU - Yamazaki, Ichitaro
AU - Abdelfattah, Ahmad
AU - Ida, Akihiro
AU - Ohshima, Satoshi
AU - Tomov, Stanimire
AU - Yokota, Rio
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/8/3
Y1 - 2018/8/3
N2 - HACApK is a software package for solving dense linear systems of equations and is used in other software packages, like ppohBEM for solving boundary integral equations. To enable the solution of large-scale boundary value problems, HACApK hierarchically compresses the coefficient matrix and uses the BiConjugate Gradient Stabilized (BiCGStab) method for solving the linear system. To extend HACApK's capability, this paper outlines how we ported the HACApK linear solver onto GPU clusters. Though the potential of GPUS has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUS for a solver, like HACApK, that requires fine-grained irregular computation and global communication. To utilize the GPUS, we integrated the variable-size batched GPU kernel that was recently released in the MAGMA software package. This is the first time the variable-size batched kernels were used in a solver or application code. We discuss several techniques to improve the performance of the batched kernel and demonstrate the effects of these techniques on two state-of-The-Art GPU clusters. For instance, with two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUS per node, the GPU kernel obtained a solver speedup of 8× on one node and 4× on eight nodes. We also show that when the inter-GPU communication becomes significant, the solution time can be further reduced by a factor of 2× by carefully designing the communication layer with the underlying node architecture in mind.
AB - HACApK is a software package for solving dense linear systems of equations and is used in other software packages, like ppohBEM for solving boundary integral equations. To enable the solution of large-scale boundary value problems, HACApK hierarchically compresses the coefficient matrix and uses the BiConjugate Gradient Stabilized (BiCGStab) method for solving the linear system. To extend HACApK's capability, this paper outlines how we ported the HACApK linear solver onto GPU clusters. Though the potential of GPUS has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUS for a solver, like HACApK, that requires fine-grained irregular computation and global communication. To utilize the GPUS, we integrated the variable-size batched GPU kernel that was recently released in the MAGMA software package. This is the first time the variable-size batched kernels were used in a solver or application code. We discuss several techniques to improve the performance of the batched kernel and demonstrate the effects of these techniques on two state-of-The-Art GPU clusters. For instance, with two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUS per node, the GPU kernel obtained a solver speedup of 8× on one node and 4× on eight nodes. We also show that when the inter-GPU communication becomes significant, the solution time can be further reduced by a factor of 2× by carefully designing the communication layer with the underlying node architecture in mind.
KW - Distributed memory
KW - GPU
KW - Hierarchical Matrix
KW - Krylov
UR - http://www.scopus.com/inward/record.url?scp=85052235105&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2018.00102
DO - 10.1109/IPDPS.2018.00102
M3 - Conference contribution
AN - SCOPUS:85052235105
SN - 9781538643686
T3 - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
SP - 930
EP - 939
BT - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018
Y2 - 21 May 2018 through 25 May 2018
ER -