TY - GEN
T1 - High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs
AU - Beams, Natalie
AU - Abdelfattah, Ahmad
AU - Tomov, Stan
AU - Dongarra, Jack
AU - Kolev, Tzanio
AU - Dudouit, Yohann
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% in some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMM implementation is twice as fast as what was previously available for basis function orders greater than five and more than approximately 10^5 degrees of freedom in the mesh; up to a tenfold speedup is seen for eighth-order basis functions.
AB - We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% in some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMM implementation is twice as fast as what was previously available for basis function orders greater than five and more than approximately 10^5 degrees of freedom in the mesh; up to a tenfold speedup is seen for eighth-order basis functions.
KW - GPU
KW - Tensor contractions
KW - batched linear algebra
KW - finite elements
KW - high-order methods
KW - matrix-free FEM
UR - http://www.scopus.com/inward/record.url?scp=85101253885&partnerID=8YFLogxK
U2 - 10.1109/ScalA51936.2020.00012
DO - 10.1109/ScalA51936.2020.00012
M3 - Conference contribution
AN - SCOPUS:85101253885
T3 - Proceedings of ScalA 2020: 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 53
EP - 60
BT - Proceedings of ScalA 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2020
Y2 - 13 November 2020 through 13 November 2020
ER -