TY - GEN
T1 - Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization
AU - Clark, M. A.
AU - Joo, Balint
AU - Strelchenko, Alexei
AU - Cheng, Michael
AU - Gambhir, Arjun
AU - Brower, Richard C.
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due to both significant progress in accelerating the iterative linear solvers using multigrid algorithms, and due to the throughput improvements brought by GPUs. Deploying hierarchical algorithms optimally on GPUs is non-trivial owing to the lack of parallelism on the coarse grids, and as such, these advances have not proved multiplicative. Using the QUDA library, we demonstrate that by exposing all sources of parallelism that the underlying stencil problem possesses, and through appropriate mapping of this parallelism to the GPU architecture, we can achieve high efficiency even for the coarsest of grids. Results are presented for the Wilson-Clover discretization, where we demonstrate up to 10x speedup over present state-of-the-art GPU-accelerated methods on Titan. Finally, we look to the future, and consider the software implications of our findings.
AB - The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due to both significant progress in accelerating the iterative linear solvers using multigrid algorithms, and due to the throughput improvements brought by GPUs. Deploying hierarchical algorithms optimally on GPUs is non-trivial owing to the lack of parallelism on the coarse grids, and as such, these advances have not proved multiplicative. Using the QUDA library, we demonstrate that by exposing all sources of parallelism that the underlying stencil problem possesses, and through appropriate mapping of this parallelism to the GPU architecture, we can achieve high efficiency even for the coarsest of grids. Results are presented for the Wilson-Clover discretization, where we demonstrate up to 10x speedup over present state-of-the-art GPU-accelerated methods on Titan. Finally, we look to the future, and consider the software implications of our findings.
UR - http://www.scopus.com/inward/record.url?scp=85017203388&partnerID=8YFLogxK
U2 - 10.1109/SC.2016.67
DO - 10.1109/SC.2016.67
M3 - Conference contribution
AN - SCOPUS:85017203388
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
SP - 795
EP - 806
BT - Proceedings of SC 2016
PB - IEEE Computer Society
T2 - 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Y2 - 13 November 2016 through 18 November 2016
ER -