Abstract
We present an implementation of all-electron density-functional theory for massively parallel GPU-based platforms, using localized atom-centered basis functions and real-space integration grids. Special attention is paid to domain decomposition of the problem on non-uniform grids, which enables compute- and memory-parallel execution across thousands of nodes for real-space operations, e.g., the update of the electron density, the integration of the real-space Hamiltonian matrix, and the calculation of Pulay forces. To assess the performance of our GPU implementation, we performed benchmarks on three different architectures using a 103-material test set. We find that operations which rely on dense serial linear algebra show dramatic speedups from GPU acceleration: in particular, SCF iterations including force and stress calculations exhibit speedup factors ranging from 4.5 to 6.6. For the architectures and problem types investigated here, this translates to an expected overall speedup between 3 and 4 for the entire calculation (including non-GPU-accelerated parts) for problems featuring several tens to hundreds of atoms. Additional calculations for a 375-atom Bi2Se3 bilayer show that the present GPU strategy scales to large-scale distributed-parallel simulations.
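To illustrate why the real-space operations named above map well onto GPUs, the following is a minimal NumPy sketch, not the paper's actual distributed Fortran/CUDA implementation, of batch-wise real-space Hamiltonian integration: grid points are grouped into spatially compact batches, and each batch's contribution to the Hamiltonian matrix reduces to a dense matrix product that a GPU GEMM can accelerate. All function names, array shapes, and the toy data below are illustrative assumptions.

```python
# Illustrative sketch only: batch-wise real-space integration of a local
# Hamiltonian, H_ij = sum_r w(r) * phi_i(r) * V_eff(r) * phi_j(r),
# accumulated one grid batch at a time via dense matrix products.
import numpy as np

def integrate_hamiltonian_batches(batches, n_basis):
    """Accumulate the Hamiltonian matrix over a list of grid batches.

    Each batch is a tuple (phi, v, w), where (shapes are assumptions):
      phi : (n_points, n_basis) basis-function values at the batch's grid points
      v   : (n_points,)         effective potential at those points
      w   : (n_points,)         integration weights
    """
    H = np.zeros((n_basis, n_basis))
    for phi, v, w in batches:
        # Scale each grid point's row by its weighted potential, then one
        # dense GEMM per batch; on a GPU this GEMM is the accelerated kernel.
        H += phi.T @ ((w * v)[:, None] * phi)
    return H

# Toy usage: two random batches of 50 grid points, 8 basis functions.
rng = np.random.default_rng(0)
batches = [
    (rng.standard_normal((50, 8)), rng.standard_normal(50), rng.random(50))
    for _ in range(2)
]
H = integrate_hamiltonian_batches(batches, n_basis=8)
print(H.shape)  # (8, 8); symmetric up to floating-point rounding
```

In a distributed setting, each MPI rank would own a disjoint set of batches (the domain decomposition over the non-uniform grid) and hold only the locally indexed part of H, so both the compute and the memory footprint scale with the number of nodes; the sketch above omits that layer for brevity.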
| Original language | English |
| --- | --- |
| Article number | 107314 |
| Journal | Computer Physics Communications |
| Volume | 254 |
| DOIs | |
| State | Published - Sep 2020 |
Funding
This work was supported by the LDRD Program of ORNL, USA managed by UT-Battelle, LLC, for the U.S. DOE and by the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. A portion of this work was conducted at the Center for Nanophase Materials Sciences, which is a DOE Office of Science User Facility, and supported by the Creative Materials Discovery Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (NRF-2016M3D1A1919181). We gratefully acknowledge the support of NVIDIA Corporation, USA with the donation of Quadro GP100 and Titan V GPUs used for local development, as well as access to their PSG cluster. We thank Dr. Vincenzo Lordi and the Lawrence Livermore National Laboratory (LLNL), a U.S. Department of Energy Facility, for assistance with and access to LLNL's supercomputer Lassen for benchmarks conducted in this work. The work on LLNL's Lassen supercomputer was performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344. We thank Dr. Ville Havu for useful discussions about the grid partitioning scheme used in FHI-aims. We would finally like to acknowledge the contribution of Dr. Rainer Johanni, deceased in 2012, who pioneered the distributed-parallel CPU version of the locally-indexed real-space Hamiltonian scheme that is a critical foundation of this work.
Keywords
- Density functional theory
- Domain decomposition
- Electronic structure
- GPU acceleration
- High performance computing
- Localized basis sets