Abstract
We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multi-core CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multi-core CPU only systems for such complex applications.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2015 |
Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9781450337236 |
DOIs | |
State | Published - Nov 15 2015 |
Event | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States Duration: Nov 15 2015 → Nov 20 2015 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
Volume | 15-20-November-2015 |
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 |
---|---|
Country/Territory | United States |
City | Austin |
Period | 11/15/15 → 11/20/15 |
Funding
The authors would like to thank the NSF grant #ACI-1339822 and NVIDIA, as well as the Swiss National Foundation's NCCR-MARVEL project for supporting this research effort.