Abstract
Lattice Quantum Chromodynamics (QCD) is a powerful tool for numerically accessing the low-energy regime of QCD in a straightforward way, with quantifiable uncertainties. In this approach, QCD is discretized on a four-dimensional, Euclidean space-time grid with millions of degrees of freedom. In modern lattice calculations, most of the time is still spent solving large, sparse linear systems. This task poses two challenges: optimizing the sparse matrix application and optimizing the BLAS-like kernels used in the linear solver. We present performance optimizations of the Dirac operator (dslash), with and without clover term, for recent Intel® architectures, namely Haswell and Knights Landing (KNL). We achieve a good fraction of peak performance for the Wilson-Dslash kernel and for the Conjugate Gradient (CG) and Stabilized BiConjugate Gradient (BiCGStab) solvers. We also present a series of experiments performed on KNL: running MCDRAM in different modes, enabling or disabling hardware prefetching, and using different SoA (structure-of-arrays) lengths. Furthermore, we present a weak scaling study on up to 16 KNL nodes.
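To illustrate the solver structure the abstract alludes to, the following is a minimal, generic Conjugate Gradient sketch in C++. It is not the paper's QPhiX implementation: `cg_solve`, `apply_A`, and the plain `double` vectors are illustrative assumptions (the actual solver operates on complex spinor fields and an even-odd preconditioned Dirac operator). It shows why the two challenges above cover the whole iteration cost: each step is one operator application plus a handful of BLAS-like level-1 kernels (dot products and axpy updates).

```cpp
// Minimal CG sketch (illustrative only, not the QPhiX code).
// Per iteration: one application of a Hermitian positive-definite
// operator A (standing in for M^dagger M built from dslash) plus
// memory-bound dot/axpy kernels.
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // real vectors for brevity

template <typename ApplyA>
std::size_t cg_solve(const ApplyA& apply_A, const Vec& b, Vec& x,
                     double tol, std::size_t max_iter) {
    const std::size_t n = b.size();
    Vec r(n), p(n), Ap(n);

    auto dot = [n](const Vec& u, const Vec& v) {  // BLAS-like kernel
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += u[i] * v[i];
        return s;
    };

    apply_A(x, Ap);                               // initial residual
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;

    double rr = dot(r, r);
    const double stop = tol * tol * dot(b, b);    // relative tolerance

    for (std::size_t k = 0; k < max_iter; ++k) {
        if (rr <= stop) return k;
        apply_A(p, Ap);                           // dominant cost: dslash-based operator
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {     // axpy updates
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        rr = rr_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
    }
    return max_iter;
}
```

In this picture the operator application dominates the floating-point work, while the level-1 kernels are memory-bandwidth bound; both must be tuned to reach a good fraction of peak. The SoA length mentioned above refers to how many lattice sites are packed contiguously in the innermost dimension of the structure-of-arrays field layout, which affects vectorization and cache behavior.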
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
| --- | --- |
| Volume | 9945 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | International Workshops on High Performance Computing, ISC High Performance 2016: 2nd International Workshop on Communication Architectures at Extreme Scale, ExaComm 2016; Workshop on Exascale Multi/Many Core Computing Systems, E-MuCoCoS 2016; HPC I/O in the Data Center, HPC-IODC 2016; Application Performance on Intel Xeon Phi – Being Prepared for KNL and Beyond, IXPUG 2016; International Workshop on OpenPOWER for HPC, IWOPH 2016; International Workshop on Performance Portable Programming Models for Accelerators, P^3MA 2016; Workshop on Virtualization in High-Performance Cloud Computing, VHPC 2016; Workshop on Performance and Scalability of Storage Systems, WOPSSS 2016 |
| --- | --- |
| Country/Territory | Germany |
| City | Frankfurt |
| Period | 06/19/16 → 06/23/16 |
Funding
The majority of this work was carried out during a NERSC Exascale Scientific Application Program (NESAP) deep dive known as a Dungeon Session at the offices of Intel in Portland, Oregon. We thank NERSC and Intel for organizing this session. We would also like to thank Jack Deslippe for insightful discussions. Performance results were measured on the Intel Endeavor cluster and on the Cori Phase I system at NERSC, with additional development work carried out on systems at Jefferson Lab. B. Joó gratefully acknowledges funding from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Office of Nuclear Physics and Office of High Energy Physics under the SciDAC program (USQCD). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. The National Energy Research Scientific Computing Center (NERSC) is a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. B. Joó—Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.