TY - GEN
T1 - Lattice QCD on Intel® Xeon Phi™ coprocessors
AU - Joó, Bálint
AU - Kalamkar, Dhiraj D.
AU - Vaidyanathan, Karthikeyan
AU - Smelyanskiy, Mikhail
AU - Pamnany, Kiran
AU - Lee, Victor W.
AU - Dubey, Pradeep
AU - Watson, William
PY - 2013
Y1 - 2013
N2 - Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often amongst the first to be ported to new high-performance computing architectures. The recently released Intel Xeon Phi architecture features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on Knights Corner (KNC) nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong scaled to 3.9 TFLOPS on 32 KNCs.
AB - Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often amongst the first to be ported to new high-performance computing architectures. The recently released Intel Xeon Phi architecture features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on Knights Corner (KNC) nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong scaled to 3.9 TFLOPS on 32 KNCs.
UR - http://www.scopus.com/inward/record.url?scp=84884497620&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-38750-0_4
DO - 10.1007/978-3-642-38750-0_4
M3 - Conference contribution
AN - SCOPUS:84884497620
SN - 9783642387494
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 40
EP - 54
BT - Supercomputing - 28th International Supercomputing Conference, ISC 2013, Proceedings
T2 - 28th International Supercomputing Conference, ISC 2013
Y2 - 16 June 2013 through 20 June 2013
ER -