TY - GEN
T1 - Improving communication performance and scalability of native applications on Intel® Xeon Phi™ coprocessor clusters
AU - Vaidyanathan, Karthikeyan
AU - Pamnany, Kiran
AU - Kalamkar, Dhiraj D.
AU - Heinecke, Alexander
AU - Smelyanskiy, Mikhail
AU - Park, Jongsoo
AU - Kim, Daehyun
AU - Shet G., Aniruddha
AU - Kaul, Bharat
AU - Joó, Bálint
AU - Dubey, Pradeep
PY - 2014
Y1 - 2014
N2 - Intel® Xeon Phi™ coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly affected by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of, avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and network bandwidth. We demonstrate the impact of our techniques on three well-known, highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our communication performance optimizations, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement over our best previously published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPS on 128 Intel Xeon Phi coprocessors, the first native HPL efficiency ever reported on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPS using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
KW - FFT
KW - HPL
KW - Intel Xeon Phi coprocessor clusters
KW - Lattice QCD
KW - native applications
UR - http://www.scopus.com/inward/record.url?scp=84906706181&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2014.113
DO - 10.1109/IPDPS.2014.113
M3 - Conference contribution
AN - SCOPUS:84906706181
SN - 9780769552071
T3 - Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS
SP - 1083
EP - 1092
BT - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014
PB - IEEE Computer Society
T2 - 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014
Y2 - 19 May 2014 through 23 May 2014
ER -