TY - GEN
T1 - Using Advanced Vector Extensions AVX-512 for MPI Reductions
AU - Zhong, Dong
AU - Cao, Qinglei
AU - Bosilca, George
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/9/21
Y1 - 2020/9/21
N2 - As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoting themselves to exploring increasing levels of parallelism to achieve optimal performance. The design of modern CPUs, including hierarchical memory and SIMD/vectorization capability, governs the efficiency of algorithms. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations that uses AVX, AVX2, and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is both generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512 optimized reduction operations achieve a 10X performance improvement over the Open MPI default for MPI local reduction.
AB - As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoting themselves to exploring increasing levels of parallelism to achieve optimal performance. The design of modern CPUs, including hierarchical memory and SIMD/vectorization capability, governs the efficiency of algorithms. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations that uses AVX, AVX2, and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is both generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512 optimized reduction operations achieve a 10X performance improvement over the Open MPI default for MPI local reduction.
KW - Instruction level parallelism
KW - Intel AVX2/AVX-512
KW - Long vector extension
KW - MPI reduction operation
KW - Single instruction multiple data
KW - Vector operation
UR - http://www.scopus.com/inward/record.url?scp=85093968974&partnerID=8YFLogxK
U2 - 10.1145/3416315.3416316
DO - 10.1145/3416315.3416316
M3 - Conference contribution
AN - SCOPUS:85093968974
T3 - ACM International Conference Proceeding Series
SP - 1
EP - 10
BT - Proceedings of 2020 27th European MPI Users' Group Meeting, EuroMPI/USA 2020
PB - Association for Computing Machinery
T2 - 27th European MPI Users' Group Meeting, EuroMPI/USA 2020
Y2 - 21 September 2020 through 24 September 2020
ER -