Abstract
Modern CPU designs, with their deep memory hierarchies and SIMD/vectorization capabilities, have a more significant impact on algorithmic efficiency than the modest frequency increases observed in recent years. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization a critical software component for increasing efficiency and closing the gap to peak performance. In this paper, we investigate the impact of vectorization on MPI reduction operations. We propose an implementation of the predefined MPI reduction operations using vector intrinsics (AVX and SVE) to improve their time-to-solution. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizes to many vector architectures. Experiments conducted on varied architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector-extension-optimized reduction operations significantly reduce the completion time of collective reductions. With these optimizations, we achieve higher memory bandwidth and increased efficiency for local computations, which directly benefits the overall cost of collective reductions and the applications that rely on them.
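For illustration, below is a minimal sketch of the kind of intrinsics-based kernel the abstract describes: the elementwise float summation behind an MPI_SUM reduction on MPI_FLOAT buffers, written for AVX-512 on x86 and SVE on Arm. The function name `vec_sum_float` and the loop structure are assumptions made for this sketch, not the paper's actual implementation.

```c
/*
 * Illustrative sketch only (not the paper's code): the local
 * elementwise accumulation performed by an MPI_SUM reduction on
 * MPI_FLOAT buffers, using AVX-512 intrinsics on x86 and SVE
 * intrinsics on Arm.
 */
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>

/* inout[i] += in[i] for count floats, 16 lanes per 512-bit register. */
static void vec_sum_float(const float *in, float *inout, size_t count)
{
    size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        __m512 a = _mm512_loadu_ps(in + i);     /* unaligned vector loads */
        __m512 b = _mm512_loadu_ps(inout + i);
        _mm512_storeu_ps(inout + i, _mm512_add_ps(a, b));
    }
    for (; i < count; ++i)                      /* scalar tail */
        inout[i] += in[i];
}

#elif defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

/* Vector-length-agnostic version: the predicate masks the tail lanes. */
static void vec_sum_float(const float *in, float *inout, size_t count)
{
    for (size_t i = 0; i < count; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, count);  /* lanes with i+k < count */
        svfloat32_t a = svld1_f32(pg, in + i);
        svfloat32_t b = svld1_f32(pg, inout + i);
        svst1_f32(pg, inout + i, svadd_f32_m(pg, a, b));
    }
}
#endif
```

Note the design difference between the two paths: the fixed-width AVX-512 loop needs an explicit scalar remainder, whereas SVE is vector-length agnostic, so the per-iteration predicate from `svwhilelt_b32_u64` handles the tail and the same binary runs unmodified across different SVE vector widths.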
Original language | English |
---|---|
Article number | 102871 |
Journal | Parallel Computing |
Volume | 109 |
DOIs | |
State | Published - Mar 2022 |
Externally published | Yes |
Funding
This material is based upon work supported by the National Science Foundation, United States under Grant No. (1664142); and the Exascale Computing Project, United States (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, United States. The authors would also like to thank the Texas Advanced Computing Center (TACC). For computer time, this research used the Stampede2 flagship supercomputer of the Extreme Science and Engineering Discovery Environment (XSEDE) hosted at TACC.
Funders | Funder number |
---|---|
Extreme Science and Engineering Discovery Environment (XSEDE) | |
Texas Advanced Computing Center | |
National Science Foundation | 1664142
U.S. Department of Energy | 17-SC-20-SC
Directorate for Computer and Information Science and Engineering | 1725692
National Nuclear Security Administration | |
Keywords
- Instruction level parallelism
- Intel AVX2/AVX-512
- Long vector extension
- MPI reduction operation
- Scalable Vector Extension (SVE)
- Single instruction multiple data
- Vector operation