TY - GEN
T1 - Using Arm Scalable Vector Extension to Optimize OPEN MPI
AU - Zhong, Dong
AU - Shamis, Pavel
AU - Cao, Qinglei
AU - Bosilca, George
AU - Sumimoto, Shinji
AU - Miura, Kenichi
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. As recent processors support wide vector extensions, vectorization becomes much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE vector Instruction Set Architecture (ISA) and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node, but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of OPEN MPI, providing efficient and scalable capabilities of SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu's A64FX processor, demonstrates that the solution is at the same time generic and efficient.
AB - As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. As recent processors support wide vector extensions, vectorization becomes much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE vector Instruction Set Architecture (ISA) and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node, but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of OPEN MPI, providing efficient and scalable capabilities of SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu's A64FX processor, demonstrates that the solution is at the same time generic and efficient.
KW - ARMIE
KW - SVE
KW - Vector Length Agnostic
KW - datatype pack and unpack
KW - local reduction
KW - non-contiguous accesses
UR - http://www.scopus.com/inward/record.url?scp=85089090228&partnerID=8YFLogxK
U2 - 10.1109/CCGrid49817.2020.00-71
DO - 10.1109/CCGrid49817.2020.00-71
M3 - Conference contribution
AN - SCOPUS:85089090228
T3 - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
SP - 222
EP - 231
BT - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
A2 - Lefevre, Laurent
A2 - Varela, Carlos A.
A2 - Pallis, George
A2 - Toosi, Adel N.
A2 - Rana, Omer
A2 - Buyya, Rajkumar
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
Y2 - 11 May 2020 through 14 May 2020
ER -