TY - GEN
T1 - MPI collectives on modern multicore clusters
T2 - CCGRID 2008 - 8th IEEE International Symposium on Cluster Computing and the Grid
AU - Mamidala, Amith R.
AU - Kumar, Rahul
AU - De, Debraj
AU - Panda, D. K.
PY - 2008
Y1 - 2008
N2 - The advances in multicore technology and modern interconnects are rapidly accelerating the number of cores deployed in today's commodity clusters. A majority of parallel applications written in MPI employ collective operations in their communication kernels. Optimization of these operations on multicore platforms is the key to obtaining good performance speed-ups. However, designing these operations on modern multicores is a non-trivial task. Modern multicores such as Intel's Clovertown and AMD's Opteron feature various architectural attributes, resulting in interesting ramifications. For example, Clovertown deploys shared L2 caches for a pair of cores, whereas in Opteron, L2 caches are exclusive to a core. Understanding the impact of these architectures on communication performance is crucial to designing efficient collective algorithms. In this paper, we systematically evaluate these architectures and use these insights to develop efficient collective operations such as MPI_Bcast, MPI_Allgather, MPI_Allreduce and MPI_Alltoall. Further, we characterize the behavior of these collective algorithms on multicores, especially when concurrent network and intra-node communications occur. We also evaluate the benefits of the proposed intra-node MPI_Allreduce on Opteron multicores and compare it with Intel Clovertown systems. The optimizations proposed in this paper reduce the latency of MPI_Bcast and MPI_Allgather by 1.9 and 4.0 times, respectively, on 512 cores. For MPI_Allreduce, our optimizations improve performance by as much as 33% on the multicores. Further, we observe up to a three-fold improvement in performance for the matrix multiplication benchmark on 512 cores.
AB - The advances in multicore technology and modern interconnects are rapidly accelerating the number of cores deployed in today's commodity clusters. A majority of parallel applications written in MPI employ collective operations in their communication kernels. Optimization of these operations on multicore platforms is the key to obtaining good performance speed-ups. However, designing these operations on modern multicores is a non-trivial task. Modern multicores such as Intel's Clovertown and AMD's Opteron feature various architectural attributes, resulting in interesting ramifications. For example, Clovertown deploys shared L2 caches for a pair of cores, whereas in Opteron, L2 caches are exclusive to a core. Understanding the impact of these architectures on communication performance is crucial to designing efficient collective algorithms. In this paper, we systematically evaluate these architectures and use these insights to develop efficient collective operations such as MPI_Bcast, MPI_Allgather, MPI_Allreduce and MPI_Alltoall. Further, we characterize the behavior of these collective algorithms on multicores, especially when concurrent network and intra-node communications occur. We also evaluate the benefits of the proposed intra-node MPI_Allreduce on Opteron multicores and compare it with Intel Clovertown systems. The optimizations proposed in this paper reduce the latency of MPI_Bcast and MPI_Allgather by 1.9 and 4.0 times, respectively, on 512 cores. For MPI_Allreduce, our optimizations improve performance by as much as 33% on the multicores. Further, we observe up to a three-fold improvement in performance for the matrix multiplication benchmark on 512 cores.
UR - http://www.scopus.com/inward/record.url?scp=50649091849&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2008.87
DO - 10.1109/CCGRID.2008.87
M3 - Conference contribution
AN - SCOPUS:50649091849
SN - 9780769531564
T3 - Proceedings CCGRID 2008 - 8th IEEE International Symposium on Cluster Computing and the Grid
SP - 130
EP - 137
BT - Proceedings CCGRID 2008 - 8th IEEE International Symposium on Cluster Computing and the Grid
Y2 - 19 May 2008 through 22 May 2008
ER -