TY - CONF
T1 - CommBench
T2 - 38th ACM International Conference on Supercomputing, ICS 2024
AU - Hidayetoglu, Mert
AU - De Gonzalo, Simon Garcia
AU - Slaughter, Elliott
AU - Li, Yu
AU - Zimmer, Christopher
AU - Bicer, Tekin
AU - Ren, Bin
AU - Gropp, William
AU - Hwu, Wen Mei
AU - Aiken, Alex
N1 - Publisher Copyright: © 2024 Owner/Author.
PY - 2024/5/30
Y1 - 2024/5/30
AB - Modern high-performance computing systems have multiple GPUs and network interface cards (NICs) per node. The resulting network architectures have multilevel hierarchies of subnetworks with different interconnect and software technologies. These systems offer multiple vendor-provided communication capabilities and library implementations (IPC, MPI, NCCL, RCCL, OneCCL) with APIs providing varying levels of performance across the different levels. Understanding this performance is currently difficult because of the wide range of architectures and programming models (CUDA, HIP, OneAPI). We present CommBench, a library with cross-system portability and a high-level API that enables developers to easily build microbenchmarks relevant to their use cases and gain insight into the performance (bandwidth & latency) of multiple implementation libraries on different networks. We demonstrate CommBench with three sets of microbenchmarks that profile the performance of six systems. Our experimental results reveal the effect of multiple NICs on optimizing the bandwidth across nodes and also present the performance characteristics of four available communication libraries within and across nodes of NVIDIA, AMD, and Intel GPU networks.
UR - http://www.scopus.com/inward/record.url?scp=85196271050&partnerID=8YFLogxK
U2 - 10.1145/3650200.3656591
DO - 10.1145/3650200.3656591
M3 - Conference contribution
AN - SCOPUS:85196271050
T3 - Proceedings of the International Conference on Supercomputing
SP - 426
EP - 436
BT - ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 4 June 2024 through 7 June 2024
ER -