Abstract
Near the full scale of exascale supercomputers, latency can dominate the cost of all-to-all communication even for very large message sizes. We describe GPU-aware all-to-all implementations designed to reduce latency for large message sizes at extreme scales, and we present their performance using 65536 tasks (8192 nodes) on the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. Two implementations perform best for different ranges of message size, and all outperform the vendor-provided MPI_Alltoall. Our results show promising options for improving implementations of MPI_Alltoall_init.
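The paper's own implementations are not reproduced in this record, but the persistent-collective interface it points to, MPI_Alltoall_init (introduced in MPI 4.0), can be exercised with GPU-resident buffers under a GPU-aware MPI such as the one provided on Frontier. The sketch below is illustrative only: the HIP allocations, the per-rank count, and the iteration loop are assumptions for demonstration, not details taken from the paper.

```c
/* Illustrative sketch: a persistent, GPU-aware all-to-all via MPI_Alltoall_init.
 * Assumes an MPI 4.0 library with GPU-aware support (e.g., Cray MPICH on Frontier)
 * and HIP; counts and buffer sizes are arbitrary, not values from the paper. */
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1 << 20;                         /* doubles sent to each rank (assumed) */
    const size_t bytes = (size_t)count * nranks * sizeof(double);

    double *d_send, *d_recv;                           /* GPU-resident buffers */
    hipMalloc((void **)&d_send, bytes);
    hipMalloc((void **)&d_recv, bytes);

    /* Set up the persistent collective once... */
    MPI_Request req;
    MPI_Alltoall_init(d_send, count, MPI_DOUBLE,
                      d_recv, count, MPI_DOUBLE,
                      MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start and complete it repeatedly, amortizing the setup cost. */
    for (int iter = 0; iter < 10; ++iter) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    hipFree(d_send);
    hipFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Because the communication pattern is fixed across iterations, a persistent collective gives the library an opportunity to choose and cache an algorithm up front, which is the optimization opportunity for MPI_Alltoall_init that the abstract alludes to.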
| Original language | English |
|---|---|
| Title of host publication | Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 461-467 |
| Number of pages | 7 |
| ISBN (Electronic) | 9798400718717 |
| DOIs | |
| State | Published - Nov 15 2025 |
| Event | 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops - St. Louis, United States. Duration: Nov 16 2025 → Nov 21 2025 |
Publication series
| Name | Proceedings of 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
|---|---|
Conference
| Conference | 2025 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC 2025 Workshops |
|---|---|
| Country/Territory | United States |
| City | St. Louis |
| Period | 11/16/25 → 11/21/25 |
Funding
We thank the anonymous reviewers for suggestions regarding related work and performance results. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725.
Keywords
- All-to-all communication
- GPU-aware MPI