TY - GEN
T1 - RingX
T2 - 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
AU - Yin, Junqi
AU - Palash, Mijanur
AU - Shankar, Mallikarjun
AU - Wang, Feiyi
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/11/15
Y1 - 2025/11/15
AB - The attention mechanism has become foundational for remarkable AI breakthroughs since the introduction of the Transformer, driving the demand for increasingly longer context to power frontier models such as large-scale reasoning language models and high-resolution image/video generators. However, its quadratic computational and memory complexities present substantial challenges. Current state-of-the-art parallel attention methods, such as ring attention, are widely adopted for long-context training but utilize a point-to-point communication strategy that fails to fully exploit the capabilities of modern HPC network architectures. In this work, we propose ringX, a scalable family of parallel attention methods optimized explicitly for HPC systems. By enhancing workload partitioning, refining communication patterns, and improving load balancing, ringX achieves up to 3.4× speedup compared to conventional ring attention on the Frontier supercomputer. Optimized for both bi-directional and causal attention mechanisms, ringX demonstrates its effectiveness through training benchmarks of a Vision Transformer (ViT) on a climate dataset and a Generative Pre-Trained Transformer (GPT) model, Llama3 8B. Our method attains an end-to-end training speedup of approximately 1.5× in both scenarios. To our knowledge, the achieved 38% model FLOPs utilization (MFU) for training Llama3 8B with a 1M-token sequence length on 4,096 GPUs represents one of the highest training efficiencies reported for long-context learning on HPC systems. Our code implementation is available at https://github.com/jqyin/ringX-attention.
KW - HPC for AI
KW - Long-context learning
KW - Parallel attention
UR - https://www.scopus.com/pages/publications/105023983233
U2 - 10.1145/3712285.3759859
DO - 10.1145/3712285.3759859
M3 - Conference contribution
AN - SCOPUS:105023983233
T3 - Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
SP - 1395
EP - 1408
BT - Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
PB - Association for Computing Machinery, Inc
Y2 - 16 November 2025 through 21 November 2025
ER -