
Sequence length scaling in vision transformers for scientific images on frontier

Research output: Contribution to journal › Article › peer-review

Abstract

Vision Transformers (ViTs) are pivotal for foundation models in scientific imagery, including Earth science applications, because they can process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these techniques to ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, which leverages DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models of up to 10B parameters, revealed substantial bottlenecks. We countered these with hybrid sequence, pipeline, and flash attention strategies to scale beyond single-GPU memory limits. Our method improves climate modeling accuracy by 20% in temperature predictions and marks the first training of a vision transformer model to convergence with a sequence length of 188K tokens using full self-attention.
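
To illustrate why long sequences exceed single-GPU memory limits, the following is a minimal back-of-the-envelope sketch, not the authors' code. It assumes that "188K tokens" means exactly 188,000 tokens, that scores are stored in 2-byte (fp16/bf16) elements, and that the full attention score matrix is materialized (which flash attention avoids). The function names and the 8-way sequence-parallel split are hypothetical choices for illustration only.

```python
def vit_num_tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens for an image cut into non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_scores_bytes(num_tokens: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one head's full N x N attention score matrix."""
    return num_tokens * num_tokens * bytes_per_element


def tokens_per_gpu(num_tokens: int, sequence_parallel_degree: int) -> int:
    """Tokens held by each GPU when the sequence is split evenly across a group."""
    if num_tokens % sequence_parallel_degree:
        raise ValueError("sequence length must divide evenly across the group")
    return num_tokens // sequence_parallel_degree


if __name__ == "__main__":
    # A standard 224 x 224 image with 16 x 16 patches yields a short sequence.
    print("224x224, patch 16:", vit_num_tokens(224, 224, 16), "tokens")

    # Assumed exact value for the "188K tokens" quoted in the abstract.
    n = 188_000
    gb = attention_scores_bytes(n) / 1e9
    print(f"{n} tokens: {gb:.1f} GB per head for a materialized 2-byte score matrix")

    # Hypothetical 8-way sequence-parallel group, chosen only for illustration.
    print("tokens per GPU at 8-way sequence parallelism:", tokens_per_gpu(n, 8))
```

Because the score matrix grows quadratically with sequence length, a single head at this length would already need tens of gigabytes if materialized, which is consistent with the abstract's reliance on sequence parallelism and flash attention.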

Original language: English
Pages (from-to): 273-290
Number of pages: 18
Journal: International Journal of High Performance Computing Applications
Volume: 40
Issue number: 3
State: Published - May 1 2026

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by UT-Battelle; DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan). This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Keywords

  • computer vision
  • computing methodologies
  • distributed deep learning
  • machine learning algorithms
  • parallel algorithms
