An LLM-enabled Workflow for Understanding and Evolving HPC Scheduling Practices

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Emerging AI-enhanced and near real-time scientific workloads are challenging the traditional assumptions of HPC job scheduling systems. In this work, we present an LLM-enabled, portable workflow for analyzing a subset of Slurm job trace data from leadership-class supercomputers, exemplified through over 1.5 million jobs and 18 million job-steps on OLCF's Frontier system. Our hybrid workflow integrates a static data analysis pipeline with dynamic, AI-powered components to generate interactive dashboards and automated, interpretable insights into scheduling behavior, efficiency, and system usage patterns. We demonstrate the workflow's portability across HPC systems and show how AI-driven interpretations augment traditional visualization to uncover inefficiencies, guide policy evolution, and support more responsive scheduling strategies, enabling consistent analytics across HPC system architectures. This approach enables computer science researchers and HPC sysadmins to systematically evaluate workload characteristics and adapt HPC resource management to meet the evolving demands of modern scientific discovery.

Original language: English
Title of host publication: 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings
Publisher: Association for Computing Machinery, Inc
Pages: 151-158
Number of pages: 8
ISBN (Electronic): 9798400721090
State: Published - Dec 20 2025
Event: 54th International Conference on Parallel Processing Workshop, ICPP 2025 - San Diego, United States
Duration: Sep 8 2025 to Sep 11 2025

Publication series

Name: 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings

Conference

Conference: 54th International Conference on Parallel Processing Workshop, ICPP 2025
Country/Territory: United States
City: San Diego
Period: 09/8/25 to 09/11/25

Funding

We thank Brian Etz for early discussions about backfill scheduling analyses and Katie Knight for helping explore Python plotting libraries for this work. We thank Fred Suter for an early review of the paper. This research used resources of the OLCF at ORNL, which is supported by DOE's Office of Science under Contract No. DE-AC05-00OR22725.

Keywords

  • AI-enabled workflows
  • HPC scheduling
  • Slurm job analytics
  • Workload characterization
