Abstract
Emerging AI-enhanced and near real-time scientific workloads are challenging the traditional assumptions of HPC job scheduling systems. In this work, we present an LLM-enabled, portable workflow for analyzing a subset of Slurm job trace data from leadership-class supercomputers, exemplified through over 1.5 million jobs and 18 million job-steps on OLCF's Frontier system. Our hybrid workflow integrates a static data analysis pipeline with dynamic, AI-powered components to generate interactive dashboards and automated, interpretable insights into scheduling behavior, efficiency, and system usage patterns. We demonstrate the workflow's portability across HPC systems and show how AI-driven interpretations augment traditional visualization to uncover inefficiencies, guide policy evolution, and support more responsive scheduling strategies, enabling consistent analytics across HPC system architectures. This approach enables computer science researchers and HPC sysadmins to systematically evaluate workload characteristics and adapt HPC resource management to meet the evolving demands of modern scientific discovery.
| Original language | English |
|---|---|
| Title of host publication | 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 151-158 |
| Number of pages | 8 |
| ISBN (Electronic) | 9798400721090 |
| State | Published - Dec 20 2025 |
| Event | 54th International Conference on Parallel Processing Workshop, ICPP 2025 - San Diego, United States; Duration: Sep 8 2025 → Sep 11 2025 |
Publication series
| Name | 54th International Conference on Parallel Processing, ICPP 2025 - Workshops Proceedings |
|---|---|
Conference
| Conference | 54th International Conference on Parallel Processing Workshop, ICPP 2025 |
|---|---|
| Country/Territory | United States |
| City | San Diego |
| Period | 09/08/25 → 09/11/25 |
Funding
We thank Brian Etz for early discussions about backfill scheduling analyses and Katie Knight for helping explore Python plotting libraries for this work. We thank Fred Suter for an early review of the paper. This research used resources of the OLCF at ORNL, which is supported by DOE's Office of Science under Contract No. DE-AC05-00OR22725.
Keywords
- AI-enabled workflows
- HPC scheduling
- Slurm job analytics
- Workload characterization