Quantum/AI Topology-Aware Latency-Adaptive HPC Workflow Scheduling Optimization

Braulio Caraveo, Liwen Shih, In Saeng Suh, Travis Humble

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The growing demand for more powerful high-performance computing (HPC) systems has led to a steady rise in energy consumption by supercomputing worldwide. This study compares our Application-Topology Mapper (ATMapper) to the popular Simple Linux Utility for Resource Management (SLURM) to explore methods that can further optimize job scheduling within HPC systems. ATMapper is an artificial-intelligence-based approach to job scheduling that is currently being enhanced with quantum annealing (QA) to generate optimal schedules faster. We are applying QA to speed up our ATMapper process and achieve higher computing efficiency, thereby reducing HPC energy consumption. Here, we examine how four job-scheduling approaches perform in processor node assignment on an example network architecture of 4 interconnected nodes. Using a specialized script, we assess the schedule of a computation flow with 11 interdependent tasks. The data movements among nodes were tracked to count the number of interactions (network hops) between nodes needed to complete the tasks. The total number of hops and the job completion time were then used to quantify the efficiency of the different mapping approaches. In addition to SLURM, we also compare our ATMapper to the QA-enabled LBNL TIGER and the D-Wave Distributed Computing processor assignment approaches. The preliminary results showed that our topology-aware, latency-adaptive ATMapper is significantly more efficient than the other scheduling approaches due to its load-imbalance network allocation. The scheduler displayed a computing efficiency of 53% by performing significantly fewer network hops than its alternatives. By reducing the number of hops, ATMapper was able to perform all 11 tasks using only 3 of the 4 given nodes. This research indicates the potential to use QA/AI for HPC job scheduling.
Later, we will test a SLURM simulator program to draw further comparisons on the effectiveness of ATMapper's scheduling approach. The results of this comparison will serve as a baseline for later improving SLURM's performance using a QA-enhanced ATMapper approach.
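
The hop-count metric described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 4-node ring topology (`TOPOLOGY`), the 11-task dependency graph (`DEPENDENCIES`), the two task-to-node mappings, and the function names. It is not ATMapper's algorithm and does not model job completion time; it only shows how the total number of network hops for a given mapping might be counted.

```python
from collections import deque

# Hypothetical 4-node ring network (assumption, not the paper's architecture):
# each node lists the nodes it is directly connected to.
TOPOLOGY = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# Hypothetical workflow of 11 interdependent tasks (0..10), given as
# (producer, consumer) pairs. Each pair is one data movement.
DEPENDENCIES = [
    (0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (3, 6),
    (4, 6), (5, 7), (6, 8), (7, 9), (8, 10), (9, 10),
]


def hop_distance(topology, src, dst):
    """Return the shortest number of network hops from src to dst (BFS)."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in topology[node]:
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError(f"no path from node {src} to node {dst}")


def total_hops(topology, dependencies, mapping):
    """Sum the hops needed for every producer-to-consumer data movement."""
    return sum(
        hop_distance(topology, mapping[producer], mapping[consumer])
        for producer, consumer in dependencies
    )


# Mapping A: spread tasks evenly over all 4 nodes (round-robin).
ROUND_ROBIN = {task: task % 4 for task in range(11)}

# Mapping B: deliberately unbalanced, keeping dependent tasks together
# and using only 3 of the 4 nodes.
CLUSTERED = {0: 0, 1: 0, 3: 0, 4: 0, 6: 0, 2: 1, 5: 1, 7: 1, 9: 1, 8: 3, 10: 3}

if __name__ == "__main__":
    for name, mapping in [("round-robin", ROUND_ROBIN), ("clustered", CLUSTERED)]:
        used = len(set(mapping.values()))
        hops = total_hops(TOPOLOGY, DEPENDENCIES, mapping)
        print(f"{name}: {hops} hops on {used} nodes")
```

On this toy input the round-robin mapping needs 19 hops across 4 nodes, while the clustered mapping needs 4 hops across 3 nodes. These figures describe only the made-up example and say nothing about the paper's measurements; they simply show why an unbalanced, topology-aware placement can lower the hop count.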

Original language: English
Title of host publication: Workshops Program, Posters Program, Panels Program and Tutorials Program
Editors: Candace Culhane, Greg T. Byrd, Hausi Muller, Yuri Alexeev, Sarah Sheldon
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 614-615
Number of pages: 2
ISBN (Electronic): 9798331541378
State: Published - 2024
Event: 5th IEEE International Conference on Quantum Computing and Engineering, QCE 2024 - Montreal, Canada
Duration: Sep 15 2024 → Sep 20 2024

Publication series

Name: Proceedings - IEEE Quantum Week 2024, QCE 2024
Volume: 2

Conference

Conference: 5th IEEE International Conference on Quantum Computing and Engineering, QCE 2024
Country/Territory: Canada
City: Montreal
Period: 09/15/24 → 09/20/24

Funding

This research, carried out under DoE WDTS VFP support, used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the U.S. DoE Office of Science under Contract No. DE-AC05-00OR22725.

Keywords

  • HPC resource allocation optimization
  • application-specific topology-aware HPC scheduling
  • latency-adaptive parallel SW/HW mapping
  • load imbalance
  • quantum annealing (QA)
