Understanding Node Allocation on Leadership-Class Supercomputers with Graph Analytics

Andy Trinh, Shivam Sheth, Anil Gaihre, Caiwen Ding, Jieyang Chen, Feiyi Wang, David Pugmire, Scott Klasky, Hang Liu, Lipeng Wan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As the scale of modern high-performance computing (HPC) systems keeps growing, job scheduling on those systems becomes extremely challenging. In particular, one of the important tasks job schedulers need to fulfill is to optimize node allocation to improve the jobs' execution efficiency. To optimize node allocation, the job scheduling strategy must take the network topology of the HPC system into consideration. However, existing approaches are either designed for specific network topologies (and so lack generality) or rely on the applications' communication patterns (which are unknown before the application runs on the HPC system). In this paper, we propose a generic topology-aware node allocation strategy based on graph algorithms. Our strategy can reduce the intra-job communication overhead and the inter-job communication interference by selecting nodes that form a sub-graph with a much smaller diameter. We also propose and study four different initialization strategies for our node allocation algorithm to understand how they affect the node allocation results and speed. We evaluate the proposed methods using 30 days of real job traces collected from the OLCF's Titan supercomputer. Compared to the native job scheduling strategy used on Titan, adopting our approach can achieve a 2.5× diameter reduction on average, and for certain jobs the diameter reduction can be up to 8×.
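The idea of picking nodes that form a sub-graph with a smaller diameter can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not spell out. It assumes a toy 8 × 8 two-dimensional torus, measures the diameter as the largest hop count between any two allocated nodes in the full topology, and compares a naive index-order allocation with a hypothetical `compact_allocate` helper that takes the free nodes closest to a seed node. All function names and the topology size are illustrative assumptions.

```python
from collections import deque
from itertools import product


def torus_neighbors(node, dims):
    """Yield the neighbours of `node` in a torus with side lengths `dims`."""
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(node)
            nxt[axis] = (nxt[axis] + step) % size
            yield tuple(nxt)


def hop_distances(source, dims):
    """Breadth-first search: hop count from `source` to every torus node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in torus_neighbors(cur, dims):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def allocation_diameter(nodes, dims):
    """Largest hop count between any two allocated nodes (assumed metric)."""
    nodes = list(nodes)
    worst = 0
    for a in nodes:
        dist = hop_distances(a, dims)
        worst = max(worst, max(dist[b] for b in nodes))
    return worst


def compact_allocate(seed, free, k, dims):
    """Hypothetical strategy: take the k free nodes nearest to `seed`."""
    dist = hop_distances(seed, dims)
    ranked = sorted(free, key=lambda n: (dist[n], n))
    if len(ranked) < k:
        raise ValueError("not enough free nodes for this job")
    return ranked[:k]


dims = (8, 8)                                   # toy topology, an assumption
free = list(product(range(8), range(8)))        # every node is free
k = 5                                           # job size in nodes

index_order = sorted(free)[:k]                  # naive: first k by index
compact = compact_allocate((0, 0), free, k, dims)

print("index-order diameter:", allocation_diameter(index_order, dims))
print("compact diameter:    ", allocation_diameter(compact, dims))
```

In this toy case the index-order allocation stretches along one ring of the torus, while the seed-centred allocation stays within one hop of the seed, so its diameter is smaller. The choice of seed node is what an initialization strategy would decide; the sketch simply fixes it at `(0, 0)`.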

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on High Performance Computing and Communications, Data Science and Systems, Smart City and Dependability in Sensor, Cloud and Big Data Systems and Application, HPCC/DSS/SmartCity/DependSys 2023
Editors: Jinjun Chen, Laurence T. Yang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 780-787
Number of pages: 8
ISBN (Electronic): 9798350330014
DOIs
State: Published - 2023
Event: 25th IEEE International Conferences on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023 - Melbourne, Australia
Duration: Dec 13 2023 → Dec 15 2023

Publication series

Name: Proceedings - 2023 IEEE International Conference on High Performance Computing and Communications, Data Science and Systems, Smart City and Dependability in Sensor, Cloud and Big Data Systems and Application, HPCC/DSS/SmartCity/DependSys 2023

Conference

Conference: 25th IEEE International Conferences on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023
Country/Territory: Australia
City: Melbourne
Period: 12/13/23 → 12/15/23

Keywords

  • Graph Analytics
  • HPC
  • job scheduling
  • topology-aware allocation
