Learning-Based Approaches to Estimate Job Wait Time in HTC Datacenters

Luc Gombert, Frédéric Suter

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

3 Scopus citations

Abstract

High Throughput Computing (HTC) datacenters are a cornerstone of scientific discoveries in the fields of High Energy Physics and Astroparticle Physics. These datacenters provide thousands of users from dozens of scientific collaborations with tens of thousands of computing cores and petabytes of storage. The scheduling algorithm used in such datacenters to handle the millions of (mostly single-core) jobs submitted every month ensures a fair sharing of the computing resources among user groups, but may also cause unpredictably long job wait times for some users. The time a job waits depends on many entangled factors and configuration parameters and is thus very hard to predict. Moreover, batch systems implementing a fair-share scheduling algorithm cannot provide users with any estimate of the job wait time at submission time. Therefore, we investigate in this paper how learning-based techniques applied to the logs of the batch scheduling system of a large HTC datacenter can be used to estimate job wait time. First, we illustrate why users need such an estimate. Then, we identify some intuitive causes of this wait time from the information found in the batch system logs. We also formally analyze the correlation between job and system features and job wait time. Finally, we study several Machine Learning algorithms to implement learning-based estimators of both job wait time and job wait time ranges. Our experimental results show that a regression-based estimator can predict job wait time with a median absolute percentage error of about 54%, while a classifier that combines regression and classification assigns nearly 77% of the jobs to the right wait time range or to an immediately adjacent one.
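As a rough illustration of the two evaluation metrics quoted in the abstract, the sketch below computes a median absolute percentage error and the share of jobs placed in the right wait time range or an immediately adjacent one. The function names, the sample wait times, and the range boundaries are hypothetical and are not taken from the paper; the paper's own definitions may differ in detail.

```python
from bisect import bisect_right
from statistics import median


def median_absolute_percentage_error(actual, predicted):
    """Median of |predicted - actual| / actual, in percent.

    Assumes every actual wait time is strictly positive.
    """
    errors = [abs(p - a) / a * 100.0 for a, p in zip(actual, predicted)]
    return median(errors)


def range_index(wait_seconds, boundaries):
    """Index of the wait time range that contains wait_seconds.

    boundaries is a sorted list of upper limits in seconds; a value equal
    to a boundary falls into the next range.
    """
    return bisect_right(boundaries, wait_seconds)


def within_one_range_rate(actual, predicted, boundaries):
    """Fraction of jobs whose predicted range is the true one or a neighbor."""
    hits = sum(
        abs(range_index(a, boundaries) - range_index(p, boundaries)) <= 1
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# Hypothetical wait times in seconds (not data from the paper).
actual = [60, 600, 7200, 90000]
predicted = [90, 300, 7000, 3000]

# Hypothetical range limits: 5 minutes, 1 hour, 1 day.
boundaries = [300, 3600, 86400]

mdape = median_absolute_percentage_error(actual, predicted)
rate = within_one_range_rate(actual, predicted, boundaries)
print(f"median absolute percentage error: {mdape:.1f}%")
print(f"right or adjacent range: {rate:.0%}")
```

The median is less sensitive than the mean to the few jobs whose wait time is badly mispredicted, which may be one reason to report it for heavily skewed wait time distributions.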

Original language: English
Title of host publication: Job Scheduling Strategies for Parallel Processing - 24th International Workshop, JSSPP 2021, Revised Selected Papers
Editors: Dalibor Klusáček, Walfredo Cirne, Gonzalo P. Rodrigo
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 101-125
Number of pages: 25
ISBN (Print): 9783030882235
State: Published - 2021
Externally published: Yes
Event: 24th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2021 - Virtual, Online
Duration: May 21 2021 to May 21 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12985 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2021
City: Virtual, Online
Period: 05/21/21 to 05/21/21

Funding

The authors would like to thank Wataru Takase and his colleagues from the Japanese High Energy Accelerator Research Organization (KEK) for providing the initial motivation for this work.

Funders: High Energy Accelerator Research Organization
