Abstract
High Throughput Computing (HTC) datacenters are a cornerstone of scientific discoveries in the fields of High Energy Physics and Astroparticle Physics. These datacenters provide thousands of users from dozens of scientific collaborations with tens of thousands of computing cores and petabytes of storage. The scheduling algorithm used in such datacenters to handle the millions of (mostly single-core) jobs submitted every month ensures a fair sharing of the computing resources among user groups, but may also cause unpredictably long job wait times for some users. The time a job waits depends on many entangled factors and configuration parameters and is thus very hard to predict. Moreover, batch systems implementing a fair-share scheduling algorithm cannot provide users with any estimate of the job wait time at submission time. Therefore, we investigate in this paper how learning-based techniques applied to the logs of the batch scheduling system of a large HTC datacenter can be used to estimate job wait time. First, we illustrate why users need such an estimate. Then, we identify some intuitive causes of this wait time from the information found in the batch system logs. We also formally analyze the correlation between job and system features and job wait time. Finally, we study several Machine Learning algorithms to implement learning-based estimators of both job wait time and job wait time ranges. Our experimental results show that a regression-based estimator can predict job wait time with a median absolute percentage error of about 54%, while a classifier that combines regression and classification assigns nearly 77% of the jobs to the right wait time range or to an immediately adjacent one.
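To make the two reported metrics concrete, the following is a minimal sketch of how a median absolute percentage error and a "right or adjacent range" accuracy could be computed. It is not the authors' implementation: the function names, the range boundaries in `RANGE_EDGES`, and the choice to skip jobs with zero wait time are all assumptions made for illustration, since the abstract does not specify them.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical wait time range boundaries, in seconds (not from the paper).
RANGE_EDGES = [60, 600, 3600, 6 * 3600]


def median_absolute_percentage_error(actual, predicted):
    """Median of |predicted - actual| / actual, expressed as a percentage.

    Jobs with a zero actual wait time are skipped (an assumption), because
    the percentage error is undefined for them.
    """
    errors = [
        abs(p - a) / a * 100.0
        for a, p in zip(actual, predicted)
        if a > 0
    ]
    return median(errors)


def wait_time_range(seconds):
    """Index of the range a wait time falls into, given RANGE_EDGES."""
    return bisect_right(RANGE_EDGES, seconds)


def within_one_range_accuracy(actual, predicted):
    """Fraction of jobs placed in the right range or an adjacent one."""
    hits = sum(
        abs(wait_time_range(a) - wait_time_range(p)) <= 1
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


if __name__ == "__main__":
    # Toy wait times in seconds, invented for this example.
    actual = [100, 200, 400, 30]
    predicted = [150, 100, 400, 7200]
    print(median_absolute_percentage_error(actual, predicted))
    print(within_one_range_accuracy(actual, predicted))
```

A median is less sensitive than a mean to the few jobs whose wait time is badly mispredicted, which is one plausible reason to report it for heavily skewed wait time distributions.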
| Original language | English |
|---|---|
| Title of host publication | Job Scheduling Strategies for Parallel Processing - 24th International Workshop, JSSPP 2021, Revised Selected Papers |
| Editors | Dalibor Klusáček, Walfredo Cirne, Gonzalo P. Rodrigo |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 101-125 |
| Number of pages | 25 |
| ISBN (Print) | 9783030882235 |
| State | Published - 2021 |
| Externally published | Yes |
| Event | 24th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2021 - Virtual, Online; Duration: May 21 2021 → May 21 2021 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 12985 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 24th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2021 |
|---|---|
| City | Virtual, Online |
| Period | 05/21/21 → 05/21/21 |
Funding
The authors would like to thank Wataru Takase and his colleagues from the Japanese High Energy Accelerator Research Organization (KEK) for providing the initial motivation for this work.