TY - GEN
T1 - Reducing the human-in-the-loop component of the scheduling of large HTC workloads
AU - Azevedo, Frédéric
AU - Gombert, Luc
AU - Suter, Frédéric
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - A common characteristic of major physics experiments is an ever-increasing need for computing resources to process experimental data and generate simulated data. The IN2P3 Computing Center provides its 2,500 users with about 35,000 cores and processes millions of jobs every month. This workload is composed of a vast majority of sequential jobs that correspond to Monte-Carlo simulations and related analyses of data produced by the Large Hadron Collider at CERN. To schedule such a workload under specific constraints, the CC-IN2P3 relied for 20 years on an in-house job and resource management system complemented by an operations team that can directly act on the decisions made by the job scheduler and modify them. This system was replaced in 2011, but legacy rules of thumb remained. Combined with other rules motivated by production constraints, they may act against the job scheduler's optimizations and force the operators to apply more corrective actions than they should. In this experience report from a production system, we describe the decisions made since the end of 2016 to either transfer some of the actions done by operators to the job scheduler or make these actions unnecessary. The physical partitioning of resources into distinct pools has been replaced by a logical partitioning that leverages scheduling queues. Then, some historical constraints, such as quotas, have been relaxed. For instance, the number of concurrent jobs from a given user group allowed to access a specific resource, e.g., a storage subsystem, has been progressively increased. Finally, the computation of the fair-share by the job scheduler has been modified to be less detrimental to small groups whose jobs have a low priority. The preliminary but promising results of these modifications constitute the beginning of a long-term activity to change the operation procedures applied to the computing infrastructure of the IN2P3 Computing Center.
UR - https://www.scopus.com/pages/publications/85060680046
U2 - 10.1007/978-3-030-10632-4_3
DO - 10.1007/978-3-030-10632-4_3
M3 - Conference contribution
AN - SCOPUS:85060680046
SN - 9783030106317
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 39
EP - 60
BT - Job Scheduling Strategies for Parallel Processing - 22nd International Workshop, JSSPP 2018, Revised Selected Papers
A2 - Klusáček, Dalibor
A2 - Cirne, Walfredo
A2 - Desai, Narayan
PB - Springer Verlag
T2 - 22nd International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2018
Y2 - 25 May 2018 through 25 May 2018
ER -