Improving Fairness in a Large Scale HTC System Through Workload Analysis and Simulation

Frédéric Azevedo, Dalibor Klusáček, Frédéric Suter

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

4 Scopus citations

Abstract

Monitoring and analyzing the execution of a workload is at the core of data center operations. It allows operators to verify that operational objectives are met and to detect and react to unexpected or unwanted behavior. However, the scale and complexity of large workloads, composed of millions of jobs executed each month on several thousand cores, often limit the depth of such an analysis. As a result, operators may overlook phenomena that are harmless at a global scale but detrimental to a specific class of users. In this paper, we illustrate such a situation by analyzing a large High Throughput Computing (HTC) workload trace from one of the largest academic computing centers in France. The Fair-Share algorithm at the core of the batch scheduler ensures that all user groups receive an amount of computing resources commensurate with their expressed needs. However, a deeper analysis of the resulting schedule, especially of the job waiting times, reveals a degree of unfairness between user groups. We identify the configuration of the quotas and scheduling queues as the main root cause of this unfairness. We therefore propose a drastic reconfiguration of the system that is better suited to the characteristics of the workload and better balances the waiting time among user groups. We evaluate the impact of this reconfiguration through detailed simulations. The results show that it still satisfies the main operational objectives while significantly improving the quality of service experienced by formerly disadvantaged users.

Original language: English
Title of host publication: Euro-Par 2019
Subtitle of host publication: Parallel Processing - 25th International Conference on Parallel and Distributed Computing, Proceedings
Editors: Ramin Yahyapour
Publisher: Springer
Pages: 129-141
Number of pages: 13
ISBN (Print): 9783030293994
DOIs
State: Published - 2019
Externally published: Yes
Event: 25th International European Conference on Parallel and Distributed Computing, Euro-Par 2019 - Göttingen, Germany
Duration: Aug 26 2019 to Aug 30 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11725 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th International European Conference on Parallel and Distributed Computing, Euro-Par 2019
Country/Territory: Germany
City: Göttingen
Period: 08/26/19 to 08/30/19

Funding

Acknowledgements. We kindly acknowledge the support provided by MetaCentrum under the program LM2015042 and the project Reg. No. CZ.02.1.01/0.0/0.0/16_013/0001797 co-funded by the Ministry of Education, Youth and Sports of the Czech Republic. We also thank L. Gombert, N. Lajili, and O. Aidel for their kind help.
