Abstract
Monitoring and analyzing the execution of a workload is at the core of the operation of data centers. It allows operators to verify that the operational objectives are satisfied, or to detect and react to any unexpected and unwanted behavior. However, the scale and complexity of large workloads, composed of millions of jobs executed each month on several thousand cores, often limit the depth of such an analysis. This may lead operators to overlook some phenomena that, while not harmful at a global scale, can be detrimental to a specific class of users. In this paper, we illustrate such a situation by analyzing a large High Throughput Computing (HTC) workload trace coming from one of the largest academic computing centers in France. The Fair-Share algorithm at the core of the batch scheduler ensures that all user groups are fairly provided with an amount of computing resources commensurate with their expressed needs. However, a deeper analysis of the produced schedule, especially of the job waiting times, shows a certain degree of unfairness between user groups. We identify the configuration of the quotas and scheduling queues as the main root causes of this unfairness. We thus propose a drastic reconfiguration of the system that aims to better suit the characteristics of the workload and to better balance the waiting time among user groups. We evaluate the impact of this reconfiguration through detailed simulations. The results show that it still satisfies the main operational objectives while significantly improving the quality of service experienced by formerly unfavored users.
Original language | English |
---|---|
Title of host publication | Euro-Par 2019 |
Subtitle of host publication | Parallel Processing - 25th International Conference on Parallel and Distributed Computing, Proceedings |
Editors | Ramin Yahyapour |
Publisher | Springer |
Pages | 129-141 |
Number of pages | 13 |
ISBN (Print) | 9783030293994 |
State | Published - 2019 |
Externally published | Yes |
Event | 25th International European Conference on Parallel and Distributed Computing, Euro-Par 2019 - Göttingen, Germany. Duration: Aug 26 2019 → Aug 30 2019 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 11725 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 25th International European Conference on Parallel and Distributed Computing, Euro-Par 2019 |
---|---|
Country/Territory | Germany |
City | Göttingen |
Period | 08/26/19 → 08/30/19 |
Funding
Acknowledgements. We kindly acknowledge the support provided by MetaCentrum under the program LM2015042 and the project Reg. No. CZ.02.1.01/0.0/0.0/16_013/0001797 co-funded by the Ministry of Education, Youth and Sports of the Czech Republic. We also thank L. Gombert, N. Lajili, and O. Aidel for their kind help.