TY - GEN
T1 - GPU age-aware scheduling to improve the reliability of leadership jobs on Titan
AU - Zimmer, Christopher
AU - Maxwell, Don
AU - McNally, Stephen
AU - Atchley, Scott
AU - Vazhkudai, Sudharshan S.
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - In 2015, OLCF's Titan supercomputer experienced a significant increase in GPU-related job failures. The impact on jobs was serious, and OLCF decided to replace ∼50% of the GPUs. Unfortunately, jobs using more than 20% of the machine (i.e., leadership jobs) continued to encounter higher levels of application failures. These jobs contained significant numbers of both low-failure-rate and high-failure-rate GPUs. Leadership jobs feel the impact of these failures more acutely because of their longer wait times, longer runtimes, and higher charge rates. In this work, we have designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We have employed two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques resulted in a 33% increase in low-failure GPU hours being assigned to leadership jobs. Our GPU Age-Aware Scheduling has been used in production on Titan since July 2017.
AB - In 2015, OLCF's Titan supercomputer experienced a significant increase in GPU-related job failures. The impact on jobs was serious, and OLCF decided to replace ∼50% of the GPUs. Unfortunately, jobs using more than 20% of the machine (i.e., leadership jobs) continued to encounter higher levels of application failures. These jobs contained significant numbers of both low-failure-rate and high-failure-rate GPUs. Leadership jobs feel the impact of these failures more acutely because of their longer wait times, longer runtimes, and higher charge rates. In this work, we have designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We have employed two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques resulted in a 33% increase in low-failure GPU hours being assigned to leadership jobs. Our GPU Age-Aware Scheduling has been used in production on Titan since July 2017.
UR - http://www.scopus.com/inward/record.url?scp=85064128319&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00010
DO - 10.1109/SC.2018.00010
M3 - Conference contribution
AN - SCOPUS:85064128319
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 83
EP - 93
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -