GPU age-aware scheduling to improve the reliability of leadership jobs on Titan

Christopher Zimmer, Don Maxwell, Stephen McNally, Scott Atchley, Sudharshan S. Vazhkudai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

In 2015, OLCF's Titan supercomputer experienced a significant increase in GPU-related job failures. The impact on jobs was serious, and OLCF decided to replace ∼50% of the GPUs. Unfortunately, jobs using more than 20% of the machine (i.e., leadership jobs) continued to encounter higher levels of application failures. These jobs contained significant numbers of both low-failure-rate and high-failure-rate GPUs. Leadership jobs feel the impact of these failures more acutely because of their longer wait times, longer runtimes, and higher charge rates. In this work, we have designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We have employed two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques resulted in a 33% increase in low-failure GPU hours assigned to leadership jobs. Our GPU Age-Aware Scheduling has been used in production on Titan since July 2017.
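The abstract names two levers, the system ordering and the allocation mechanism, without giving details. The following is a minimal sketch of how such a pairing could work in principle, not the authors' implementation. It assumes that nodes are sorted so low-failure GPUs come first, that leadership jobs are allocated from that end of the ordering, and that all other jobs are allocated from the opposite end. The names `Node`, `low_failure_gpu`, `order_nodes`, and `allocate` are hypothetical; only the 20% leadership threshold comes from the abstract.

```python
from dataclasses import dataclass

# From the abstract: leadership jobs use more than 20% of the machine.
LEADERSHIP_FRACTION = 0.20


@dataclass(frozen=True)
class Node:
    node_id: int
    low_failure_gpu: bool  # hypothetical flag, e.g. True for a replaced GPU


def order_nodes(nodes):
    """Assumed 'system ordering': low-failure GPUs first, then by node id."""
    return sorted(nodes, key=lambda n: (not n.low_failure_gpu, n.node_id))


def allocate(free_nodes, requested, machine_size):
    """Assumed allocation rule. Returns (allocated, remaining_free).

    Leadership jobs take nodes from the low-failure end of the ordering;
    all other jobs take nodes from the opposite end.
    """
    if requested > len(free_nodes):
        raise ValueError("not enough free nodes")
    ordered = order_nodes(free_nodes)
    if requested > LEADERSHIP_FRACTION * machine_size:
        return ordered[:requested], ordered[requested:]
    split = len(ordered) - requested
    return ordered[split:], ordered[:split]


# Toy machine of 10 nodes; even-numbered nodes carry low-failure GPUs.
machine = [Node(i, i % 2 == 0) for i in range(10)]
leadership_alloc, free = allocate(machine, 4, len(machine))
small_alloc, free = allocate(free, 2, len(machine))
```

In this toy run the four-node leadership job receives only low-failure GPUs, while the two-node job is steered to the high-failure end, leaving the remaining low-failure node available for later leadership requests.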

Original language: English
Title of host publication: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 83-93
Number of pages: 11
ISBN (Electronic): 9781538683842
State: Published - Jul 2 2018
Event: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018

Conference

Conference: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Country/Territory: United States
City: Dallas
Period: 11/11/18 to 11/16/18

Funding

This work was supported by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. DOE under contract No. DE-AC05-00OR22725.

Funder: Oak Ridge National Laboratory
