Learning from Five-year Resource-Utilization Data of Titan System

Feiyi Wang, Sarp Oral, Satyabrata Sen, Neena Imam

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

11 Scopus citations

Abstract

Titan was the flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). It was deployed in late 2012, became the fastest supercomputer in the world, and was retired on August 2, 2019. With Titan's mission complete, this paper provides a first-order examination of the usage of its critical resources (CPU, memory, GPU, and I/O) over a five-year production period (2015-2019). In particular, we show quantitatively that the majority of CPU time was spent on large-scale jobs, which is consistent with the policy of driving ground-breaking science through leadership computing. We also corroborate the general observation of low CPU-memory usage, with 95% of jobs utilizing 15% or less of the available memory. Additionally, we correlate the increase in total job submissions and the decrease in GPU-enabled jobs during 2016 with the GPU reliability issue that impacted large-scale runs. We further show a surprising read/write ratio over the five-year period, which contradicts the general assumption that large-scale simulation machines are 'write-heavy'. This understanding could influence how next-generation large-scale storage systems are designed. We believe that our analyses and findings will be of great interest to the high-performance computing (HPC) community at large.

Original language: English
Title of host publication: Proceedings - 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728147345
State: Published - Sep 2019
Event: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019 - Albuquerque, United States
Duration: Sep 23 2019 to Sep 26 2019

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2019-September
ISSN (Print): 1552-5244

Conference

Conference: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Country/Territory: United States
City: Albuquerque
Period: 09/23/19 to 09/26/19

Funding

We thank the anonymous reviewers whose comments have greatly improved this manuscript. We also thank Scott Atchley and Ross Miller for their comments and corrections. This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. This work was supported by the United States Department of Defense (DoD) and used resources of the Computational Research and Development Programs at Oak Ridge National Laboratory.
