Analyzing the impact of system reliability events on applications in the Titan supercomputer

Rizwan A. Ashraf, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

5 Scopus citations

Abstract

Extreme-scale computing systems employ Reliability, Availability, and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators, and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible because we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation covers the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users to increase the operational efficiency of extreme-scale systems.
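
The comparison described in the abstract (repeated runs of the same application, split by whether RAS events were recorded) can be sketched as follows. This is only an illustrative sketch, not the authors' analysis pipeline: the record layout `(app_name, runtime_seconds, ras_event_count)`, the function name `compare_event_impact`, and the choice of median and coefficient of variation as summary statistics are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def compare_event_impact(runs):
    """Compare runtimes of repeated runs with and without RAS events.

    `runs` is an iterable of (app_name, runtime_seconds, ras_event_count)
    tuples. This layout is hypothetical, not the schema used in the paper.
    """
    groups = defaultdict(lambda: {"clean": [], "affected": []})
    for app, runtime, n_events in runs:
        key = "affected" if n_events > 0 else "clean"
        groups[app][key].append(runtime)

    report = {}
    for app, g in groups.items():
        # A comparison needs at least one run of each kind.
        if not g["clean"] or not g["affected"]:
            continue
        clean_med = median(g["clean"])
        affected_med = median(g["affected"])
        report[app] = {
            "clean_median": clean_med,
            "affected_median": affected_med,
            # Positive values mean runs with events were slower.
            "relative_change": (affected_med - clean_med) / clean_med,
            # Coefficient of variation as a simple variability measure.
            "clean_cv": pstdev(g["clean"]) / mean(g["clean"]),
            "affected_cv": pstdev(g["affected"]) / mean(g["affected"]),
        }
    return report


# Made-up example records, for illustration only.
runs = [
    ("app_a", 100.0, 0), ("app_a", 102.0, 0), ("app_a", 98.0, 0),
    ("app_a", 120.0, 2), ("app_a", 130.0, 1),
    ("app_b", 50.0, 0),  # no run with events, so it is skipped
]
report = compare_event_impact(runs)
```

Applications that never appear in both groups are excluded, which mirrors the abstract's point that the comparison depends on users executing the same application multiple times.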

Original language: English
Title of host publication: Proceedings of FTXS 2018
Subtitle of host publication: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 49-58
Number of pages: 10
ISBN (Electronic): 9781728102221
DOIs
State: Published - Dec 5 2018
Event: 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings of FTXS 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2018
Country/Territory: United States
City: Dallas
Period: 11/11/18 to 11/16/18

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • Application-log
  • Field-Study
  • HPC-Applications
  • Log-Data-Analytics
  • Performance
  • Reliability-Availability-Serviceability-(RAS)-log
  • Supercomputers
