Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing

Yawei Hui, Rizwan A. Ashraf, Byung H. Park, Christian Engelmann

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Supercomputers are complex systems used to simulate, understand, and solve real-world problems. To operate and maintain these systems efficiently, users and operators need an accurate, concise, and timely determination of system status. However, this determination is challenging because of the intricately connected heterogeneous software and hardware components and the sheer scale of such machines. In this poster, we demonstrate work in progress toward a real-time monitoring framework for the 18,688-node Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Toward this end, we discuss the use of metrics that present a one-dimensional view of the system, which generates various types of information from thousands of components and utilization statistics from hundreds of user applications in near real time. We demonstrate the efficacy of these metrics for understanding and visualizing raw log data generated by the system, which may otherwise comprise thousands of dimensions. We also present the architecture of the proposed real-time stream processing framework, which integrates, processes, analyzes, visualizes, and stores system log data from an array of system components.

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
Editors: Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, Jeffrey Saltz
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5339-5341
Number of pages: 3
ISBN (Electronic): 9781538650356
DOIs
State: Published - Jul 2 2018
Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: Dec 10 2018 to Dec 13 2018

Publication series

Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
Country/Territory: United States
City: Seattle
Period: 12/10/18 to 12/13/18

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Resilience for Extreme Scale Supercomputing Systems Program, with program manager Lucy Nowell, under contract number DE-AC05-00OR22725.

Funders (funder number, where listed):
Compute and Data Environment for Science
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research (DE-AC05-00OR22725)

    Keywords

    • Availability
    • Quantification Metrics
    • Reliability
    • Serviceability
    • System Monitoring
