A comprehensive informative metric for analyzing HPC system status using the LogSCAN platform

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Scopus citations

Abstract

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation, and computation. One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although plenty of operational and maintenance information is collected and stored in real time, and may yield insights into short- and long-term system status, it is difficult to present this information in a comprehensive form. In this work, we present system information entropy (SIE), a newly developed metric that leverages traditional machine learning techniques and information theory. By compressing the multivariate, multidimensional event information recorded during operation of the target system into a single SIE time series, we demonstrate that historical system status can be represented sensitively, concisely, and comprehensively. Given an indicator as sharp as SIE, we argue that follow-up analytics built on it, using more sophisticated approaches such as temporal pattern recognition or causality analysis that incorporates additional independent system metrics, will reveal in-depth knowledge about system status.
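
The abstract does not give the SIE formula, so the following is only a minimal sketch of the general idea of compressing an event log into one entropy time series. It assumes plain Shannon entropy over the event-type distribution in fixed time windows, which may differ from the authors' definition. The function name `windowed_entropy`, the `(timestamp, event_type)` record layout, and the window size are hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def windowed_entropy(events, window_seconds):
    """Return [(window_start, entropy_bits), ...] sorted by window start.

    events: iterable of (timestamp_seconds, event_type) pairs.
    Assumption: Shannon entropy of event types per window; this is an
    illustrative stand-in, not the SIE definition from the paper.
    """
    # Count how often each event type occurs in each time window.
    counts = defaultdict(Counter)
    for ts, event_type in events:
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start][event_type] += 1

    series = []
    for window_start in sorted(counts):
        total = sum(counts[window_start].values())
        # H = sum over types of p * log2(1/p), with p = n / total.
        entropy = sum(
            (n / total) * math.log2(total / n)
            for n in counts[window_start].values()
        )
        series.append((window_start, entropy))
    return series


if __name__ == "__main__":
    # Hypothetical log records: (seconds since start, event type).
    log = [(0, "mce"), (1, "lustre"), (2, "mce"), (3, "lustre"),
           (60, "mce"), (61, "mce")]
    for start, h in windowed_entropy(log, window_seconds=60):
        print(start, h)
```

In this sketch a window dominated by a single event type yields an entropy near zero, while a window with an even mix of types yields a higher value, so shifts in the event mix appear as changes in the series.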

Original language: English
Title of host publication: Proceedings of FTXS 2018
Subtitle of host publication: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 39-48
Number of pages: 10
ISBN (Electronic): 9781728102221
State: Published - Dec 5 2018
Event: 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings of FTXS 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2018
Country/Territory: United States
City: Dallas
Period: 11/11/18 to 11/16/18

Funding

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • Cloud-computing
  • Metrics
  • Visual-analytics
