TY - GEN
T1 - Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing
AU - Hui, Yawei
AU - Ashraf, Rizwan A.
AU - Park, Byung H.
AU - Engelmann, Christian
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Supercomputers are complex systems used to simulate, understand, and solve real-world problems. To operate and maintain these systems efficiently, users and operators need an accurate, concise, and timely determination of system status. However, this determination is challenging because of the intricately connected heterogeneous software and hardware components and the sheer scale of such machines. In this poster, we present work in progress toward a real-time monitoring framework for the 18,688-node Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). To this end, we discuss the use of metrics that present a one-dimensional view of a system that generates various types of information from thousands of components and utilization statistics from hundreds of user applications in near real time. We demonstrate the efficacy of these metrics for understanding and visualizing raw log data generated by the system, which may otherwise comprise thousands of dimensions. We also present the architecture of the proposed real-time stream processing framework, which integrates, processes, analyzes, visualizes, and stores system log data from an array of system components.
KW - Availability
KW - Quantification Metrics
KW - Reliability
KW - Serviceability
KW - System Monitoring
UR - http://www.scopus.com/inward/record.url?scp=85062608809&partnerID=8YFLogxK
U2 - 10.1109/BigData.2018.8621862
DO - 10.1109/BigData.2018.8621862
M3 - Conference contribution
AN - SCOPUS:85062608809
T3 - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
SP - 5339
EP - 5341
BT - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
A2 - Abe, Naoki
A2 - Liu, Huan
A2 - Pu, Calton
A2 - Hu, Xiaohua
A2 - Ahmed, Nesreen
A2 - Qiao, Mu
A2 - Song, Yang
A2 - Kossmann, Donald
A2 - Liu, Bing
A2 - Lee, Kisung
A2 - Tang, Jiliang
A2 - He, Jingrui
A2 - Saltz, Jeffrey
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Big Data, Big Data 2018
Y2 - 10 December 2018 through 13 December 2018
ER -