Abstract
Supercomputers are complex systems used to simulate, understand, and solve real-world problems. To operate and maintain these systems efficiently, users and operators need an accurate, concise, and timely determination of system status. However, this determination is challenging because of the intricately connected heterogeneous software and hardware components and the sheer scale of such machines. In this poster, we present work in progress toward a real-time monitoring framework for the 18,688-node Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Toward this end, we discuss the use of metrics that present a one-dimensional view of the system, which generates various types of information from thousands of components and utilization statistics from hundreds of user applications in near real time. We demonstrate the efficacy of these metrics for understanding and visualizing raw log data generated by the system, which may otherwise comprise thousands of dimensions. We also present the architecture of the proposed real-time stream processing framework, which integrates, processes, analyzes, visualizes, and stores system log data from an array of system components.
Original language | English |
---|---|
Title of host publication | Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 |
Editors | Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, Jeffrey Saltz |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 5339-5341 |
Number of pages | 3 |
ISBN (Electronic) | 9781538650356 |
State | Published - Jul 2 2018 |
Event | 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States; Duration: Dec 10 2018 → Dec 13 2018 |
Publication series
Name | Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 |
---|---|
Conference
Conference | 2018 IEEE International Conference on Big Data, Big Data 2018 |
---|---|
Country/Territory | United States |
City | Seattle |
Period | 12/10/18 → 12/13/18 |
Funding
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Resilience for Extreme Scale Supercomputing Systems Program, with program manager Lucy Nowell, under contract number DE-AC05-00OR22725.
Keywords
- Availability
- Quantification Metrics
- Reliability
- Serviceability
- System Monitoring