TY - GEN
T1 - A Big Data Analytics Framework for HPC Log Data
T2 - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
AU - Park, Byung H.
AU - Hui, Yawei
AU - Boehm, Swen
AU - Ashraf, Rizwan A.
AU - Layton, Christopher
AU - Engelmann, Christian
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/10/29
Y1 - 2018/10/29
AB - Reliability, availability and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated from multiple logging systems and sensors that cover many components of the system. The analysis of these data for finding persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail their workflows using three years of log data collected from ORNL's Titan supercomputer.
KW - Big Data applications
KW - Data analysis
KW - Event log analysis
KW - High performance computing
UR - http://www.scopus.com/inward/record.url?scp=85057269230&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2018.00073
DO - 10.1109/CLUSTER.2018.00073
M3 - Conference contribution
AN - SCOPUS:85057269230
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 571
EP - 579
BT - Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 10 September 2018 through 13 September 2018
ER -