TY - GEN
T1 - GUIDE
T2 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
AU - Vazhkudai, Sudharshan S.
AU - Miller, Ross
AU - Tiwari, Devesh
AU - Zimmer, Christopher
AU - Wang, Feiyi
AU - Oral, Sarp
AU - Gunasekaran, Raghul
AU - Steinert, Deryl
N1 - Publisher Copyright: © 2017 Copyright held by the owner/author(s).
PY - 2017/11/12
Y1 - 2017/11/12
AB - In this paper, we describe the GUIDE framework used to collect, federate, and analyze log data from the Oak Ridge Leadership Computing Facility (OLCF), and how we use that data to derive insights into facility operations. We collect system logs and extract monitoring data at every level of the various OLCF subsystems, and have developed a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.
UR - http://www.scopus.com/inward/record.url?scp=85040184299&partnerID=8YFLogxK
U2 - 10.1145/3126908.3126946
DO - 10.1145/3126908.3126946
M3 - Conference contribution
AN - SCOPUS:85040184299
T3 - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
BT - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
PB - Association for Computing Machinery, Inc
Y2 - 12 November 2017 through 17 November 2017
ER -