Abstract
In this paper, we address the challenges in achieving sustainable data-driven efficiency by providing a detailed exploration of the end-to-end operational data analytics (ODA) framework that evolved through two generations of supercomputer systems at the Oak Ridge Leadership Computing Facility (OLCF). This framework addresses large data streams ingested from heavily instrumented HPC environment that accumulates multi-terabytes per day. We outline the multifaceted data life cycle across HPC procurement, operations, and research & development, identifying key obstacles and design decisions that shape effective strategies in building and supporting data pipelines end-to-end. By sharing key insights and lessons learned from our experience, we offer recommendations for the HPC community on enabling sustainable operational data analytics and beyond. Our contributions aim to bridge the gap between potential and real benefits of operational data, guiding future efforts towards integrated and sustainable operational intelligence in high-performance computing environments.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of SC 2024-W |
| Subtitle of host publication | Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 1795-1804 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798350355543 |
| DOIs | |
| State | Published - 2024 |
| Event | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 - Atlanta, United States Duration: Nov 17 2024 → Nov 22 2024 |
Publication series
| Name | Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
|---|
Conference
| Conference | 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024 |
|---|---|
| Country/Territory | United States |
| City | Atlanta |
| Period | 11/17/24 → 11/22/24 |
Funding
This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE).
Keywords
- Data Analytics
- Data Governance
- HPC Post-Exascale Challenges
- Machine Learning Applications
- Monitoring
- Operational Data Analytics
- Telemetry
- Visual Analytics