Tango: A Cross-layer Approach to Managing I/O Interference over Local Ephemeral Storage

Zhenbo Qiao, Qirui Tian, Zhenlu Qin, Jinzhen Wang, Qin G. Liu, Norbert Podhorszki, Scott Klasky, Hongjian Zhu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

As simulation-based scientific discovery advances to exascale, a major question that the community is striving to answer is how to co-design data storage and complex physics-rich analytics in a way that the time to knowledge can be minimized for post-processing. A particular challenge is how to accommodate a broad spectrum of data analytics needs, particularly those that become clear only very late during post-processing, a scenario where existing methods, such as in situ processing, are unable or less effective in supporting data analytics. As HPC storage systems have become deeper and more complex with the recent addition of NVMe, die-stacked memory, and burst buffers, it has become necessary to fundamentally rethink the paradigms and methods for data storage and analysis. This paper aims to address the issue of I/O interference for data analytics over local ephemeral storage, which is shared by multiple applications in a non-exclusive node usage scenario, often configured for small- to medium-sized clusters. At the core of this work is a coordinated cross-layer approach that reacts to storage interference from both the storage and application layers. By decomposing and distributing analysis data across the storage hierarchy, data analytics can adapt to the interference by reducing or completely avoiding access to lower tiers whenever interference is high, while maintaining a prescribed error bound to limit the information loss. Meanwhile, proper actions are also taken at the storage layer to ensure sufficient bandwidth is allocated for retrieving an augmentation, based upon the cardinality and accuracy of the augmentation as well as the nature of the application. We evaluate three real-world data analytics, XGC, GenASiS, and CFD, on Chameleon, and quantitatively demonstrate that the I/O performance can be vastly improved, e.g., by 52% versus no adaptivity and 36% versus single-layer adaptivity, while maintaining acceptable outcomes of data analysis.

Original language: English
Title of host publication: Proceedings of SC 2024
Subtitle of host publication: International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350352917
DOIs
State: Published - 2024
Event: 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024 - Atlanta, United States
Duration: Nov 17 2024 to Nov 22 2024

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Country/Territory: United States
City: Atlanta
Period: 11/17/24 to 11/22/24

Keywords

  • data analysis
  • data storage
  • High-performance computing
