Can I/O variability be reduced on QoS-Less HPC storage systems?

Dan Huang, Qing Liu, Jong Choi, Norbert Podhorszki, Scott Klasky, Jeremy Logan, George Ostrouchov, Xubin He, Matthew Wolf

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

For a production high-performance computing (HPC) system, where storage devices are shared among multiple applications and managed in a best-effort manner, I/O contention is often a major problem. In this paper, we propose balanced, messaging-based re-routing combined with throttling at the middleware level. This work tackles two key challenges that have not been fully resolved in the past: whether I/O variability can be reduced on a QoS-less HPC storage system, and how to design a runtime scheduling system that can scale to a large number of cores. The proposed scheme uses a two-level messaging system to re-route I/O requests to a less congested storage location so that write performance is improved, while limiting the impact on reads by throttling the re-routing. An analytical model is derived to guide the choice of the optimal throttling factor. We thoroughly analyze the overhead of the virtual messaging layer and explore whether in-transit buffering is effective in managing I/O variability. Contrary to intuition, in-transit buffering cannot completely solve the problem: it can reduce the absolute variability but not the relative variability. The proposed scheme is verified against a synthetic benchmark and has also been used by production applications.
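The abstract does not spell out the re-routing algorithm, so the following is only a minimal sketch of the general idea of throttled re-routing, not the authors' implementation. The function name `choose_target`, the per-target `load` map (outstanding requests as a stand-in congestion metric), and the probabilistic reading of the throttling factor are all assumptions made for illustration.

```python
import random


def choose_target(default_target, load, throttle, rng=random):
    """Pick a storage target for one write request (illustrative only).

    default_target: target the request would normally go to.
    load: dict mapping target id -> outstanding requests
          (hypothetical congestion metric, not from the paper).
    throttle: assumed fraction in [0, 1] of requests allowed to re-route;
              0 disables re-routing, 1 always re-routes when it helps.
    """
    if not 0.0 <= throttle <= 1.0:
        raise ValueError("throttle must be in [0, 1]")

    # Least congested target according to the assumed metric.
    least_loaded = min(load, key=load.get)

    # Re-route only when another target is strictly less congested
    # and the throttle admits this request.
    if load[least_loaded] < load[default_target] and rng.random() < throttle:
        return least_loaded
    return default_target


if __name__ == "__main__":
    load = {"ost0": 9, "ost1": 2, "ost2": 5}
    print(choose_target("ost0", load, throttle=0.5))
```

A lower throttling factor keeps more writes on their default targets, which in this sketch stands in for limiting the disturbance that re-routed writes impose on reads elsewhere; the paper derives its factor from an analytical model that is not reproduced here.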

Original language: English
Article number: 8540017
Pages (from-to): 631-645
Number of pages: 15
Journal: IEEE Transactions on Computers
Volume: 68
Issue number: 5
State: Published - May 1 2019

Funding

This work is supported in part by US National Science Foundation grants CCF-1718297 and CCF-1812861 and by Department of Energy Advanced Scientific Computing Research. The work performed at Temple is partially sponsored by the US National Science Foundation under grants #1702474, #1717660, and #1813081. The experiments for this work were conducted on the HPC facilities managed by Oak Ridge National Lab and the National Energy Research Scientific Computing Center.

Funder: Funder number
US National Science Foundation: CCF-1718297, CCF-1812861
National Science Foundation: 1812861, 1718297, 1813081, 1717660, 1702474
Advanced Scientific Computing Research

    Keywords

    • High-performance computing
    • quality of service
    • storage
    • variability
