Abstract
On a production high-performance computing (HPC) system, where storage devices are shared among multiple applications and managed in a best-effort manner, I/O contention is often a major problem. In this paper, we propose balanced, messaging-based re-routing in conjunction with throttling at the middleware level. This work tackles two key challenges that have not been fully resolved in the past: whether I/O variability can be reduced on a QoS-less HPC storage system, and how to design a runtime scheduling system that can scale to a large number of cores. The proposed scheme uses a two-level messaging system to re-route I/O requests to a less congested storage location so that write performance is improved, while limiting the impact on reads by throttling the re-routing. An analytical model is derived to guide the choice of the optimal throttling factor. We thoroughly analyze the overhead of the virtual messaging layer and explore whether in-transit buffering is effective in managing I/O variability. Contrary to intuition, in-transit buffering cannot completely solve the problem: it can reduce the absolute variability but not the relative variability. The proposed scheme is verified with a synthetic benchmark and through use by production applications.
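The distinction between absolute and relative variability can be illustrated with a small sketch. It assumes that absolute variability is measured by the standard deviation of write times and relative variability by the coefficient of variation (standard deviation divided by the mean); the abstract does not state which metrics the paper uses, so both definitions, the function names, and the sample numbers are illustrative assumptions rather than the authors' method or data.

```python
import statistics


def absolute_variability(samples):
    """Population standard deviation, in the same units as the samples."""
    return statistics.pstdev(samples)


def relative_variability(samples):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


# Hypothetical write times in seconds (made-up values, not measurements).
direct = [10.0, 12.0, 20.0, 38.0]

# Assumption for illustration: buffering shortens every write by the same
# factor of 4, so the spread shrinks together with the mean.
buffered = [t / 4 for t in direct]

print("absolute:", absolute_variability(direct), absolute_variability(buffered))
print("relative:", relative_variability(direct), relative_variability(buffered))
```

Under this uniform-speedup assumption the standard deviation drops by the same factor as the mean, so the coefficient of variation is unchanged. This is one simple way in which a technique can lower absolute variability without lowering relative variability.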
Original language | English |
---|---|
Article number | 8540017 |
Pages (from-to) | 631-645 |
Number of pages | 15 |
Journal | IEEE Transactions on Computers |
Volume | 68 |
Issue number | 5 |
DOIs | |
State | Published - May 1 2019 |
Funding
This work is supported in part by US National Science Foundation grants CCF-1718297 and CCF-1812861 and by Department of Energy Advanced Scientific Computing Research. The work performed at Temple is partially sponsored by the US National Science Foundation under grants #1702474, #1717660, and #1813081. The experiments for this work were conducted on HPC facilities managed by Oak Ridge National Laboratory and the National Energy Research Scientific Computing Center.
Funders | Funder number |
---|---|
US National Science Foundation | CCF-1718297, CCF-1812861 |
National Science Foundation | 1812861, 1718297, 1813081, 1717660, 1702474 |
Advanced Scientific Computing Research | |
Keywords
- High-performance computing
- quality of service
- storage
- variability