TY - GEN
T1 - Improving large-scale storage system performance via topology-aware and balanced data placement
AU - Wang, Feiyi
AU - Oral, Sarp
AU - Gupta, Saurabh
AU - Tiwari, Devesh
AU - Vazhkudai, Sudharshan S.
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014
Y1 - 2014
N2 - With the advent of big data, the I/O subsystems of large-scale compute clusters are becoming a center of focus. More applications are putting greater demands on end-to-end I/O performance. These subsystems are often complex in design. They comprise multiple hardware and software layers to cope with the increasing capacity, capability, and scalability requirements of data-intensive applications. However, the shared nature of storage resources and the intrinsic interactions across these layers make it a great challenge to realize end-to-end performance gains. This paper proposes a topology-aware strategy to balance the load across resources and improve per-application I/O performance. We demonstrate the effectiveness of our algorithm on an extreme-scale compute cluster, Titan, at the Oak Ridge Leadership Computing Facility (OLCF). Our experiments with both synthetic benchmarks and a real-world application show that, even under congestion, our proposed algorithm can improve large-scale application I/O performance significantly, resulting in both reduced application run time and higher-resolution simulation runs.
AB - With the advent of big data, the I/O subsystems of large-scale compute clusters are becoming a center of focus. More applications are putting greater demands on end-to-end I/O performance. These subsystems are often complex in design. They comprise multiple hardware and software layers to cope with the increasing capacity, capability, and scalability requirements of data-intensive applications. However, the shared nature of storage resources and the intrinsic interactions across these layers make it a great challenge to realize end-to-end performance gains. This paper proposes a topology-aware strategy to balance the load across resources and improve per-application I/O performance. We demonstrate the effectiveness of our algorithm on an extreme-scale compute cluster, Titan, at the Oak Ridge Leadership Computing Facility (OLCF). Our experiments with both synthetic benchmarks and a real-world application show that, even under congestion, our proposed algorithm can improve large-scale application I/O performance significantly, resulting in both reduced application run time and higher-resolution simulation runs.
KW - High Performance Computing
KW - Parallel File System
KW - Performance Evaluation
KW - Storage Area Network
UR - http://www.scopus.com/inward/record.url?scp=84988227780&partnerID=8YFLogxK
U2 - 10.1109/PADSW.2014.7097866
DO - 10.1109/PADSW.2014.7097866
M3 - Conference contribution
AN - SCOPUS:84988227780
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 656
EP - 663
BT - 2014 20th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2014 - Proceedings
PB - IEEE Computer Society
T2 - 20th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2014
Y2 - 16 December 2014 through 19 December 2014
ER -