Abstract
The layout-aware data scheduling (LADS) data movement framework mitigates congestion in end-to-end data transfers. During a transfer, LADS avoids congested storage elements by exploiting the underlying storage layout at each endpoint. This improves the I/O bandwidth and hence the data transfer rate across high-speed networks. However, the absence of fault tolerance (FT) in LADS results in data retransmission overhead and may cause data integrity issues when faults occur. In this paper, we propose object-logging FT mechanisms that avoid retransmitting objects already written successfully to the parallel file system (PFS) at the sink end. Depending on the number of log files created for the whole dataset, we classify our FT mechanisms into three categories: file logger, transaction logger, and universal logger. To address the space overhead, we also propose different methods of populating the log files with information about the successfully transferred objects. We evaluate the data transfer performance and recovery time overhead of the proposed object-logging-based FT mechanisms on the LADS data transfer framework. Our experimental results show that the FT mechanisms exhibit negligible overhead (< 1%) with respect to the data transfer time. However, the fault recovery time is 10% higher than the total data transfer time at any fault point.
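The object-logging idea in the abstract can be illustrated with a minimal sketch. It is not the LADS implementation: the class `FileLogger`, the helper `pending_objects`, the one-line-per-object log format, and the `.log` suffix are all assumptions made for illustration. The sketch follows the file-logger category by keeping one log per data file; the transaction and universal loggers named in the abstract would differ in how many log files cover the dataset.

```python
import os


class FileLogger:
    """Hypothetical per-file object log (one log file per data file)."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def _log_path(self, file_name):
        # Assumed naming scheme: "<data file name>.log" inside log_dir.
        return os.path.join(self.log_dir, file_name + ".log")

    def record(self, file_name, object_id):
        """Append the ID of an object that was written at the sink."""
        with open(self._log_path(file_name), "a") as log:
            log.write(f"{object_id}\n")
            log.flush()
            os.fsync(log.fileno())  # make the entry durable before returning

    def completed(self, file_name):
        """Return the set of object IDs already logged for this file."""
        path = self._log_path(file_name)
        if not os.path.exists(path):
            return set()
        with open(path) as log:
            return {int(line) for line in log if line.strip()}


def pending_objects(logger, file_name, total_objects):
    """After a fault, list the objects that still need to be transferred."""
    done = logger.completed(file_name)
    return [i for i in range(total_objects) if i not in done]
```

On recovery, a sender following this sketch would call `pending_objects` for each file and transmit only the returned object IDs, which is the retransmission saving the abstract describes. Writing one text line per object is the simplest way to populate a log; the abstract notes that the authors propose other population methods to reduce space overhead, and those are not modeled here.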
Original language | English |
---|---|
Article number | 8672553 |
Pages (from-to) | 37448-37462 |
Number of pages | 15 |
Journal | IEEE Access |
Volume | 7 |
DOIs | |
State | Published - 2019 |
Funding
This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (Ministry of Science and ICT) under Grant 2018R1A1A1A05079398, in part by the Korea Institute of Science and Technology Information (KISTI) under Grant K-17-L03-C01-S03, and in part by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the U.S. DOE under Contract DE-AC05-00OR22725.
Keywords
- Big data
- fault tolerance
- geo-distributed data centers
- parallel system