TY - GEN
T1 - On analytics of file transfer rates over dedicated wide-area connections
AU - Sen, Satyabrata
AU - Rao, Nageswara S.V.
AU - Liu, Qiang
AU - Imam, Neena
AU - Kettimuthu, Rajkumar
AU - Foster, Ian
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/11/14
Y1 - 2017/11/14
AB - File transfers between decentralized storage sites over dedicated wide-area connections are becoming increasingly important in high-performance computing and big data scenarios. Designing such scientific workflows for large file transfers is extremely challenging as they depend on the file, I/O, host, and local- and wide-area network subsystems, and their interactions. To gain insights into file-transfer rate profiles, we develop polynomial, bagging, and boosting regression models for Lustre and XFS file-transfer measurements, which are collected using XDD over a suite of 10 Gbps connections with 0-366 ms round-trip times (RTTs). In addition to overall trends and analytics, these regressions also provide file-transfer rate estimates for RTTs and numbers of parallel flows at which measurements might not have been collected. They show that bagging and boosting techniques provide closer data fits than polynomial regression. We develop probabilistic bounds on the generalization error of these methods, which, combined with the cross-validation error, establish that the former two are more accurate estimators than polynomial regression. In addition, we present a method to efficiently determine the number of parallel flows needed to achieve a peak file-transfer rate using fewer measurements than a full sweep; in our measurements, the peak is achieved in 96% of cases with 15-25% of the measurements of a full sweep.
KW - TCP
KW - Wide area transport
KW - cross-validation
KW - dedicated connections
KW - fast probing
KW - regression
KW - throughput profiling
UR - https://www.scopus.com/pages/publications/85043762242
U2 - 10.1109/eScience.2017.93
DO - 10.1109/eScience.2017.93
M3 - Conference contribution
AN - SCOPUS:85043762242
T3 - Proceedings - 13th IEEE International Conference on eScience, eScience 2017
SP - 576
EP - 585
BT - Proceedings - 13th IEEE International Conference on eScience, eScience 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IEEE International Conference on eScience, eScience 2017
Y2 - 24 October 2017 through 27 October 2017
ER -