TY - GEN
T1 - Explaining wide area data transfer performance
AU - Liu, Zhengchun
AU - Balaprakash, Prasanna
AU - Kettimuthu, Rajkumar
AU - Foster, Ian
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/26
Y1 - 2017/6/26
N2 - Disk-to-disk wide-area file transfers involve many subsystems and tunable application parameters that pose significant challenges for bottleneck detection, system optimization, and performance prediction. Performance models can be used to address these challenges but have not proved generally usable because of a need for extensive online experiments to characterize subsystems. We show here how to overcome the need for such experiments by applying machine learning methods to historical data to estimate parameters for predictive models. Starting with log data for millions of Globus transfers involving billions of files and hundreds of petabytes, we engineer features for endpoint CPU load, network interface card load, and transfer characteristics; and we use these features in both linear and nonlinear models of transfer performance, We show that the resulting models have high explanatory power. For a representative set of 30,653 transfers over 30 heavily used source-destination pairs ("edges"), totaling 2,053 TB in 46.6 million files, we obtain median absolute percentage prediction errors (MdAPE) of 7.0% and 4.6% when using distinct linear and nonlinear models per edge, respectively; when using a single nonlinear model for all edges, we obtain an MdAPE of 7.8%. Our work broadens understanding of factors that influence file transfer rate by clarifying relationships between achieved transfer rates, transfer characteristics, and competing load. Our predictions can be used for distributed workflow scheduling and optimization, and our features can also be used for optimization and explanation.
AB - Disk-to-disk wide-area file transfers involve many subsystems and tunable application parameters that pose significant challenges for bottleneck detection, system optimization, and performance prediction. Performance models can be used to address these challenges but have not proved generally usable because of a need for extensive online experiments to characterize subsystems. We show here how to overcome the need for such experiments by applying machine learning methods to historical data to estimate parameters for predictive models. Starting with log data for millions of Globus transfers involving billions of files and hundreds of petabytes, we engineer features for endpoint CPU load, network interface card load, and transfer characteristics; and we use these features in both linear and nonlinear models of transfer performance, We show that the resulting models have high explanatory power. For a representative set of 30,653 transfers over 30 heavily used source-destination pairs ("edges"), totaling 2,053 TB in 46.6 million files, we obtain median absolute percentage prediction errors (MdAPE) of 7.0% and 4.6% when using distinct linear and nonlinear models per edge, respectively; when using a single nonlinear model for all edges, we obtain an MdAPE of 7.8%. Our work broadens understanding of factors that influence file transfer rate by clarifying relationships between achieved transfer rates, transfer characteristics, and competing load. Our predictions can be used for distributed workflow scheduling and optimization, and our features can also be used for optimization and explanation.
UR - http://www.scopus.com/inward/record.url?scp=85025829346&partnerID=8YFLogxK
U2 - 10.1145/3078597.3078605
DO - 10.1145/3078597.3078605
M3 - Conference contribution
AN - SCOPUS:85025829346
T3 - HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
SP - 167
EP - 178
BT - HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 26th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017
Y2 - 26 June 2017 through 30 June 2017
ER -