TY - GEN
T1 - Big data analytics on HPC architectures
T2 - 4th IEEE International Conference on Big Data, Big Data 2016
AU - Xenopoulos, Peter
AU - Daniel, Jamison
AU - Matheson, Michael
AU - Sukumar, Sreenivas
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016
Y1 - 2016
AB - Data-driven science, accompanied by the explosion of petabytes of data, has created a need for dedicated analytics computing resources. Dedicated analytics clusters require large capital outlays due to their expensive hardware requirements. Additionally, if such resources are located far from the data they analyze, they incur substantial data transfer, which has both cost and latency implications. In this paper, we benchmark a variety of high-performance computing (HPC) architectures on classic data science algorithms and conduct a cost analysis of these architectures. We also compare algorithms across analytic frameworks and explore hidden costs in the form of queuing mechanisms. We observe that node architectures with large memory and high memory bandwidth are better suited for big data analytics on HPC hardware. We also conclude that cloud computing is more cost-effective for small or experimental data workloads, but HPC is more cost-effective at scale. Additionally, we quantify the hidden costs of queuing and how they relate to data science workloads. Finally, we observe that software developed for the cloud, such as Spark, performs significantly worse than pbdR when run in HPC environments.
UR - http://www.scopus.com/inward/record.url?scp=85015220060&partnerID=8YFLogxK
U2 - 10.1109/BigData.2016.7840861
DO - 10.1109/BigData.2016.7840861
M3 - Conference contribution
AN - SCOPUS:85015220060
T3 - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
SP - 2286
EP - 2295
BT - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
A2 - Ak, Ronay
A2 - Karypis, George
A2 - Xia, Yinglong
A2 - Hu, Xiaohua Tony
A2 - Yu, Philip S.
A2 - Joshi, James
A2 - Ungar, Lyle
A2 - Liu, Ling
A2 - Sato, Aki-Hiro
A2 - Suzumura, Toyotaro
A2 - Rachuri, Sudarsan
A2 - Govindaraju, Rama
A2 - Xu, Weijia
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 December 2016 through 8 December 2016
ER -