TY - GEN
T1 - Storage-aware task scheduling for performance optimization of big data workflows
AU - Ye, Qianwen
AU - Wu, Chase Q.
AU - Cao, Huiyan
AU - Rao, Nageswara S.V.
AU - Hou, Aiqin
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
AB - Many large-scale applications in various domains are generating big data, which are increasingly processed and analyzed by MapReduce-based workflows deployed in Hadoop systems. In addition to computing time, the makespan of such data-intensive workflows is also largely affected by communication cost. Particularly, there are two levels of data movement during the execution of distributed workflows in Hadoop: i) from map tasks to reduce tasks within each individual MapReduce module and ii) between each pair of adjacent modules in the workflow. Traditionally, these two aspects of network traffic have been treated separately as data locality at the task and module or job level, respectively. However, the interactions between these two levels of data movement may create complicated dynamics and their compound effects remain largely unexplored. In this paper, we formulate a task scheduling problem that considers data movement at both levels to minimize the end-to-end delay of a MapReduce-based workflow. We show this problem to be NP-complete, and design a storage-aware big data workflow scheduling algorithm, referred to as SA-BWS, to optimize workflow performance in Hadoop environments. The performance superiority of SA-BWS is illustrated by extensive simulations in comparison with the default workflow engine in Hadoop and existing scheduling methods.
KW - Big data workflow
KW - Data locality
KW - MapReduce
KW - Workflow optimization
KW - Workflow scheduling
UR - http://www.scopus.com/inward/record.url?scp=85063912833&partnerID=8YFLogxK
U2 - 10.1109/BDCloud.2018.00163
DO - 10.1109/BDCloud.2018.00163
M3 - Conference contribution
AN - SCOPUS:85063912833
T3 - Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018
SP - 1095
EP - 1102
BT - Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018
A2 - Chen, Jinjun
A2 - Yang, Laurence T.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018
Y2 - 11 December 2018 through 13 December 2018
ER -