TY - GEN
T1 - Pipelining/Overlapping data transfer for distributed data-intensive job execution
AU - Jung, Eun Sung
AU - Maheshwari, Ketan
AU - Kettimuthu, Rajkumar
PY - 2013
Y1 - 2013
N2 - Scientific workflows are increasingly gaining attention as both data and compute resources become larger, more heterogeneous, and more distributed. Many scientific workflows are both compute intensive and data intensive and use distributed resources. This situation poses significant challenges in terms of real-time remote analysis and dissemination of massive datasets to scientists across the community. These challenges will be exacerbated in the exascale era. Parallel jobs in scientific workflows are common, and such parallelism can be exploited by scheduling parallel jobs among multiple execution sites for enhanced performance. Previous scheduling algorithms such as heterogeneous earliest finish time (HEFT) did not focus on scheduling the thousands of jobs often seen in contemporary applications. Some techniques, such as task clustering, have been proposed to reduce the overhead of scheduling a large number of jobs. However, scheduling massively parallel jobs in distributed environments poses new challenges because data movement becomes a nontrivial factor. We propose efficient parallel execution models through pipelined execution of data transfer, incorporating network bandwidth and reserved resources at an execution site. We formally analyze those models and suggest the best model with the optimal degree of parallelism. We implement our model in the Swift parallel scripting paradigm using GridFTP. Experiments on real distributed computing resources show that our model with optimal degrees of parallelism outperforms the current parallel execution model, reducing total execution time by as much as 50%.
AB - Scientific workflows are increasingly gaining attention as both data and compute resources become larger, more heterogeneous, and more distributed. Many scientific workflows are both compute intensive and data intensive and use distributed resources. This situation poses significant challenges in terms of real-time remote analysis and dissemination of massive datasets to scientists across the community. These challenges will be exacerbated in the exascale era. Parallel jobs in scientific workflows are common, and such parallelism can be exploited by scheduling parallel jobs among multiple execution sites for enhanced performance. Previous scheduling algorithms such as heterogeneous earliest finish time (HEFT) did not focus on scheduling the thousands of jobs often seen in contemporary applications. Some techniques, such as task clustering, have been proposed to reduce the overhead of scheduling a large number of jobs. However, scheduling massively parallel jobs in distributed environments poses new challenges because data movement becomes a nontrivial factor. We propose efficient parallel execution models through pipelined execution of data transfer, incorporating network bandwidth and reserved resources at an execution site. We formally analyze those models and suggest the best model with the optimal degree of parallelism. We implement our model in the Swift parallel scripting paradigm using GridFTP. Experiments on real distributed computing resources show that our model with optimal degrees of parallelism outperforms the current parallel execution model, reducing total execution time by as much as 50%.
UR - http://www.scopus.com/inward/record.url?scp=84893326807&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2013.93
DO - 10.1109/ICPP.2013.93
M3 - Conference contribution
AN - SCOPUS:84893326807
SN - 9780769551173
T3 - Proceedings of the International Conference on Parallel Processing
SP - 791
EP - 797
BT - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 42nd Annual International Conference on Parallel Processing, ICPP 2013
Y2 - 1 October 2013 through 4 October 2013
ER -