TY - GEN
T1 - Juggler: A dependence-aware task-based execution framework for GPUs
T2 - 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018
AU - Belviranli, Mehmet E.
AU - Lee, Seyong
AU - Vetter, Jeffrey S.
AU - Bhuyan, Laxmi N.
N1 - Publisher Copyright:
© 2018 Copyright held by the owner/author(s).
PY - 2018/2/10
Y1 - 2018/2/10
AB - Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with inter-block data dependences efficiently, we need fine-grained, task-based execution models that treat the SMs inside a GPU as standalone parallel processing units. Such a scheme enables faster execution by utilizing all internal computation elements inside the GPU and eliminating unnecessary waits at device-wide global barriers. In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, thereby eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host for the remainder of execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to a 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
KW - Data dependence
KW - GP-GPU programming
KW - OpenMP 4.5
KW - Task-based execution
UR - http://www.scopus.com/inward/record.url?scp=85044325704&partnerID=8YFLogxK
U2 - 10.1145/3178487.3178492
DO - 10.1145/3178487.3178492
M3 - Conference contribution
AN - SCOPUS:85044325704
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 54
EP - 67
BT - PPoPP 2018 - Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
Y2 - 24 February 2018 through 28 February 2018
ER -
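
As a hedged illustration of the kind of OpenMP 4.5 task input the abstract describes, the C sketch below expresses a tiled wavefront update whose inter-block dependences are declared with task depend clauses. The array name, tile size, and stencil are illustrative assumptions only, and the code is shown as plain host-side OpenMP tasking; Juggler's translation of such tasks to its in-device GPU runtime is not reproduced here.

/*
 * Minimal sketch (assumed example, not from the paper): a tiled
 * wavefront sweep where each tile is an OpenMP 4.5 task and its
 * dependences on the tiles above and to its left are expressed with
 * depend clauses instead of global barriers between sweeps.
 */
#include <stdio.h>

#define N   1024          /* grid dimension (assumed)      */
#define BS   128          /* tile (block) size (assumed)   */

static double a[N][N];

int main(void)
{
    /* Initialize the grid with arbitrary values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(i + j);

    #pragma omp parallel
    #pragma omp single
    {
        /* One task per tile; the top-left element of each tile is used
         * as the representative item in the depend clauses. */
        for (int ii = BS; ii < N; ii += BS) {
            for (int jj = BS; jj < N; jj += BS) {
                #pragma omp task depend(in: a[ii - BS][jj], a[ii][jj - BS]) \
                                 depend(out: a[ii][jj])
                {
                    for (int i = ii; i < ii + BS; i++)
                        for (int j = jj; j < jj + BS; j++)
                            a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
                }
            }
        }
    }   /* the barrier at the end of the single region waits for all tasks */

    printf("a[N-1][N-1] = %f\n", a[N - 1][N - 1]);
    return 0;
}

Compiled with any OpenMP 4.5 compiler (for example, gcc -fopenmp), each tile becomes a task whose predecessors are the tiles directly above and to its left; this is the cross-block dependence pattern that, in a conventional GPU implementation, would otherwise require device-wide global synchronization between sweeps.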