TY - GEN
T1 - Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations
AU - Wu, Bo
AU - Chen, Guoyang
AU - Li, Dong
AU - Shen, Xipeng
AU - Vetter, Jeffrey
N1 - Publisher Copyright:
© Copyright 2015 ACM.
PY - 2015/6/8
Y1 - 2015/6/8
N2 - A GPU's computing power lies in its abundant memory bandwidth and massive parallelism. However, its hardware thread schedulers, despite being able to quickly distribute computation to processors, often fail to capitalize on program characteristics effectively, achieving only a fraction of the GPU's full potential. Moreover, current GPUs do not allow programmers or compilers to control this thread scheduling, forfeiting important optimization opportunities at the program level. This paper presents a transformation centered on Streaming Multiprocessors (SM); this software approach to circumventing the limitations of the hardware scheduler allows flexible program-level control of scheduling. By permitting precise control of job locality on SMs, the transformation overcomes inherent limitations in prior methods. With this technique, flexible control of GPU scheduling at the program level becomes feasible, which opens up new opportunities for GPU program optimizations. The second part of the paper explores how the new opportunities could be leveraged for GPU performance enhancement, what complexities there are, and how to address them. We show that some simple optimization techniques can enhance co-runs of multiple kernels and improve data locality of irregular applications, producing 20-33% average increase in performance, system throughput, and average turnaround time.
AB - A GPU's computing power lies in its abundant memory bandwidth and massive parallelism. However, its hardware thread schedulers, despite being able to quickly distribute computation to processors, often fail to capitalize on program characteristics effectively, achieving only a fraction of the GPU's full potential. Moreover, current GPUs do not allow programmers or compilers to control this thread scheduling, forfeiting important optimization opportunities at the program level. This paper presents a transformation centered on Streaming Multiprocessors (SM); this software approach to circumventing the limitations of the hardware scheduler allows flexible program-level control of scheduling. By permitting precise control of job locality on SMs, the transformation overcomes inherent limitations in prior methods. With this technique, flexible control of GPU scheduling at the program level becomes feasible, which opens up new opportunities for GPU program optimizations. The second part of the paper explores how the new opportunities could be leveraged for GPU performance enhancement, what complexities there are, and how to address them. We show that some simple optimization techniques can enhance co-runs of multiple kernels and improve data locality of irregular applications, producing 20-33% average increase in performance, system throughput, and average turnaround time.
KW - Compiler transformation
KW - Data affinity
KW - GPGPU
KW - Program co-run
KW - Scheduling
UR - http://www.scopus.com/inward/record.url?scp=84957606190&partnerID=8YFLogxK
U2 - 10.1145/2751205.2751213
DO - 10.1145/2751205.2751213
M3 - Conference contribution
AN - SCOPUS:84957606190
T3 - Proceedings of the International Conference on Supercomputing
SP - 119
EP - 130
BT - ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 29th ACM International Conference on Supercomputing, ICS 2015
Y2 - 8 June 2015 through 11 June 2015
ER -