TY - GEN
T1 - Generating efficient tensor contractions for GPUs
AU - Nelson, Thomas
AU - Rivera, Axel
AU - Balaprakash, Prasanna
AU - Hall, Mary
AU - Hovland, Paul D.
AU - Jessup, Elizabeth
AU - Norris, Boyana
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/8
Y1 - 2015/12/8
N2 - Many scientific and numerical applications, including quantum chemistry modeling and fluid dynamics simulation, require tensor product and tensor contraction evaluation. Tensor computations are characterized by arrays with numerous dimensions, inherent parallelism, moderate data reuse, and many degrees of freedom in the order in which to perform the computation. The best-performing implementation is heavily dependent on the tensor dimensionality and the target architecture. In this paper, we map tensor computations to GPUs, starting with a high-level tensor input language and producing efficient CUDA code as output. Our approach is to combine tensor-specific mathematical transformations with a GPU decision algorithm, machine learning, and autotuning of a large parameter space. Generated code shows significant performance gains over sequential and OpenMP parallel code, and a comparison with OpenACC shows the importance of autotuning and other optimizations in our framework for achieving efficient results.
AB - Many scientific and numerical applications, including quantum chemistry modeling and fluid dynamics simulation, require tensor product and tensor contraction evaluation. Tensor computations are characterized by arrays with numerous dimensions, inherent parallelism, moderate data reuse, and many degrees of freedom in the order in which to perform the computation. The best-performing implementation is heavily dependent on the tensor dimensionality and the target architecture. In this paper, we map tensor computations to GPUs, starting with a high-level tensor input language and producing efficient CUDA code as output. Our approach is to combine tensor-specific mathematical transformations with a GPU decision algorithm, machine learning, and autotuning of a large parameter space. Generated code shows significant performance gains over sequential and OpenMP parallel code, and a comparison with OpenACC shows the importance of autotuning and other optimizations in our framework for achieving efficient results.
KW - Autotuning
KW - GPUs
KW - Tensor contraction
UR - http://www.scopus.com/inward/record.url?scp=84976468637&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2015.106
DO - 10.1109/ICPP.2015.106
M3 - Conference contribution
AN - SCOPUS:84976468637
T3 - Proceedings of the International Conference on Parallel Processing
SP - 969
EP - 978
BT - Proceedings - 2015 44th International Conference on Parallel Processing, ICPP 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th International Conference on Parallel Processing, ICPP 2015
Y2 - 1 September 2015 through 4 September 2015
ER -