TY - GEN
T1 - Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC
AU - Herault, Thomas
AU - Robert, Yves
AU - Bosilca, George
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the C matrix can entirely fit in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node. The algorithm is written with the PaRSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance.
AB - This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the C matrix can entirely fit in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node. The algorithm is written with the PaRSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance.
KW - Accelerator architectures
KW - Linear Algebra
KW - Runtime environment
UR - http://www.scopus.com/inward/record.url?scp=85078700382&partnerID=8YFLogxK
U2 - 10.1109/ScalA49573.2019.00010
DO - 10.1109/ScalA49573.2019.00010
M3 - Conference contribution
AN - SCOPUS:85078700382
T3 - Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 33
EP - 41
BT - Proceedings of ScalA 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019
Y2 - 18 November 2019
ER -