TY - GEN
T1 - BLAS-3 Optimized by OmpSs Regions (LASs Library)
AU - Valero-Lara, Pedro
AU - Catalán, Sandra
AU - Martorell, Xavier
AU - Labarta, Jesús
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/3/19
Y1 - 2019/3/19
N2 - In this paper we propose a set of optimizations for the BLAS-3 routines of the LASs library (Linear Algebra routines on OmpSs) and perform a detailed analysis of the impact of the proposed changes in terms of performance and execution time. OmpSs allows the use of regions in task dependences. This helps not only in programming the algorithmic optimizations, but also in reducing the execution time achieved by such optimizations. Different strategies are implemented to reduce the number of tasks created (when there is enough parallelism) during the execution of BLAS-3 operations in the original LASs. A higher IPC is also obtained thanks to better exploitation of the memory hierarchy. More specifically, we increase performance, in particular on big matrices, by about 12% for TRSM and 17% for GEMM with respect to the original version of LASs, even when using fewer cores in the case of GEMM/SYMM. Moreover, when LASs is compared to PLASMA, the OpenMP reference dense linear algebra library, performance is increased by up to 12.5% for GEMM/SYMM, while for TRSM/TRMM this figure rises to 15%.
AB - In this paper we propose a set of optimizations for the BLAS-3 routines of the LASs library (Linear Algebra routines on OmpSs) and perform a detailed analysis of the impact of the proposed changes in terms of performance and execution time. OmpSs allows the use of regions in task dependences. This helps not only in programming the algorithmic optimizations, but also in reducing the execution time achieved by such optimizations. Different strategies are implemented to reduce the number of tasks created (when there is enough parallelism) during the execution of BLAS-3 operations in the original LASs. A higher IPC is also obtained thanks to better exploitation of the memory hierarchy. More specifically, we increase performance, in particular on big matrices, by about 12% for TRSM and 17% for GEMM with respect to the original version of LASs, even when using fewer cores in the case of GEMM/SYMM. Moreover, when LASs is compared to PLASMA, the OpenMP reference dense linear algebra library, performance is increased by up to 12.5% for GEMM/SYMM, while for TRSM/TRMM this figure rises to 15%.
KW - BLAS-3
KW - OmpSs
KW - regions
KW - tasking
UR - http://www.scopus.com/inward/record.url?scp=85063866967&partnerID=8YFLogxK
U2 - 10.1109/EMPDP.2019.8671545
DO - 10.1109/EMPDP.2019.8671545
M3 - Conference contribution
AN - SCOPUS:85063866967
T3 - Proceedings - 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2019
SP - 25
EP - 32
BT - Proceedings - 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2019
Y2 - 13 February 2019 through 15 February 2019
ER -