TY - GEN
T1 - Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction
AU - Ltaief, Hatem
AU - Luszczek, Piotr
AU - Dongarra, Jack
PY - 2012
Y1 - 2012
N2 - The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, the inherent problem of processing the panel in a domino-like fashion generates unnecessary sequential tasks. By using tree reduction, the panel is horizontally split, which creates another dimension of parallelism and engenders many concurrent tasks to be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, achieves up to a 29-fold speedup over the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3, and up to a 20-fold speedup over the standard two-stage PLASMA BRD, on an eight-socket hexa-core AMD Opteron multicore shared-memory system.
AB - The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, the inherent problem of processing the panel in a domino-like fashion generates unnecessary sequential tasks. By using tree reduction, the panel is horizontally split, which creates another dimension of parallelism and engenders many concurrent tasks to be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, achieves up to a 29-fold speedup over the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3, and up to a 20-fold speedup over the standard two-stage PLASMA BRD, on an eight-socket hexa-core AMD Opteron multicore shared-memory system.
KW - Bidiagonal Transformation
KW - Dynamic Scheduling
KW - High Performance Computing
KW - Multicore Architecture
KW - Tree Reduction
UR - http://www.scopus.com/inward/record.url?scp=84865266292&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-31464-3_67
DO - 10.1007/978-3-642-31464-3_67
M3 - Conference contribution
AN - SCOPUS:84865266292
SN - 9783642314636
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 661
EP - 670
BT - Parallel Processing and Applied Mathematics - 9th International Conference, PPAM 2011, Revised Selected Papers
T2 - 9th International Conference on Parallel Processing and Applied Mathematics, PPAM 2011
Y2 - 11 September 2011 through 14 September 2011
ER -