TY - GEN
T1 - Scalable tile communication-avoiding QR factorization on multicore cluster systems
AU - Song, Fengguang
AU - Ltaief, Hatem
AU - Hadri, Bilel
AU - Dongarra, Jack
PY - 2010
Y1 - 2010
N2 - As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.
UR - http://www.scopus.com/inward/record.url?scp=78650817787&partnerID=8YFLogxK
U2 - 10.1109/SC.2010.48
DO - 10.1109/SC.2010.48
M3 - Conference contribution
AN - SCOPUS:78650817787
SN - 9781424475575
T3 - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
BT - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
T2 - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
Y2 - 13 November 2010 through 19 November 2010
ER -