TY - GEN
T1 - Hierarchical QR factorization algorithms for multi-core cluster systems
AU - Dongarra, Jack
AU - Faverge, Mathieu
AU - Herault, Thomas
AU - Langou, Julien
AU - Robert, Yves
PY - 2012
Y1 - 2012
N2 - This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. %equipped with accelerators. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of multicores, in order to minimize the number of inter-processor communications (aka, "communication- avoiding" algorithm), it is natural to consider two-level hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low level" for decoupled highly parallel inter-node reductions, (2) "coupling level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-cluster and intra-cluster. Numerical experiments on a cluster of multicore nodes (1) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (2) build insights on how these levels influence performance and interact within each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization softwares for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
AB - This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. %equipped with accelerators. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of multicores, in order to minimize the number of inter-processor communications (aka, "communication- avoiding" algorithm), it is natural to consider two-level hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low level" for decoupled highly parallel inter-node reductions, (2) "coupling level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-cluster and intra-cluster. Numerical experiments on a cluster of multicore nodes (1) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (2) build insights on how these levels influence performance and interact within each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization softwares for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
KW - QR factorization
KW - cluster
KW - distributed memory
KW - hierarchical architecture
KW - multicore
KW - numerical linear algebra
UR - http://www.scopus.com/inward/record.url?scp=84866882555&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2012.62
DO - 10.1109/IPDPS.2012.62
M3 - Conference contribution
AN - SCOPUS:84866882555
SN - 9780769546759
T3 - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
SP - 607
EP - 618
BT - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
T2 - 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Y2 - 21 May 2012 through 25 May 2012
ER -