TY - JOUR
T1 - Scalability issues affecting the design of a dense linear algebra library
AU - Dongarra, Jack J.
AU - Vandegeijn, Robert A.
AU - Walker, David W.
PY - 1994/9
Y1 - 1994/9
N2 - This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, that is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms′ performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timings results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
AB - This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, that is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms′ performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timings results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
UR - http://www.scopus.com/inward/record.url?scp=0000778168&partnerID=8YFLogxK
U2 - 10.1006/jpdc.1994.1108
DO - 10.1006/jpdc.1994.1108
M3 - Article
AN - SCOPUS:0000778168
SN - 0743-7315
VL - 22
SP - 523
EP - 537
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
IS - 3
ER -