TY - JOUR
T1 - The Design and Implementation of the Reduction Routines in ScaLAPACK
AU - Choi, Jaeyoung
AU - Dongarra, Jack
AU - Ostrouchov, Susan
AU - Petitet, Antoine P.
AU - Walker, David W.
AU - Whaley, R. Clint
PY - 1995/1/1
Y1 - 1995/1/1
AB - This chapter discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of the Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together, the distributed BLAS and the BLACS can be used to construct higher-level algorithms and hide many details of the parallelism from the application developer. The block-cyclic data distribution is described and adopted as a good way of distributing block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.
UR - http://www.scopus.com/inward/record.url?scp=85023430704&partnerID=8YFLogxK
U2 - 10.1016/S0927-5452(06)80013-4
DO - 10.1016/S0927-5452(06)80013-4
M3 - Article
AN - SCOPUS:85023430704
SN - 0927-5452
VL - 10
SP - 177
EP - 202
JO - Advances in Parallel Computing
JF - Advances in Parallel Computing
IS - C
ER -