TY - GEN
T1 - Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems
AU - Song, Fengguang
AU - Yarkhan, Asim
AU - Dongarra, Jack
PY - 2009
Y1 - 2009
AB - This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms on multicore systems, either shared-memory or distributed-memory. We use a task-based library to replace existing linear algebra subroutines such as PBLAS, transparently providing the same interface and computational functionality as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. Our runtime system design focuses primarily on performance scalability. We propose a distributed algorithm that resolves data dependences without inter-process cooperation. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky, LU, and QR factorizations. Our experiments on both shared-memory machines (16 and 32 cores) and distributed-memory machines (1024 cores) demonstrate that our runtime system achieves good scalability. Furthermore, we provide an analytical study showing why the tiled algorithms are scalable and deriving their expected execution time.
UR - http://www.scopus.com/inward/record.url?scp=74049102092&partnerID=8YFLogxK
U2 - 10.1145/1654059.1654079
DO - 10.1145/1654059.1654079
M3 - Conference contribution
AN - SCOPUS:74049102092
SN - 9781605587448
T3 - Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
BT - Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
T2 - Conference on High Performance Computing Networking, Storage and Analysis, SC '09
Y2 - 14 November 2009 through 20 November 2009
ER -