Abstract
LU factorization is the most computationally intensive step in solving systems of linear equations. Once the LU factorization of the coefficient matrix has been obtained, the system can be readily solved using forward and backward substitution. The computational cost of LU factorization, in terms of floating point operations, is cubic in the matrix dimension. Various efforts have been made to improve the performance of LU factorization. We propose a multi-core multi-GPU hybrid LU factorization algorithm that leverages the strengths of both multiple CPUs and multiple GPUs. Our algorithm uses some of the CPU cores for panel factorization, and the rest of the CPU cores together with all the available GPUs for trailing submatrix updates. Our algorithm employs both dynamic scheduling and static scheduling. Experiments show that our approach reaches 1134 Gflop/s with 4 Fermi GPU boards combined with a total of 48 AMD CPU cores. This is the first time such a level of performance has been reported in a shared memory environment. An execution trace shows that our code also achieves good load balance and high system utilization.
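The split between panel factorization and trailing submatrix updates described in the abstract can be illustrated with a small serial sketch. This is not the authors' multi-core multi-GPU implementation: it is a plain Python, single-threaded, right-looking blocked LU with partial pivoting, and the function names (`panel_factor`, `trailing_update`, `lu_blocked`, `lu_solve`) and the block size are assumptions chosen for illustration. In a hybrid setting, the work done in `panel_factor` is what would go to a few CPU cores, while the work in `trailing_update` is what would be spread across the remaining cores and the GPUs.

```python
def panel_factor(a, piv, k, kb):
    """Factor the panel (columns k .. k+kb-1) in place with partial pivoting."""
    n = len(a)
    for j in range(k, k + kb):
        # Pick the largest entry in column j at or below the diagonal.
        p = max(range(j, n), key=lambda i: abs(a[i][j]))
        if a[p][j] == 0.0:
            raise ValueError("matrix is singular")
        if p != j:
            a[j], a[p] = a[p], a[j]          # swap whole rows
            piv[j], piv[p] = piv[p], piv[j]  # record the permutation
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]               # multiplier, stored in L
            for c in range(j + 1, k + kb):   # update inside the panel only
                a[i][c] -= a[i][j] * a[j][c]


def trailing_update(a, k, kb):
    """Apply the factored panel to everything to its right."""
    n = len(a)
    end = k + kb
    # Triangular solve: rows of U to the right of the panel.
    for j in range(k, end):
        for i in range(j + 1, end):
            for c in range(end, n):
                a[i][c] -= a[i][j] * a[j][c]
    # Matrix-matrix update of the trailing submatrix: A22 -= L21 * U12.
    for i in range(end, n):
        for j in range(k, end):
            lij = a[i][j]
            for c in range(end, n):
                a[i][c] -= lij * a[j][c]


def lu_blocked(matrix, block=2):
    """Return (lu, piv) with L (unit lower) and U packed into one matrix."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    piv = list(range(n))  # piv[i] = original index of the row now at i
    for k in range(0, n, block):
        kb = min(block, n - k)
        panel_factor(a, piv, k, kb)
        trailing_update(a, k, kb)
    return a, piv


def lu_solve(lu, piv, b):
    """Solve A x = b using the packed factors from lu_blocked."""
    n = len(lu)
    x = [float(b[p]) for p in piv]       # apply the row permutation to b
    for i in range(n):                   # forward substitution with L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):         # backward substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    lu, piv = lu_blocked(A, block=2)
    # The right-hand side is A times [1, 2, 3], so x should be close to that.
    print(lu_solve(lu, piv, [7.0, 19.0, 49.0]))
```

The three nested loops in the trailing update are where the cubic cost concentrates, which is why that part is the natural candidate for offloading to accelerators, while the panel step is comparatively small and harder to parallelize.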
Original language | English |
---|---|
Pages (from-to) | 106-115 |
Number of pages | 10 |
Journal | Procedia Computer Science |
Volume | 9 |
State | Published - 2012 |
Event | 12th Annual International Conference on Computational Science, ICCS 2012 - Omaha, NB, United States (Jun 4 2012 → Jun 6 2012) |
Funding
This work was supported by NSF through grant 1038814. Authors: Yulu Jia, Piotr Luszczek, Jack Dongarra.
Keywords
- Hardware accelerators
- Hybrid
- LU factorization
- Multi-core multi-GPU