TY - JOUR
T1 - Accelerating scientific computations with mixed precision algorithms
AU - Baboulin, Marc
AU - Buttari, Alfredo
AU - Dongarra, Jack
AU - Kurzak, Jakub
AU - Langou, Julie
AU - Langou, Julien
AU - Luszczek, Piotr
AU - Tomov, Stanimire
PY - 2009/12
Y1 - 2009/12
N2 - On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summary: Program title: ITER-REF. Catalogue identifier: AECO_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 7211. No. of bytes in distributed program, including test data, etc.: 41 862. Distribution format: tar.gz. Programming language: FORTRAN 77. Computer: desktop, server. Operating system: Unix/Linux. RAM: 512 Mbytes. Classification: 4.8. External routines: BLAS (optional). Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability resulting in a factorization P A = L U, where P is a permutation matrix. The solution for the system is achieved by first solving L y = P b (forward substitution) and then solving U x = y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. In order to improve the computed solution, an iterative process can be applied, which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Running time: seconds/minutes.
AB - On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summary: Program title: ITER-REF. Catalogue identifier: AECO_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 7211. No. of bytes in distributed program, including test data, etc.: 41 862. Distribution format: tar.gz. Programming language: FORTRAN 77. Computer: desktop, server. Operating system: Unix/Linux. RAM: 512 Mbytes. Classification: 4.8. External routines: BLAS (optional). Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability resulting in a factorization P A = L U, where P is a permutation matrix. The solution for the system is achieved by first solving L y = P b (forward substitution) and then solving U x = y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. In order to improve the computed solution, an iterative process can be applied, which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Running time: seconds/minutes.
KW - Iterative refinement
KW - Mixed precision
KW - Numerical linear algebra
UR - http://www.scopus.com/inward/record.url?scp=70350583245&partnerID=8YFLogxK
U2 - 10.1016/j.cpc.2008.11.005
DO - 10.1016/j.cpc.2008.11.005
M3 - Article
AN - SCOPUS:70350583245
SN - 0010-4655
VL - 180
SP - 2526
EP - 2533
JO - Computer Physics Communications
JF - Computer Physics Communications
IS - 12
ER -