Abstract
Singular Value QR (SVQR) can orthonormalize a set of dense vectors with minimal communication: it needs only one global reduction among the parallel processing units and uses BLAS-3 for most of its local computation. As a result, SVQR outperforms other orthogonalization schemes on many current computers, where communication has become significantly more expensive than arithmetic operations. In this article, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU. Our focus is on the dense triangular solve, which performs half of the total floating-point operations of SVQR. As part of this study, we examine an adaptive mixed-precision variant of SVQR, which decides at runtime whether lower-precision arithmetic can be used for the triangular solve without increasing the order of its orthogonality error (though its backward error is significantly greater). If the greater backward error can be tolerated, then our performance results with an NVIDIA Kepler GPU show that the mixed-precision SVQR can obtain a speedup of up to 1.36 over the standard SVQR.
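The steps the abstract refers to can be illustrated with a minimal NumPy sketch. It assumes the usual formulation of SVQR (Gram matrix, its singular value decomposition, a small QR factorization, then a triangular solve); it is not the authors' GPU implementation. The names `svqr` and `solve_upper_right`, the eigenvalue floor, and the `solve_dtype` switch are hypothetical, and the runtime test that the adaptive variant uses to choose the precision is not stated in the abstract and is not reproduced here.

```python
import numpy as np


def solve_upper_right(V, R, dtype):
    """Triangular solve Q R = V for Q (R upper triangular), carried out in `dtype`.

    Hypothetical helper: a column-by-column substitution standing in for a
    BLAS-3 TRSM call.
    """
    V = V.astype(dtype)
    R = R.astype(dtype)
    Q = np.empty_like(V)
    for j in range(R.shape[0]):
        # Column j of V equals Q[:, :j+1] @ R[:j+1, j]; isolate Q[:, j].
        Q[:, j] = (V[:, j] - Q[:, :j] @ R[:j, j]) / R[j, j]
    return Q


def svqr(V, solve_dtype=np.float64):
    """Sketch of SVQR: returns (Q, R) with V ~= Q @ R and Q.T @ Q ~= I."""
    # Gram matrix. With V distributed by rows, this product is the single
    # global reduction.
    G = V.T @ V
    # G is symmetric positive semidefinite, so its eigendecomposition
    # G = U diag(w) U^T is also its SVD.
    w, U = np.linalg.eigh(G)
    # Assumption: floor tiny or negative eigenvalues so R stays nonsingular.
    w = np.maximum(w, np.finfo(G.dtype).eps * w[-1])
    # QR of sqrt(diag(w)) U^T yields an upper-triangular R with R^T R = G.
    R = np.linalg.qr(np.sqrt(w)[:, None] * U.T, mode="r")
    # Triangular solve Q = V R^{-1}, optionally in lower precision.
    Q = solve_upper_right(V, R, solve_dtype)
    return Q.astype(V.dtype), R


rng = np.random.default_rng(0)
V = rng.standard_normal((200, 8))
Q64, R = svqr(V)                          # standard: solve in double precision
Q32, _ = svqr(V, solve_dtype=np.float32)  # mixed: solve in single precision
err64 = np.linalg.norm(Q64.T @ Q64 - np.eye(8))
err32 = np.linalg.norm(Q32.T @ Q32 - np.eye(8))
print(err64, err32)
```

Only the Gram matrix needs data from every processing unit; the decomposition and QR act on a small square matrix, and the triangular solve is local to each unit's rows of V. That is why the precision of the solve can be varied independently of the other steps.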
Original language | English |
---|---|
Article number | a10 |
Journal | ACM Transactions on Mathematical Software |
Volume | 43 |
Issue number | 2 |
State | Published - Sep 2016 |
Externally published | Yes |
Keywords
- GPU computation
- Mixed precision
- Orthogonalization