TY - GEN
T1 - Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs
AU - Mary, Théo
AU - Yamazaki, Ichitaro
AU - Kurzak, Jakub
AU - Luszczek, Piotr
AU - Tomov, Stanimire
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/11/15
Y1 - 2015/11/15
N2 - A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8x faster than the deterministic approach for computing an approximation of the same accuracy. We also present the parallel scaling of random sampling across multiple GPUs on a single compute node, showing a speedup of 3.8x on three Kepler GPUs. These results demonstrate the potential of random sampling as an excellent computational tool for many applications, a potential that is likely to grow on emerging computers as communication costs increase.
AB - A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8x faster than the deterministic approach for computing an approximation of the same accuracy. We also present the parallel scaling of random sampling across multiple GPUs on a single compute node, showing a speedup of 3.8x on three Kepler GPUs. These results demonstrate the potential of random sampling as an excellent computational tool for many applications, a potential that is likely to grow on emerging computers as communication costs increase.
UR - http://www.scopus.com/inward/record.url?scp=84966692129&partnerID=8YFLogxK
U2 - 10.1145/2807591.2807613
DO - 10.1145/2807591.2807613
M3 - Conference contribution
AN - SCOPUS:84966692129
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2015
PB - Association for Computing Machinery
T2 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015
Y2 - 15 November 2015 through 20 November 2015
ER -