TY - GEN
T1 - Accelerated Constrained Sparse Tensor Factorization on Massively Parallel Architectures
AU - Soh, Yongseok
AU - Kannan, Ramakrishnan
AU - Sao, Piyush
AU - Choi, Jee
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/8/12
Y1 - 2024/8/12
N2 - This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, and the first performance characterization of cSTF on GPU architectures. In contrast to prior work on tensor factorization, where the matricized tensor times Khatri-Rao product (MTTKRP) is the primary performance bottleneck, our systematic analysis of the cSTF algorithm on GPUs reveals that adding constraints creates an additional bottleneck in the update operation for many real-world sparse tensors. While executing the update operation on the GPU brings substantial speedup over its CPU counterpart, it remains a significant bottleneck. To further accelerate the update operation, we propose cuADMM, a new update algorithm that leverages algorithmic and code optimization strategies to minimize both computation and data movement on GPUs. As a result, our framework delivers significantly improved performance compared to the prior state of the art. On 10 real-world sparse tensors, our framework achieves geometric mean speedups of 5.1× (max 41.59×) and 7.01× (max 58.05×) on the NVIDIA A100 and H100 GPUs, respectively, over the state-of-the-art SPLATT library running on a 26-core Intel Ice Lake Xeon CPU.
KW - accelerated sparse tensor factorization
KW - algorithm
KW - constrained tensor factorization
KW - high performance
UR - http://www.scopus.com/inward/record.url?scp=85202434303&partnerID=8YFLogxK
U2 - 10.1145/3673038.3673128
DO - 10.1145/3673038.3673128
M3 - Conference contribution
AN - SCOPUS:85202434303
T3 - ACM International Conference Proceeding Series
SP - 107
EP - 116
BT - 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 53rd International Conference on Parallel Processing, ICPP 2024
Y2 - 12 August 2024 through 15 August 2024
ER -