TY - GEN
T1 - High performance streaming tensor decomposition
AU - Soh, Yongseok
AU - Flick, Patrick
AU - Liu, Xing
AU - Smith, Shaden
AU - Checconi, Fabio
AU - Petrini, Fabrizio
AU - Choi, Jee
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/5
Y1 - 2021/5
N2 - We present a new algorithm for computing tensor decomposition on streaming data that achieves up to 102× speedup over the state-of-the-art CP-stream algorithm through lower computational complexity and performance optimization. For each streaming time slice, our algorithm partitions the factor matrix rows into those with and without updates and keeps them in Gram matrix form to significantly reduce the required computation. We also improve the scalability and performance of the matricized tensor times Khatri-Rao product (MTTKRP) kernel, a key performance bottleneck in many tensor decomposition algorithms, by reducing the synchronization overhead through the combined use of mutex locks and thread-local memory. For problems with constraints (e.g., non-negativity), we apply data blocking and operation fusion to the alternating direction method of multipliers (ADMM) kernel in the constrained CP-stream algorithm. By combining this ADMM optimization with the aforementioned MTTKRP optimization, our improved algorithm achieves up to 47× speedup over the original. We evaluate the performance and scalability of our new algorithm and optimization techniques using a 56-core quad-socket Intel Xeon system on four representative real-world tensors.
AB - We present a new algorithm for computing tensor decomposition on streaming data that achieves up to 102× speedup over the state-of-the-art CP-stream algorithm through lower computational complexity and performance optimization. For each streaming time slice, our algorithm partitions the factor matrix rows into those with and without updates and keeps them in Gram matrix form to significantly reduce the required computation. We also improve the scalability and performance of the matricized tensor times Khatri-Rao product (MTTKRP) kernel, a key performance bottleneck in many tensor decomposition algorithms, by reducing the synchronization overhead through the combined use of mutex locks and thread-local memory. For problems with constraints (e.g., non-negativity), we apply data blocking and operation fusion to the alternating direction method of multipliers (ADMM) kernel in the constrained CP-stream algorithm. By combining this ADMM optimization with the aforementioned MTTKRP optimization, our improved algorithm achieves up to 47× speedup over the original. We evaluate the performance and scalability of our new algorithm and optimization techniques using a 56-core quad-socket Intel Xeon system on four representative real-world tensors.
KW - Algorithm
KW - High performance
KW - Streaming
KW - Tensor decomposition
UR - https://www.scopus.com/pages/publications/85106429168
U2 - 10.1109/IPDPS49936.2021.00078
DO - 10.1109/IPDPS49936.2021.00078
M3 - Conference contribution
AN - SCOPUS:85106429168
T3 - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
SP - 683
EP - 692
BT - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
Y2 - 17 May 2021 through 21 May 2021
ER -