TY - GEN
T1 - Large-scale multi-dimensional document clustering on GPU clusters
AU - Zhang, Yongpeng
AU - Mueller, Frank
AU - Cui, Xiaohui
AU - Potok, Thomas
PY - 2010
Y1 - 2010
N2 - Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node moderate GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrate the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
AB - Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node moderate GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrate the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
UR - http://www.scopus.com/inward/record.url?scp=77954001837&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2010.5470429
DO - 10.1109/IPDPS.2010.5470429
M3 - Conference contribution
AN - SCOPUS:77954001837
SN - 9781424464432
T3 - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
BT - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
T2 - 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Y2 - 19 April 2010 through 23 April 2010
ER -