TY - JOUR
T1 - Data-intensive document clustering on graphics processing unit (GPU) clusters
AU - Zhang, Yongpeng
AU - Mueller, Frank
AU - Cui, Xiaohui
AU - Potok, Thomas
PY - 2011/2
Y1 - 2011/2
N2 - Document clustering is a central method to mine massive amounts of data. Due to the explosion of raw documents generated on the Internet and the necessity to analyze them efficiently in various intelligent information systems, clustering techniques have reached their limitations on single processors. Instead of single processors, general-purpose multi-core chips are increasingly deployed in response to diminishing returns in single-processor speedup due to the frequency wall, but multi-core benefits only provide linear speedups while the number of documents in the Internet is growing exponentially. Accelerating hardware devices represent a novel promise for improving the performance for data-intensive problems such as document clustering. They offer more radical designs with a higher level of parallelism but adaptation to novel programming environments. In this paper, we assess the benefits of exploiting the computational power of graphics processing units (GPUs) to study two fundamental problems in document mining, namely to calculate the term frequencyinverse document frequency (TFIDF) and cluster a large set of documents. We transform traditional algorithms into accelerated parallel counterparts that can be efficiently executed on many-core GPU architectures. We assess our implementations on various platforms, ranging from stand-alone GPU desktops to Beowulf-like clusters equipped with contemporary GPU cards. We observe at least one order of magnitude speedups over CPU-only desktops and clusters. This demonstrates the potential of exploiting GPU clusters to efficiently solve massive document mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
AB - Document clustering is a central method to mine massive amounts of data. Due to the explosion of raw documents generated on the Internet and the necessity to analyze them efficiently in various intelligent information systems, clustering techniques have reached their limitations on single processors. Instead of single processors, general-purpose multi-core chips are increasingly deployed in response to diminishing returns in single-processor speedup due to the frequency wall, but multi-core benefits only provide linear speedups while the number of documents in the Internet is growing exponentially. Accelerating hardware devices represent a novel promise for improving the performance for data-intensive problems such as document clustering. They offer more radical designs with a higher level of parallelism but adaptation to novel programming environments. In this paper, we assess the benefits of exploiting the computational power of graphics processing units (GPUs) to study two fundamental problems in document mining, namely to calculate the term frequencyinverse document frequency (TFIDF) and cluster a large set of documents. We transform traditional algorithms into accelerated parallel counterparts that can be efficiently executed on many-core GPU architectures. We assess our implementations on various platforms, ranging from stand-alone GPU desktops to Beowulf-like clusters equipped with contemporary GPU cards. We observe at least one order of magnitude speedups over CPU-only desktops and clusters. This demonstrates the potential of exploiting GPU clusters to efficiently solve massive document mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
KW - Accelerators
KW - Data-intensive computing
KW - High-performance computing
UR - http://www.scopus.com/inward/record.url?scp=78650416698&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2010.08.002
DO - 10.1016/j.jpdc.2010.08.002
M3 - Article
AN - SCOPUS:78650416698
SN - 0743-7315
VL - 71
SP - 211
EP - 224
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
IS - 2
ER -