Abstract
We study parallel connected components algorithms onGPUs in comparison with CPUs. Although straightforward implementation of PRAMalgorithms performs relatively better on GPUs than on CPUs, the GPU memory subsystem performance is poor due to non-coalesced random accesses.
We argue that generic sort-based access coalescing is too costly on GPUs.We propose a new coalescing technique and a new meta algorithm to improve locality and performance. Our optimization achieves up to 2.7 times speedup over the straightforward implementation. Interestingly, our optimization also works well on CPUs.
Comparing the best-performing algorithms on GPUs and CPUs, we find our new algorithm is the fastest on GPUs and the second fastest on CPUs, while the parallel Rem’s algorithm is the fastest on CPUs but does not perform well on GPUs due to path divergence.
Original language | English |
---|---|
Title of host publication | Euro-Par 2014 |
Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2014 International Workshops, Revised Selected Papers |
Editors | Luís Lopes |
Publisher | Springer Verlag |
Pages | 153-164 |
Number of pages | 12 |
Edition | Part 1 |
ISBN (Electronic) | 9783319143248 |
DOIs | |
State | Published - 2014 |
Externally published | Yes |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Number | Part 1 |
Volume | 8805 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Keywords
- CPU
- Connected Components
- GPU
- Multi-core