Abstract
This paper introduces Pandora, a parallel algorithm for computing dendrograms, the hierarchical cluster trees produced by single-linkage clustering (SLC). Current parallel approaches construct dendrograms by partitioning a minimum spanning tree and removing edges. However, they struggle with skewed, hard-to-parallelize real-world dendrograms. Consequently, computing dendrograms is the sequential bottleneck in HDBSCAN* [21], a popular SLC variant. Pandora addresses this limitation with recursive tree contraction: it contracts nodes to construct progressively smaller trees, computes the dendrogram of the smallest contracted tree, and expands it by reinserting contracted edges. This recursive strategy is highly parallel, skew-independent, work-optimal, and well suited to GPUs and multicores. We develop a performance-portable implementation of Pandora in Kokkos [31] and evaluate it on multicore CPUs and multi-vendor GPUs (e.g., NVIDIA, AMD) for dendrogram construction in HDBSCAN*. Multithreaded Pandora is 2.2x faster than the current best multithreaded implementation. Our GPU version achieves a 6-20x speedup on AMD GPUs and 10-37x on NVIDIA GPUs over multithreaded Pandora. Pandora thus removes HDBSCAN*'s sequential bottleneck, greatly boosting efficiency, particularly on GPUs.
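For context on the bottleneck the abstract describes, the following is a minimal sketch (not the paper's Pandora algorithm, and all names here are illustrative assumptions) of the conventional sequential way to build a single-linkage dendrogram from a minimum spanning tree: process MST edges in increasing weight order and merge clusters with a union-find structure, emitting one internal dendrogram node per merge. This edge-by-edge dependency chain is precisely what makes the step hard to parallelize, which Pandora's recursive tree contraction avoids.

```python
def dendrogram_from_mst(n, mst_edges):
    """Sequential single-linkage dendrogram from an MST (illustrative baseline).

    n: number of points; mst_edges: list of (weight, u, v) tuples.
    Returns a merge list of (weight, cluster_a, cluster_b, new_cluster_id),
    where ids 0..n-1 are leaves and ids n..2n-2 are internal merge nodes.
    """
    # Union-find over leaves plus the n-1 internal nodes created by merges.
    parent = list(range(2 * n - 1))

    def find(x):
        # Find the current cluster root, with path halving for efficiency.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    next_id = n
    # Edges must be processed in increasing weight order for single linkage.
    for w, u, v in sorted(mst_edges):
        ru, rv = find(u), find(v)
        parent[ru] = parent[rv] = next_id  # merge both clusters into a new node
        merges.append((w, ru, rv, next_id))
        next_id += 1
    return merges
```

The loop is inherently serial because each merge depends on the cluster structure left by all lighter edges; skewed (chain-like) dendrograms make naive edge-partitioning schemes degenerate, which is the motivation for the contraction-based approach.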
| Original language | English |
|---|---|
| Title of host publication | 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings |
| Publisher | Association for Computing Machinery |
| Pages | 908-918 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798400708428 |
| DOIs | |
| State | Published - Aug 12 2024 |
| Event | 53rd International Conference on Parallel Processing, ICPP 2024, Gotland, Sweden. Duration: Aug 12 2024 → Aug 15 2024 |
Publication series
| Name | ACM International Conference Proceeding Series |
|---|---|
Conference
| Conference | 53rd International Conference on Parallel Processing, ICPP 2024 |
|---|---|
| Country/Territory | Sweden |
| City | Gotland |
| Period | 08/12/24 → 08/15/24 |
Funding
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.