Parallel k-means clustering of geospatial data sets using manycore CPU architectures

Richard Tran Mills, Vamsi Sripathi, Jitendra Kumar, Sarat Sreepathi, Forrest Hoffman, William Hargrove

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of weather, climate, ecological, and other geoscientific data sets fused from disparate sources. Many of the standard tools used on individual workstations are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. Here, we describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatial and geospatiotemporal data, and discuss algorithmic modifications and code optimizations we have made to enable it to effectively use parallel machines based on novel CPU architectures (such as the Intel Knights Landing Xeon Phi and Skylake Xeon processors) with many cores and hardware threads, and employing significant single instruction, multiple data (SIMD) parallelism. We outline some applications of the code in ecology and climate science contexts and present a detailed discussion of the performance of the code for one such application, LiDAR-derived vertical vegetation structure classification.

Original language: English
Title of host publication: Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Editors: Hanghang Tong, Zhenhui Li, Feida Zhu, Jeffrey Yu
Publisher: IEEE Computer Society
Pages: 787-794
Number of pages: 8
ISBN (Electronic): 9781538692882
State: Published - Jul 2 2018
Event: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018 - Singapore, Singapore
Duration: Nov 17 2018 → Nov 20 2018

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 2018-November
ISSN (Print): 2375-9232
ISSN (Electronic): 2375-9259

Conference

Conference: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Country/Territory: Singapore
City: Singapore
Period: 11/17/18 → 11/20/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Funding

R. T. Mills was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. JK and FMH were partially supported by the Next Generation Ecosystem Experiments - Arctic (NGEE Arctic) project, which is sponsored by the Terrestrial Ecosystem Sciences (TES) Program, and the Reducing Uncertainties in Biogeochemical Interactions through Synthesis and Computation Scientific Focus Area (RUBISCO SFA), which is sponsored by the Regional and Global Model Analysis (RGMA) Program. The TES and RGMA Programs are activities of the Climate and Environmental Sciences Division (CESD) of the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science. WWH, JK, and FMH claim additional support from the Eastern Forest Environmental Threat Assessment Center (EFETAC) in the U.S. Department of Agriculture Forest Service. This manuscript has been authored by UChicago Argonne, LLC under Contract No. DE-AC02-06CH11357 with the U.S. Department of Energy. This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.

Funders
Eastern Forest Environmental Threat Assessment Center
U.S. Department of Agriculture Forest Service
U.S. Department of Energy Office of Science
UT-Battelle, LLC
National Nuclear Security Administration

Keywords

• clustering
• geospatial
• high performance computing
• k-means
• manycore
