Performance evaluation of python based data analytics frameworks in summit: Early experiences

  • Benjamín Hernández
  • Suhas Somnath
  • Junqi Yin
  • Hao Lu
  • Joe Eaton
  • Peter Entschev
  • John Kirkham
  • Zahra Ronaghi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

The explosion in the volumes of data generated from ever-larger simulation campaigns and experiments or observations necessitates competent tools for data wrangling and analysis. While the Oak Ridge Leadership Computing Facility (OLCF) provides a variety of tools to perform data wrangling and data analysis tasks, Python-based tools often lack scalability or the ability to fully exploit the computational capability of OLCF’s Summit supercomputer. NVIDIA RAPIDS and Dask offer a promising solution to accelerate and distribute data analytics workloads from personal computers to heterogeneous supercomputing systems. We discuss early performance evaluation results of RAPIDS and Dask on Summit to understand their capabilities, scalability, and limitations. Our evaluation includes a subset of RAPIDS libraries, i.e., cuDF, cuML, and cuGraph, and Chainer’s CuPy, and their multi-GPU variants when available. We also draw on the observed trends from the performance evaluation results to discuss best practices for maximizing performance.

Original language: English
Title of host publication: Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Revised Selected Papers
Editors: Jeffrey Nichols, Arthur ‘Barney’ Maccabe, Suzanne Parete-Koon, Becky Verastegui, Oscar Hernandez, Theresa Ahearn
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 366-380
Number of pages: 15
ISBN (Print): 9783030633929
DOIs
State: Published - 2021
Event: 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020 - Virtual, Online
Duration: Aug 26 2020 to Aug 28 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1315 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020
City: Virtual, Online
Period: 08/26/20 to 08/28/20

Funding

B. Hernández et al.: Contributed equally. This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). Acknowledgments. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • Data analytics
  • GPU
  • Multi-threaded
  • Performance evaluation
  • Python
