A bloom filter based scalable data integrity check tool for large-scale dataset

Sisi Xiong, Feiyi Wang, Qing Cao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we observe that the number of files curated under an individual project can reach as high as 200 million, and that project data sizes exceed petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is of paramount importance but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use, and/or unable to scale to meet users' demands.

To tackle this challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and work stealing to maximize parallelism, and it is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel Bloom-filter-based technique for aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of Bloom filters, we provided a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrated that our tool can efficiently handle both very large files and datasets consisting of many small files. Our preliminary tests showed that, on the same hardware, it outperforms conventional tools by as much as 4×. It also exhibited near-linear scaling when provisioned with more compute resources.
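The idea of aggregating per-file signatures through a Bloom filter so that the result does not depend on the order in which files are visited can be illustrated with a minimal Python sketch. This is not the fsum implementation: the `BloomFilter` class, the `file_signature` and `dataset_signature` helpers, the choice of SHA-1 and SHA-256, the double-hashing scheme, and the default filter size are all assumptions made for illustration. The last function computes the standard textbook approximation of a Bloom filter's false-positive rate, which is the kind of quantity an error and trade-off analysis has to weigh.

```python
import hashlib
import math


class BloomFilter:
    """A plain Bloom filter: an m-bit array and k derived hash positions."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Hypothetical choice: derive k positions by double hashing
        # two 64-bit halves taken from a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Setting bits is a bitwise OR, which is commutative, so the final
        # bit array does not depend on insertion order.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)


def file_signature(path, data):
    # Assumption: a file's signature binds its content to its path.
    return hashlib.sha1(path.encode("utf-8") + b"\0" + data).digest()


def dataset_signature(files, m_bits=1 << 16, k_hashes=4):
    """Fold (path, data) pairs into one signature, in any order."""
    bf = BloomFilter(m_bits, k_hashes)
    for path, data in files:
        bf.add(file_signature(path, data))
    # Hash the filter's bit array to obtain a single, compact signature.
    return hashlib.sha1(bytes(bf.bits)).hexdigest()


def false_positive_rate(m_bits, k_hashes, n_items):
    # Standard approximation: (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes
```

Because workers in a parallel tree walk finish in no fixed order, a plain hash over concatenated checksums would need a sort step first; in this sketch the filter absorbs signatures in whatever order they arrive. The cost is a small probability that a changed file maps only to bits that are already set, which `false_positive_rate` estimates for a given filter size, hash count, and number of items.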

Original language: English
Title of host publication: Proceedings of PDSW-DISCS 2016
Subtitle of host publication: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 55-60
Number of pages: 6
ISBN (Electronic): 9781509052165
State: Published - Jan 30 2017
Event: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016 - Salt Lake City, United States
Duration: Nov 14 2016 → …

Publication series

Name: Proceedings of PDSW-DISCS 2016: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016
Country/Territory: United States
City: Salt Lake City
Period: 11/14/16 → …

Funding

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
