Abstract
Large scale HPC applications are becoming increasingly data intensive. At Oak Ridge Leadership Computing Facility (OLCF), we are observing the number of files curated under individual project are reaching as high as 200 millions and project data size is exceeding petabytes. These simulation datasets, once validated, often needs to be transferred to archival system for long term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is paramount important but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use and/or can't scale to meet user's demand.To tackle this particular challenge, this paper presents the design, implementation and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principle of parallel tree walk and work-stealing pattern to maximize parallelism and is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel bloom-filter based technique in aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of bloom filter, we provided a detailed error and trade-off analysis. Using multiple datasets from production environment, we demonstrated that our tool can efficiently handle both very large files as well as many small-file based datasets. Our preliminary test showed that on the same hardware, it outperforms conventional tool by as much as 4×. It also exhibited near-linear scaling properties when provisioned with more compute resources.
Original language | English |
---|---|
Title of host publication | Proceedings of PDSW-DISCS 2016 |
Subtitle of host publication | 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 55-60 |
Number of pages | 6 |
ISBN (Electronic) | 9781509052165 |
DOIs | |
State | Published - Jan 30 2017 |
Event | 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016 - Salt Lake City, United States Duration: Nov 14 2016 → … |
Publication series
Name | Proceedings of PDSW-DISCS 2016: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016 |
---|---|
Country/Territory | United States |
City | Salt Lake City |
Period | 11/14/16 → … |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.