Abstract
Inline deduplication dramatically improves storage space utilization. However, it degrades I/O throughput due to compute-intensive deduplication operations such as chunking, fingerprinting or hashing of chunk content, and redundant lookup I/Os over the network in the I/O path. In particular, the fingerprint or hash generation of content is computationally expensive and contributes largely to the degraded I/O throughput. In this article, we propose Crocus, a framework that enables compute resource orchestration to enhance cluster-wide deduplication performance. Crocus takes into account all compute resources, such as local and remote {CPU, GPU}, by managing decentralized compute pools. An opportunistic Load-Aware Fingerprint Scheduler (LAFS) distributes and offloads compute-intensive deduplication operations to compute pools in a load-aware fashion. Crocus is highly generic and can be adopted in both inline and offline deduplication with different storage tier configurations. We implemented Crocus in the Ceph scale-out storage system. Our extensive evaluation shows that Crocus reduces the fingerprinting overhead by 86 percent with 4KB chunk size compared to Ceph with baseline deduplication, while maintaining high disk-space savings. Our proposed LAFS scheduler, when tested in different internal and external contention scenarios, also showed a 54 percent improvement over a fixed or static scheduling approach.
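The idea of fingerprinting chunks and dispatching that work in a load-aware way can be illustrated with a minimal sketch. This is not the Crocus or LAFS implementation: `ComputePool`, `pick_pool`, and `dedup_write` are hypothetical names, fixed-size chunking and SHA-256 are assumed choices, and "load" is simply the number of bytes already assigned to a pool.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

CHUNK_SIZE = 4096  # 4KB, the chunk size quoted in the abstract


@dataclass
class ComputePool:
    """Hypothetical stand-in for a local or remote CPU/GPU pool."""
    name: str
    load: int = 0  # bytes assigned so far; a crude proxy for queue depth


def pick_pool(pools: List[ComputePool]) -> ComputePool:
    """Load-aware choice: return the pool with the least assigned work."""
    return min(pools, key=lambda p: p.load)


def dedup_write(data: bytes, pools: List[ComputePool],
                index: Dict[str, bytes]) -> List[str]:
    """Chunk the data, fingerprint each chunk, and store only unseen chunks.

    Returns the list of fingerprints needed to reconstruct the data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        pool = pick_pool(pools)
        pool.load += len(chunk)  # account for the work given to this pool
        # In a real system the hash would run on the chosen pool;
        # here it is computed inline to keep the sketch self-contained.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        index.setdefault(fingerprint, chunk)  # duplicates are not stored again
        recipe.append(fingerprint)
    return recipe
```

Writing three identical 4KB chunks followed by one different chunk through `dedup_write` would leave only two entries in `index`, while the assigned work is spread across the pools instead of being pinned to a single one, which is the contrast with a fixed or static assignment.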
Original language | English |
---|---|
Article number | 8993857 |
Pages (from-to) | 1740-1753 |
Number of pages | 14 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 31 |
Issue number | 8 |
State | Published - Aug 1 2020 |
Externally published | Yes |
Funding
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (Ministry of Science and ICT) under Grant NRF-2018R1A1A1A05079398.
Funders | Funder number |
---|---|
Ministry of Science, ICT and Future Planning | NRF-2018R1A1A1A05079398 |
National Research Foundation of Korea | |
Keywords
- Distributed file systems
- Scheduling
- Storage management