Abstract
Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.
Original language | English |
---|---|
Title of host publication | SC 2023 - International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | IEEE Computer Society |
ISBN (Electronic) | 9798400701092 |
State | Published - 2023 |
Event | 2023 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023 - Denver, United States. Duration: Nov 12 2023 → Nov 17 2023 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2023 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/12/23 → 11/17/23 |
Funding
We thank the anonymous reviewers for their tremendous feedback and comments. We would also like to thank Gary Grider from Los Alamos National Lab (LANL) for his helpful discussions on real-world deployment considerations of MLEC. This material was supported by funding from NSF grant No. CCF-2119184, funding from Oak Ridge National Laboratory (ORNL) under Contract No. DE-AC05-00OR22725 with the Office of Science of the U.S. Department of Energy, as well as generous donations from Seagate. The experiments in this paper were performed on the Chameleon testbed [49, 50]. Any opinions, findings, conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NSF or other institutions.
Keywords
- Data Centers
- Data Protection
- Erasure Coding
- HPC Storage
- Reliability
- Scalable Storage
- System-Design Tradeoffs