TY - GEN
T1 - Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers
AU - Wang, Meng
AU - Mao, Jiajun
AU - Rana, Rajdeep
AU - Bent, John
AU - Olmez, Serkay
AU - George, Anjus
AU - Ransom, Garrett Wilson
AU - Li, Jun
AU - Gunawi, Haryadi S.
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023
Y1 - 2023
AB - Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic over both SLEC and LRC.
KW - Data Centers
KW - Data Protection
KW - Erasure Coding
KW - HPC Storage
KW - Reliability
KW - Scalable Storage
KW - System-Design Tradeoffs
UR - http://www.scopus.com/inward/record.url?scp=85190477752&partnerID=8YFLogxK
U2 - 10.1145/3581784.3607072
DO - 10.1145/3581784.3607072
M3 - Conference contribution
AN - SCOPUS:85190477752
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - SC 2023 - International Conference for High Performance Computing, Networking, Storage and Analysis
PB - IEEE Computer Society
T2 - 2023 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023
Y2 - 12 November 2023 through 17 November 2023
ER -