TY - GEN
T1 - RAPIDS
T2 - 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2023
AU - Wan, Lipeng
AU - Chen, Jieyang
AU - Liang, Xin
AU - Gainaru, Ana
AU - Gong, Qian
AU - Liu, Qing
AU - Whitney, Ben
AU - Arulraj, Joy
AU - Liu, Zhengchun
AU - Foster, Ian
AU - Klasky, Scott
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/8/7
Y1 - 2023/8/7
AB - In modern science, big data plays an increasingly important role. Many scientific applications, such as running simulations on supercomputers or conducting experiments on advanced instruments, produce huge amounts of data at unprecedented speed. Analyzing and understanding such big data is key for scientists to make scientific breakthroughs. However, data may become unavailable to scientists when storage system outages or maintenance occur, which severely hinders scientific discovery. To improve data availability, data duplication and erasure coding (EC) are often used. But as scientific data grows larger, these two methods can cause considerable storage and network overhead. In this paper, we propose RAPIDS, a hybrid approach that combines multigrid-based error-bounded lossy compression with erasure coding to significantly reduce the storage and network overhead required for maintaining high data availability. Our experiments show that, compared to the regular EC method, RAPIDS reduces the storage overhead by up to 7.5x and the network overhead by up to 3x while achieving the same level of availability. We improve RAPIDS by building two models to optimize the fault tolerance configurations and data gathering strategy. We demonstrate that RAPIDS significantly improves performance when running on many CPU cores in parallel or on GPUs.
KW - data availability
KW - scientific data management
UR - http://www.scopus.com/inward/record.url?scp=85169584306&partnerID=8YFLogxK
U2 - 10.1145/3588195.3592983
DO - 10.1145/3588195.3592983
M3 - Conference contribution
AN - SCOPUS:85169584306
T3 - HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
SP - 87
EP - 100
BT - HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
Y2 - 16 June 2023 through 23 June 2023
ER -