TY - GEN
T1 - SciLance
T2 - 25th IEEE International Conference on Cluster Computing, CLUSTER 2023
AU - Wang, Xinying
AU - Wan, Lipeng
AU - Klasky, Scott
AU - Zhao, Dongfang
AU - Yan, Feng
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Elastic cloud computing provides new opportunities for accelerating the process of scientific discovery. However, unlike high-performance computing (HPC) systems, which are built and optimized for workloads with intensive inter-node communication demands, cloud platforms offer low-latency, high-bandwidth communication only on a few very expensive high-end instance types, which leads to poor cost-effectiveness. In addition, re-balancing the workload through extra data movement among compute nodes is a common way to mitigate load imbalance in many scientific simulations, which further amplifies the communication pressure and makes it challenging to use cloud resources efficiently. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. In particular, instead of moving data excessively among compute instances to balance the workload, SciLance dynamically adjusts the compute instances used for running parallel tasks based on the runtime imbalance identified through profiling. We prototype SciLance and perform an extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for AMR-based simulation codes.
AB - Elastic cloud computing provides new opportunities for accelerating the process of scientific discovery. However, unlike high-performance computing (HPC) systems, which are built and optimized for workloads with intensive inter-node communication demands, cloud platforms offer low-latency, high-bandwidth communication only on a few very expensive high-end instance types, which leads to poor cost-effectiveness. In addition, re-balancing the workload through extra data movement among compute nodes is a common way to mitigate load imbalance in many scientific simulations, which further amplifies the communication pressure and makes it challenging to use cloud resources efficiently. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. In particular, instead of moving data excessively among compute instances to balance the workload, SciLance dynamically adjusts the compute instances used for running parallel tasks based on the runtime imbalance identified through profiling. We prototype SciLance and perform an extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for AMR-based simulation codes.
KW - load balancing
KW - parallel computing
KW - resource management
UR - http://www.scopus.com/inward/record.url?scp=85179520692&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER52292.2023.00012
DO - 10.1109/CLUSTER52292.2023.00012
M3 - Conference contribution
AN - SCOPUS:85179520692
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 49
EP - 59
BT - Proceedings - 2023 IEEE International Conference on Cluster Computing, CLUSTER 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 31 October 2023 through 3 November 2023
ER -