Abstract
Elastic cloud computing provides new opportunities for accelerating scientific discovery. However, unlike high-performance computing (HPC) systems, which are built and optimized for workloads with intensive inter-node communication, the cloud offers low-latency, high-bandwidth communication only on a few very expensive high-end instance types, which leads to poor cost-effectiveness. In addition, many scientific simulations mitigate load imbalance by re-balancing the workload through extra data movement among compute nodes, which further amplifies the communication pressure and makes it challenging to use cloud resources efficiently. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. In particular, instead of moving data excessively among compute instances to balance the workload, SciLance dynamically adjusts the compute instances used for running parallel tasks based on the runtime imbalance identified through profiling. We prototype SciLance and perform an extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for AMR-based simulation codes.
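The general idea of matching instances to the workload, rather than moving data to even it out, can be illustrated with a small sketch. This is not SciLance's actual algorithm, which the abstract does not specify. It is a minimal illustration that assumes a hypothetical catalog of instance types (`small`, `medium`, `large`) with made-up relative speeds and prices, and made-up profiled per-task work. It measures imbalance as the ratio of the maximum to the mean task time, then gives each task the cheapest instance type that meets a target time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str     # hypothetical label, not a real cloud instance name
    speed: float  # assumed relative throughput (1.0 = baseline)
    price: float  # assumed cost per hour, arbitrary units


def imbalance(times):
    """Load-imbalance factor: max over mean task time (1.0 means balanced)."""
    return max(times) / (sum(times) / len(times))


def choose_instances(work, catalog, target):
    """Pick, per task, the cheapest instance type finishing within `target`.

    Falls back to the fastest type when no type meets the target.
    """
    fastest = max(catalog, key=lambda t: t.speed)
    plan = []
    for w in work:
        feasible = [t for t in catalog if w / t.speed <= target]
        plan.append(min(feasible, key=lambda t: t.price) if feasible else fastest)
    return plan


# Hypothetical catalog and profiled work (baseline seconds per step).
catalog = [
    InstanceType("small", 1.0, 1.0),
    InstanceType("medium", 2.0, 2.5),
    InstanceType("large", 4.0, 6.0),
]
work = [4.0, 1.0, 1.0, 2.0]

before = imbalance(work)          # every task on the baseline type
target = sum(work) / len(work)    # aim for the mean baseline time
plan = choose_instances(work, catalog, target)
after_times = [w / t.speed for w, t in zip(work, plan)]
after = imbalance(after_times)

print([t.name for t in plan], before, after)
```

In this toy setting only the heaviest task is moved to a faster instance type, so the slowest task time halves without any data being redistributed among tasks. A real system would also need to account for instance start-up time, communication, and pricing granularity, none of which are modeled here.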
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE International Conference on Cluster Computing, CLUSTER 2023 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 49-59 |
Number of pages | 11 |
ISBN (Electronic) | 9798350307924 |
State | Published - 2023 |
Event | 25th IEEE International Conference on Cluster Computing, CLUSTER 2023 - Santa Fe, United States; Duration: Oct 31 2023 → Nov 3 2023 |
Publication series
Name | Proceedings - IEEE International Conference on Cluster Computing, ICCC |
---|---|
ISSN (Print) | 1552-5244 |
Conference
Conference | 25th IEEE International Conference on Cluster Computing, CLUSTER 2023 |
---|---|
Country/Territory | United States |
City | Santa Fe |
Period | 10/31/23 → 11/3/23 |
Funding
This work was partially supported by the National Science Foundation under CAREER-2048044. This research was also supported by the ECP CODAR and Sirius-2 projects through the Advanced Scientific Computing Research (ASCR) program of the Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Keywords
- load balancing
- parallel computing
- resource management