SciLance: Mitigate Load Imbalance for Parallel Scientific Applications in Cloud Environments

Xinying Wang, Lipeng Wan, Scott Klasky, Dongfang Zhao, Feng Yan

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Elastic cloud computing provides new opportunities for accelerating the process of scientific discovery. However, unlike high-performance computing (HPC) systems, which are built and optimized for workloads with intensive inter-node communication demands, cloud platforms offer low-latency, high-bandwidth communication only on a few very expensive high-end instance types, which leads to poor cost-effectiveness. In addition, many scientific simulations mitigate load imbalance by re-balancing the workload through extra data movement among compute nodes, which further amplifies the communication pressure and makes it challenging to use cloud resources efficiently. To this end, we propose SciLance, which addresses the workload imbalance challenge by utilizing the heterogeneous and elastic resources offered by cloud platforms. In particular, instead of moving data excessively among compute instances to balance the workload, SciLance dynamically adjusts the compute instances used for running parallel tasks based on the runtime imbalance identified through profiling. We prototype SciLance and perform extensive evaluation using adaptive mesh refinement (AMR) based scientific applications. The evaluation results demonstrate that SciLance can achieve up to 36.63% better performance with 16.91% lower cost for AMR-based simulation codes.

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Cluster Computing, CLUSTER 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 49-59
Number of pages: 11
ISBN (Electronic): 9798350307924
State: Published - 2023
Event: 25th IEEE International Conference on Cluster Computing, CLUSTER 2023 - Santa Fe, United States
Duration: Oct 31 2023 to Nov 3 2023

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
ISSN (Print): 1552-5244

Conference

Conference: 25th IEEE International Conference on Cluster Computing, CLUSTER 2023
Country/Territory: United States
City: Santa Fe
Period: 10/31/23 to 11/3/23

Funding

This work was partially supported by the National Science Foundation under CAREER-2048044. This research was also supported by the ECP CODAR and Sirius-2 projects through the Advanced Scientific Computing Research (ASCR) program of the Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Funders (funder number)
National Science Foundation (CAREER-2048044)
U.S. Department of Energy
Office of Science (DE-AC05-00OR22725)
Advanced Scientific Computing Research

    Keywords

    • load balancing
    • parallel computing
    • resource management
