Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization

Yu Pei, George Bosilca, Ichitaro Yamazaki, Akihiro Ida, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

5 Scopus citations

Abstract

To minimize data movement, many parallel ap-plications statically distribute computational tasks among the processes. However, modern simulations often encounters ir-regular computational tasks whose computational loads change dynamically at runtime or are data dependent. As a result, load imbalance among the processes at each step of simulation is a natural situation that must be dealt with at the programming level. The de facto parallel programming approach, flat MPI (one process per core), is hardly suitable to manage the lack of balance, imposing significant idle time on the simulation as processes have to wait for the slowest process at each step of simulation. One critical application for many domains is the LU factor-ization of a large dense matrix stored in the Block Low-Rank (BLR) format. Using the low-rank format can significantly reduce the cost of factorization in many scientific applications, including the boundary element analysis of electrostatic field. However, the partitioning of the matrix based on underlying geometry leads to different sizes of the matrix blocks whose numerical ranks change at each step of factorization, leading to the load imbalance among the processes at each step of factorization. We use BLR LU factorization as a test case to study the programmability and performance of five different programming approaches: (1) flat MPI, (2) Adaptive MPI (Charm++), (3) MPI + OpenMP, (4) parameterized task graph (PTG), and (5) dynamic task discovery (DTD). The last two versions use a task-based paradigm to express the algorithm; we rely on the PaRSEC run-time system to execute the tasks. We first point out programming features needed to efficiently solve this category of problems, hinting at possible alternatives to the MPI+X programming paradigm. We then evaluate the programmability of the different approaches, detailing our experience implementing the algorithm using each of the models. Finally, we show the performance result on the Intel Haswell-based Bridges system at the Pittsburgh Supercomputing Center (PSC) and analyze the effectiveness of the implementations to address the load imbalance.

Original languageEnglish
Title of host publicationProceedings of PAW-ATM 2019
Subtitle of host publicationParallel Applications Workshop, Alternatives to MPI+X, Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages25-36
Number of pages12
ISBN (Electronic)9781728159799
DOIs
StatePublished - Nov 2019
Externally publishedYes
Event2019 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI+X, PAW-ATM 2019 - Denver, United States
Duration: Nov 17 2019 → …

Publication series

NameProceedings of PAW-ATM 2019: Parallel Applications Workshop, Alternatives to MPI+X, Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference2019 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI+X, PAW-ATM 2019
Country/TerritoryUnited States
CityDenver
Period11/17/19 → …

Funding

†Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy National Nuclear Security Administration under contract de-na0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This work was supported in part by Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures and High Performance Computing Infrastructure in Japan (Project ID: jh180012), JSPS KAKENHI Grant Numbers #17H01749 and #17K19962, the National Science Foundation under Grant No. 1740250, and the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, under prime contract #DE-AC05-00OR22725, and UT Battelle subaward #4000152412 and #4000151974.

FundersFunder number
U.S. Department of Energy Office of Science
National Science Foundation17-SC-20-SC, 1740250
Battelle4000152412, 4000151974
National Nuclear Security Administration-AC05-00OR22725, de-na0003525
Japan Society for the Promotion of Science17H01749, 17K19962

    Fingerprint

    Dive into the research topics of 'Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization'. Together they form a unique fingerprint.

    Cite this