A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization

Qinglei Cao, Rabab Alomairy, Yu Pei, George Bosilca, Hatem Ltaief, David Keyes, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

10 Scopus citations

Abstract

We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists in capturing the most significant information in each tile of the matrix using a threshold which satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart: hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph discovered to be no longer required after compression and (2) a rank-aware diamond-shaped data distribution to mitigate the load imbalance overheads, reduce data movement, and conserve memory foot-print. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance results on two different high-performance supercomputers and compare against existing state-of-the-art implementation. Our implementation shows up to 7-fold on Shaheen II and 9-fold on Fugaku performance superiority in situations where the 3D unstructured mesh deformation application renders a matrix operator with low density. Our software framework solves a formally dense 3D problem with 52M mesh points on 65K cores in about half an hour. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system, by synergistically bridging matrix algebra libraries with scientific applications.

Original languageEnglish
Title of host publicationProceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages414-424
Number of pages11
ISBN (Electronic)9781665481069
DOIs
StatePublished - 2022
Event36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration: May 30 2022Jun 3 2022

Publication series

NameProceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Conference

Conference36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Country/TerritoryFrance
CityVirtual, Online
Period05/30/2206/3/22

Funding

This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. For computer time, this research used the supercomputer Shaheen II at King Abdullah University of Science & Technology (KAUST) in Thuwal Saudi Arabia and the supercomputer Fugaku provided by RIKEN through the HPCI System Research Project (Project ID: hp200310 and ra010009).

FundersFunder number
Office of Science
National Nuclear Security Administration
King Abdullah University of Science and Technology
RIKENhp200310, ra010009

    Keywords

    • Dynamic runtime system
    • HPC
    • Low-rank approximations
    • Mesh deformations
    • Task-based programming model

    Fingerprint

    Dive into the research topics of 'A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization'. Together they form a unique fingerprint.

    Cite this