Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC

Thomas Herault, Yves Robert, George Bosilca, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Scopus citations

Abstract

This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the C matrix can entirely fit in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node. The algorithm is written with the PaRSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance.

Original language: English
Title of host publication: Proceedings of ScalA 2019
Subtitle of host publication: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 33-41
Number of pages: 9
ISBN (Electronic): 9781728159898
DOIs
State: Published - Nov 2019
Externally published: Yes
Event: 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 - Denver, United States
Duration: Nov 18 2019 → …

Publication series

Name: Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019
Country/Territory: United States
City: Denver
Period: 11/18/19 → …

Funding

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. It used resources of the Oak Ridge Leadership Computing Facility at ORNL, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Funder: Funder number
National Science Foundation: 1450300
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science
National Nuclear Security Administration

Keywords

• Accelerator architectures
• Linear Algebra
• Run-time environment
