Abstract
General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms so that they can leverage mixed-precision computations for improved performance and energy efficiency. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at the fine-grained tile/block level. We utilize the PaRSEC runtime system to balance workloads across various architectures. The performance scales well on the ARM CPU-based Fugaku supercomputer, the NVIDIA GPU-based A100 DGX, and the AMD GPU-based Frontier supercomputer. This research aims to enhance computational efficiency and accuracy by bridging algorithmic advancements and hardware innovations, driving transformative progress in various applications.
| Original language | English |
|---|---|
| Title of host publication | Asynchronous Many-Task Systems and Applications - 3rd International Workshop, WAMTA 2025, Proceedings |
| Editors | Patrick Diehl, Qinglei Cao, Thomas Herault, George Bosilca |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 174-185 |
| Number of pages | 12 |
| ISBN (Print) | 9783031971952 |
| State | Published - 2026 |
| Event | 3rd International Workshop on Asynchronous Many-Task Systems and Applications, WAMTA 2025 - St. Louis, United States. Duration: Feb 19, 2025 → Feb 21, 2025 |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Volume | 15690 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 3rd International Workshop on Asynchronous Many-Task Systems and Applications, WAMTA 2025 |
|---|---|
| Country/Territory | United States |
| City | St. Louis |
| Period | 02/19/25 → 02/21/25 |
Funding
This research was supported by internal awards from Saint Louis University (Grant-0001651 and PROJ-000498) and the U.S. National Science Foundation (Award OAC-2451577). For computer time, this research used the Lonestar6 cluster at the Texas Advanced Computing Center, the compute node at the Innovative Computing Laboratory of the University of Tennessee, Knoxville, the Fugaku supercomputer at RIKEN, and the Frontier supercomputer at Oak Ridge National Laboratory.
Keywords
- General matrix multiply
- High-performance computing
- Mixed precision
- Task-based runtime