Abstract
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute capability. In addition to the standard single and double precision types, the standard also covers half and quadruple precision. Half precision in particular is used in many very large-scale applications, such as those associated with machine learning.
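The group-based batching described in the abstract can be pictured with a small reference sketch. The routine name and argument names below are illustrative assumptions, not the exact interface defined by the standard; they only show how a single call might walk over several groups of equally sized GEMM problems.

```c
/*
 * Illustrative sketch of group-based batching (hypothetical names, not the
 * exact Batched BLAS interface).  Each group holds matrices of one uniform
 * size; a single call processes all groups and all problems.  A tuned
 * implementation would dispatch the whole batch to the hardware at once
 * rather than using this reference loop.
 */
#include <stddef.h>

/* Reference semantics of a batched DGEMM restricted to C = alpha*A*B + beta*C
   with column-major storage and no transposition, for clarity. */
static void dgemm_batch_sketch(
    size_t group_count,
    const size_t *group_sizes,                /* problems in each group      */
    const int *m, const int *n, const int *k, /* one size triple per group   */
    const double *alpha, const double *beta,  /* one scaling pair per group  */
    const double *const *A, const int *lda,   /* flat arrays of matrix ptrs  */
    const double *const *B, const int *ldb,
    double *const *C, const int *ldc)
{
    size_t p = 0;                             /* flat problem index          */
    for (size_t g = 0; g < group_count; ++g) {
        for (size_t s = 0; s < group_sizes[g]; ++s, ++p) {
            /* Plain triple loop standing in for one small GEMM. */
            for (int j = 0; j < n[g]; ++j)
                for (int i = 0; i < m[g]; ++i) {
                    double sum = 0.0;
                    for (int l = 0; l < k[g]; ++l)
                        sum += A[p][i + (size_t)l * lda[g]]
                             * B[p][l + (size_t)j * ldb[g]];
                    C[p][i + (size_t)j * ldc[g]] =
                        alpha[g] * sum + beta[g] * C[p][i + (size_t)j * ldc[g]];
                }
        }
    }
}
```

With `group_count` equal to one, this reduces to a batch of identically sized problems, matching the fixed-size case mentioned in the abstract.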
Original language | English |
---|---|
Article number | 21 |
Journal | ACM Transactions on Mathematical Software |
Volume | 47 |
Issue number | 3 |
DOIs | https://doi.org/10.1145/3431921 |
State | Published - Jun 2021 |
Externally published | Yes |
Funding
This material is based upon work supported in part by the National Science Foundation under Grants No. OAC 1740250, CSR 1514286, and OAC 2004850; by NVIDIA; by the Department of Energy; and in part by the Russian Science Foundation, Agreement N14-11-00190. This project was also funded in part by the European Union's Horizon 2020 research and innovation programme under NLAFET grant agreement No. 671633.
Funders | Funder number |
---|---|
National Science Foundation | CSR 1514286, OAC 2004850, OAC 1740250 |
U.S. Department of Energy | |
NVIDIA | |
Horizon 2020 Framework Programme | 671633 |
Russian Science Foundation | N14-11-00190 |
Keywords
- BLAS
- batched BLAS