Abstract
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM calls. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has become the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses the low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that, despite these restrictions, gives the developer fine-grained control. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design outperforms the highly optimized vendor routine for matrix sizes up to 100 by factors between 1.2× and 10× on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8× and 26×.
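For context, the "low-level APIs provided by the vendor" mentioned above are exposed in CUDA as the warp-level `nvcuda::wmma` matrix intrinsics, which operate only on a few discrete fragment shapes (e.g., 16×16×16 for FP16). The sketch below is a minimal, self-contained illustration of that API, assuming one column-major 16×16 tile computed by a single warp; the kernel name and the fixed leading dimensions are illustrative assumptions, not the paper's optimized batched design.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative kernel (requires sm_70+): one warp computes C = A * B for a
// single 16x16x16 FP16 tile using the WMMA Tensor Core intrinsics.
// Column-major data with leading dimension 16 is assumed throughout.
__global__ void hgemm_tile_16x16x16(const half *A, const half *B, half *C) {
    // Per-warp register fragments required by the WMMA API; the shape
    // (16, 16, 16) is one of the few discrete configurations the hardware exposes.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    // Initialize the accumulator tile to zero.
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // Load the input tiles and multiply-accumulate on the Tensor Cores.
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the result tile back to global memory.
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_col_major);
}
```

A launch such as `hgemm_tile_16x16x16<<<1, 32>>>(dA, dB, dC)` (a single warp) exercises one Tensor Core tile. The paper's contribution lies in how many such fragment operations are scheduled across warps and thread blocks, and in handling batches of matrices whose dimensions do not match the discrete shapes the API supports.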
Original language | English |
---|---|
Title of host publication | Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 111-122 |
Number of pages | 12 |
ISBN (Electronic) | 9781728112466 |
State | Published - May 2019 |
Event | 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019, Rio de Janeiro, Brazil. Duration: May 20, 2019 → May 24, 2019 |
Publication series

Name | Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019 |
---|---|
Conference
Conference | 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 |
---|---|
Country/Territory | Brazil |
City | Rio de Janeiro |
Period | 05/20/19 → 05/24/19 |
Funding
This work is partially supported by NSF Grants No. OAC-1740250 and CSR-1514286, by NVIDIA, and by the Department of Energy under the Exascale Computing Project (17-SC-20-SC) and an LLNL subcontract under DOE contract DE-AC52-07NA27344.
Keywords
- Batched linear algebra
- FP16 arithmetic
- GPU computing
- Matrix multiplication