NVIDIA tensor core programmability, performance & precision

Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

246 Scopus citations

Abstract

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
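The abstract states that the precision loss from half-precision inputs can be reduced at the cost of extra computation, but it does not say how. The sketch below illustrates one possible way, assumed here for illustration only: keep the rounding residuals of the inputs as additional half-precision matrices and add two correction products. It emulates mixed precision in pure Python with the standard `struct` module (half-precision inputs, single-precision accumulation) and does not run on Tensor Cores. All function names (`to_half`, `mixed_product`, `refined_product`, and so on) are hypothetical helpers, and the matrix size and value range are arbitrary choices.

```python
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def round_matrix(m, rounder):
    """Apply a rounding function to every element of a matrix."""
    return [[rounder(v) for v in row] for row in m]


def matmul(a, b, accumulate=to_single):
    """Triple-loop product; the running sum is rounded by `accumulate`."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc = accumulate(acc + a[i][p] * b[p][j])
            c[i][j] = acc
    return c


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def max_abs_error(c, ref):
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


def mixed_product(a, b):
    """Half-precision inputs, single-precision accumulation (emulated)."""
    return matmul(round_matrix(a, to_half), round_matrix(b, to_half))


def refined_product(a, b):
    """Assumed refinement: add products with the half-precision residuals.

    Uses three emulated mixed-precision products instead of one.
    """
    a_h, b_h = round_matrix(a, to_half), round_matrix(b, to_half)
    r_a = round_matrix(sub(a, a_h), to_half)  # what rounding a lost
    r_b = round_matrix(sub(b, b_h), to_half)  # what rounding b lost
    return add(add(matmul(a_h, b_h), matmul(r_a, b_h)), matmul(a_h, r_b))


random.seed(0)
n = 16  # arbitrary small size so the pure-Python loops stay fast
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

reference = matmul(a, b, accumulate=float)  # double-precision baseline
mixed_err = max_abs_error(mixed_product(a, b), reference)
refined_err = max_abs_error(refined_product(a, b), reference)
print(f"max error, mixed precision:      {mixed_err:.3e}")
print(f"max error, residual refinement:  {refined_err:.3e}")
```

The refined variant performs three matrix products rather than one, which is the kind of accuracy-for-computation trade-off the abstract describes. The numbers it prints come from an emulation on small random matrices and say nothing about the results reported in the paper.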

Original language: English
Title of host publication: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 522-531
Number of pages: 10
ISBN (Print): 9781538655559
State: Published - Aug 3 2018
Event: 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 - Vancouver, Canada
Duration: May 21 2018 to May 25 2018

Publication series

Name: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Conference

Conference: 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Country/Territory: Canada
City: Vancouver
Period: 05/21/18 to 05/25/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Funding

Funders                            Funder number
Horizon 2020 Framework Programme   671500

Keywords

• GEMM
• GPU Programming
• Mixed Precision
• NVIDIA Tensor Cores
