Abstract
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines were created by carefully considering how the hyperparameters that control model shape affect the efficiency of the underlying computation kernels executed on the GPU. We find that the throughput of models with "efficient" model shapes is up to 39% higher, while preserving accuracy, compared to models with a similar number of parameters but unoptimized shapes.
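The abstract does not list the guidelines themselves, so the following is only a minimal sketch of what checking a model shape might look like. It assumes, purely for illustration, that GPU kernels run more efficiently when the hidden size, per-head dimension, and vocabulary size are multiples of 64. The function names `round_up` and `check_shape` and the default `multiple=64` are hypothetical and are not taken from the paper.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def check_shape(hidden_size: int, num_heads: int, vocab_size: int,
                multiple: int = 64) -> dict:
    """Return {dimension name: suggested aligned value} for misaligned dimensions.

    The alignment target `multiple` is an assumption made for this sketch,
    not a value stated in the abstract.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")

    dims = {
        "hidden_size": hidden_size,
        "head_dim": hidden_size // num_heads,
        "vocab_size": vocab_size,
    }
    # Keep only the dimensions that are not already a multiple of the target.
    return {
        name: round_up(value, multiple)
        for name, value in dims.items()
        if value % multiple != 0
    }


if __name__ == "__main__":
    # Hypothetical configuration: hidden size is aligned, but the head
    # dimension (2560 / 32 = 80) and the vocabulary size are not.
    print(check_shape(hidden_size=2560, num_heads=32, vocab_size=50257))
```

Under these assumptions, a configuration whose dimensions are all aligned returns an empty dictionary. A suggested `head_dim` would in practice be reached by changing the number of heads or the hidden size, which also changes the parameter count, so any adjustment should be compared against a baseline of similar size.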
Original language | English |
---|---|
Title of host publication | 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings |
Publisher | Association for Computing Machinery |
Pages | 84-96 |
Number of pages | 13 |
ISBN (Electronic) | 9798400708428 |
State | Published - Aug 12 2024 |
Event | 53rd International Conference on Parallel Processing, ICPP 2024 - Gotland, Sweden; Duration: Aug 12 2024 → Aug 15 2024 |
Publication series
Name | ACM International Conference Proceeding Series |
---|---|
Conference
Conference | 53rd International Conference on Parallel Processing, ICPP 2024 |
---|---|
Country/Territory | Sweden |
City | Gotland |
Period | 08/12/24 → 08/15/24 |
Funding
This research is supported in part by NSF grants #1818253, #1854828, #2007991, #2018627, #2311830, #2312927, and XRAC grant #NCR-130002.