TY - GEN
T1 - The Case for Co-Designing Model Architectures with Hardware
AU - Anthony, Quentin
AU - Hatef, Jacob
AU - Narayanan, Deepak
AU - Biderman, Stella
AU - Bekman, Stas
AU - Yin, Junqi
AU - Shafi, Aamir
AU - Subramoni, Hari
AU - Panda, Dhabaleswar K.
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/8/12
AB - While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find that the throughput of models with "efficient" model shapes is up to 39% higher than that of models with a similar number of parameters but unoptimized shapes, while preserving accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85202446301&partnerID=8YFLogxK
DO - 10.1145/3673038.3673136
M3 - Conference contribution
AN - SCOPUS:85202446301
T3 - ACM International Conference Proceeding Series
SP - 84
EP - 96
BT - 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 53rd International Conference on Parallel Processing, ICPP 2024
Y2 - 12 August 2024 through 15 August 2024
ER -