TY - GEN
T1 - Scavenger
T2 - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023
AU - Tyagi, Sahil
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training, data-parallel training [1] launches multiple instances of the training process on separate partitions of the data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the 'right' cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training with respect to the cluster size and the training batch size. We build performance models to evaluate the Pareto relationship between the cost and time of DL training across different cluster and batch-size configurations, and develop Scavenger as a cloud service that searches for optimal cloud configurations in an online, black-box manner.
KW - data parallel training
KW - deep learning
KW - distributed training
KW - machine learning
KW - performance modeling
UR - https://www.scopus.com/pages/publications/85166735521
U2 - 10.1109/CCGridW59191.2023.00081
DO - 10.1109/CCGridW59191.2023.00081
M3 - Conference contribution
AN - SCOPUS:85166735521
T3 - Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023
SP - 349
EP - 350
BT - Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023
A2 - Simmhan, Yogesh
A2 - Altintas, Ilkay
A2 - Varbanescu, Ana-Lucia
A2 - Balaji, Pavan
A2 - Prasad, Abhinandan S.
A2 - Carnevale, Lorenzo
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 May 2023 through 4 May 2023
ER -