Abstract
Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud, and HPC. When training neural networks with stochastic gradient descent on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneity in distributed training. Inspired by proportional controllers, our approach balances computation across heterogeneous servers and adapts to varying resource availability. By dynamically adjusting worker mini-batches at runtime, OmniLearn reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.
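The abstract describes adjusting each worker's mini-batch size with a proportional controller so that faster servers process more samples per step and stragglers process fewer. The Python sketch below is only a rough illustration of that idea under assumptions of my own (the names `adjust_batches`, `GAIN`, and `GLOBAL_BATCH` are hypothetical and not OmniLearn's actual API or controller tuning): it rescales per-worker batches in proportion to how far each worker's measured step time deviates from the cluster mean, while keeping the global batch size fixed.

```python
# Hypothetical sketch of proportional batch scaling across heterogeneous workers.
# GAIN and GLOBAL_BATCH are illustrative choices, not values from the paper.

GAIN = 0.5            # proportional gain; damps oscillation between adjustments
GLOBAL_BATCH = 1024   # total mini-batch size kept constant across the cluster

def adjust_batches(batch_sizes, step_times):
    """Shift work from slow workers to fast ones in proportion to how far each
    worker's measured step time deviates from the cluster-mean step time."""
    mean_time = sum(step_times) / len(step_times)
    new_sizes = []
    for b, t in zip(batch_sizes, step_times):
        # error > 0 => worker is faster than average => assign it more samples
        error = (mean_time - t) / mean_time
        new_sizes.append(max(1, round(b * (1 + GAIN * error))))
    # renormalize so per-worker batches still sum to the fixed global batch
    scale = GLOBAL_BATCH / sum(new_sizes)
    return [max(1, round(s * scale)) for s in new_sizes]

# Example: four workers, the last one is a straggler with a slow step time.
batches = [256, 256, 256, 256]
times = [0.8, 0.9, 0.85, 1.6]          # seconds per training step
print(adjust_batches(batches, times))   # e.g. [285, 273, 279, 187]
```

In this sketch the straggler's batch shrinks and the faster workers' batches grow, so per-step compute times converge while the effective global batch (and hence the optimization behavior) stays roughly unchanged.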
| Original language | English |
|---|---|
| Pages (from-to) | 1253-1267 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Parallel and Distributed Systems |
| Volume | 36 |
| Issue number | 6 |
| DOIs | 10.1109/TPDS.2025.3553066 |
| State | Published - 2025 |
| Externally published | Yes |
Funding
Received 6 July 2024; revised 5 February 2025; accepted 9 March 2025. Date of publication 18 March 2025; date of current version 24 April 2025. This work was supported in part by the National Science Foundation (NSF) under Grant OAC-2112606. Recommended for acceptance by F. Zhang. (Corresponding author: Sahil Tyagi.) The authors are with the Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TPDS.2025.3553066
Keywords
- Distributed training
- deep learning
- heterogeneous systems
- synchronous and asynchronous communication