OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, OmniLearn reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.
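The batch-scaling idea in the abstract can be illustrated with a small sketch. This is not the OmniLearn implementation. It is a minimal proportional-control loop, written on the assumption that each worker reports its per-iteration compute time and that the global batch size is held fixed. The names `rebalance_batches` and `simulate`, the `gain` parameter, and the throughput figures are all hypothetical.

```python
def rebalance_batches(batch_sizes, step_times, gain=0.5, min_batch=1):
    """One proportional-control step (illustrative, not the paper's algorithm).

    Workers that finished faster than the mean step time get a larger
    mini-batch; slower workers get a smaller one. The global batch size is
    preserved as long as it is at least len(batch_sizes) * min_batch.
    """
    global_batch = sum(batch_sizes)
    target = sum(step_times) / len(step_times)

    # Proportional term: relative error between target and observed time.
    proposed = [
        max(float(min_batch), b * (1.0 + gain * (target - t) / target))
        for b, t in zip(batch_sizes, step_times)
    ]

    # Rescale so the proposals sum to the original global batch.
    scale = global_batch / sum(proposed)
    scaled = [p * scale for p in proposed]

    # Round down, then hand out leftover samples by largest remainder.
    new = [max(min_batch, int(s)) for s in scaled]
    leftover = global_batch - sum(new)
    order = sorted(range(len(new)),
                   key=lambda i: scaled[i] - int(scaled[i]), reverse=True)
    k = 0
    while leftover > 0:
        new[order[k % len(order)]] += 1
        leftover -= 1
        k += 1
    # If clamping to min_batch overshot, trim the largest batches.
    while leftover < 0:
        i = max(range(len(new)), key=lambda j: new[j])
        if new[i] <= min_batch:
            break
        new[i] -= 1
        leftover += 1
    return new


def simulate(speeds, batch_sizes, iterations=30, gain=0.5):
    """Toy model: step time = batch size / throughput (samples per second)."""
    batches = list(batch_sizes)
    for _ in range(iterations):
        times = [b / s for b, s in zip(batches, speeds)]
        batches = rebalance_batches(batches, times, gain=gain)
    return batches


if __name__ == "__main__":
    speeds = [100.0, 50.0, 25.0]       # hypothetical worker throughputs
    final = simulate(speeds, [64, 64, 64])
    print("batches:", final)
    print("step times:", [b / s for b, s in zip(final, speeds)])
```

In this toy model the per-worker step times move toward a common value, which is the straggler-reduction effect the abstract describes. A real system would also have to account for noisy timing measurements and for how changing the per-worker batch size affects gradient weighting.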

Original language: English
Pages (from-to): 1253-1267
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 36
Issue number: 6
DOI: 10.1109/TPDS.2025.3553066
State: Published - 2025
Externally published: Yes

Funding

Received 6 July 2024; revised 5 February 2025; accepted 9 March 2025. Date of publication 18 March 2025; date of current version 24 April 2025. This work was supported in part by the National Science Foundation (NSF) under Grant OAC-2112606. Recommended for acceptance by F. Zhang. (Corresponding author: Sahil Tyagi.) The authors are with the Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TPDS.2025.3553066

Keywords

  • Distributed training
  • deep learning
  • heterogeneous systems
  • synchronous and asynchronous communication
