TY - GEN
T1 - Flexible Communication for Optimal Distributed Learning over Unpredictable Networks
AU - Tyagi, Sahil
AU - Swany, Martin
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR) achieves accuracy comparable to DenseSGD, but has lower parallel scaling (parallel efficiency) due to high communication cost. Using lower CRs improves parallel efficiency by reducing synchronization cost, but also degrades model accuracy (statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Top-k compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current setting, and model the Pareto relationship between parallel and statistical efficiency as a multiobjective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
AB - Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR) achieves accuracy comparable to DenseSGD, but has lower parallel scaling (parallel efficiency) due to high communication cost. Using lower CRs improves parallel efficiency by reducing synchronization cost, but also degrades model accuracy (statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Top-k compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current setting, and model the Pareto relationship between parallel and statistical efficiency as a multiobjective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
UR - https://www.scopus.com/pages/publications/85184979491
U2 - 10.1109/BigData59044.2023.10386724
DO - 10.1109/BigData59044.2023.10386724
M3 - Conference contribution
AN - SCOPUS:85184979491
T3 - Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023
SP - 925
EP - 935
BT - Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023
A2 - He, Jingrui
A2 - Palpanas, Themis
A2 - Hu, Xiaohua
A2 - Cuzzocrea, Alfredo
A2 - Dou, Dejing
A2 - Slezak, Dominik
A2 - Wang, Wei
A2 - Gruca, Aleksandra
A2 - Lin, Jerry Chun-Wei
A2 - Agrawal, Rakesh
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Big Data, BigData 2023
Y2 - 15 December 2023 through 18 December 2023
ER -