TY - GEN
T1 - Partial data permutation for training deep neural networks
AU - Cong, Guojing
AU - Zhang, Li
AU - Yang, Chih Chieh
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - Random data permutation is considered a best practice for training deep neural networks. When the input is large, permuting the full dataset is costly and limits scaling on distributed systems. Some practitioners use partial or no permutation, which may result in poor convergence. We propose a partitioned data permutation scheme as a low-cost alternative to full data permutation. Analyzing their entropy, we show that the two sampling schemes are asymptotically identical. We also show that, with minibatch SGD, both sampling schemes produce unbiased estimators of the true gradient. In addition, they have the same bound on the second moment of the gradient. Thus they have similar convergence properties. Our experiments confirm that, in practice, SGD has similar training performance with both sampling schemes. We further show that, due to inherent randomness in training such as data augmentation and dropout, even faster sampling schemes than partial permutation, such as sequential sampling, can achieve good performance. However, if no extra randomness is present in training, sampling schemes with low entropy can indeed degrade performance significantly.
AB - Random data permutation is considered a best practice for training deep neural networks. When the input is large, permuting the full dataset is costly and limits scaling on distributed systems. Some practitioners use partial or no permutation, which may result in poor convergence. We propose a partitioned data permutation scheme as a low-cost alternative to full data permutation. Analyzing their entropy, we show that the two sampling schemes are asymptotically identical. We also show that, with minibatch SGD, both sampling schemes produce unbiased estimators of the true gradient. In addition, they have the same bound on the second moment of the gradient. Thus they have similar convergence properties. Our experiments confirm that, in practice, SGD has similar training performance with both sampling schemes. We further show that, due to inherent randomness in training such as data augmentation and dropout, even faster sampling schemes than partial permutation, such as sequential sampling, can achieve good performance. However, if no extra randomness is present in training, sampling schemes with low entropy can indeed degrade performance significantly.
UR - http://www.scopus.com/inward/record.url?scp=85089069240&partnerID=8YFLogxK
U2 - 10.1109/CCGrid49817.2020.00-17
DO - 10.1109/CCGrid49817.2020.00-17
M3 - Conference contribution
AN - SCOPUS:85089069240
T3 - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
SP - 728
EP - 735
BT - Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
A2 - Lefevre, Laurent
A2 - Varela, Carlos A.
A2 - Pallis, George
A2 - Toosi, Adel N.
A2 - Rana, Omer
A2 - Buyya, Rajkumar
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020
Y2 - 11 May 2020 through 14 May 2020
ER -
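
A minimal sketch of the two sampling schemes contrasted in the abstract, full data permutation versus partitioned (partial) permutation, is given below. The function names and the use of NumPy are illustrative assumptions, not taken from the paper; in particular, whether the paper also reshuffles partition assignments between epochs is not specified here.

```python
import numpy as np

def full_permutation(n, rng):
    """Full data permutation: shuffle all n sample indices each epoch."""
    return rng.permutation(n)

def partitioned_permutation(n, num_partitions, rng):
    """Partitioned permutation (sketch): split the index range into
    contiguous partitions (e.g. one per distributed worker) and shuffle
    only within each partition, so the shuffle stays local and cheap.
    Assumption: partition boundaries stay fixed across epochs.
    """
    parts = np.array_split(np.arange(n), num_partitions)
    return np.concatenate([rng.permutation(p) for p in parts])

rng = np.random.default_rng(0)
print(full_permutation(10, rng))           # e.g. a shuffle over all 10 indices
print(partitioned_permutation(10, 2, rng)) # indices 0-4 and 5-9 shuffled separately
```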