TY - GEN
T1 - On Adam Trained Models and a Parallel Method to Improve the Generalization Performance
AU - Cong, Guojing
AU - Buratti, Luca
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Adam is a popular stochastic optimizer that uses adaptive estimates of lower-order moments to update weights and requires little hyper-parameter tuning. Some recent studies have called the generalization and out-of-sample behavior of such adaptive gradient methods into question and argued that they are of only marginal value. Notably, for many well-known image classification tasks such as CIFAR-10 and ImageNet-1K, the models with the best validation performance are still trained with SGD under a manually scheduled learning-rate reduction. We analyze Adam- and SGD-trained models for 7 popular neural network architectures on the CIFAR-10 image classification dataset. Visualization shows that Adam-trained models frequently 'focus' on areas of the images not occupied by the objects to be classified. Weight statistics reveal that Adam-trained models have larger weights and L2 norms than SGD-trained ones. Our experiments show that weight decay and reduced initial learning rates improve the generalization performance of Adam, but a gap between Adam- and SGD-trained models remains. To bridge this generalization gap, we adopt a K-step model-averaging parallel algorithm with the Adam optimizer. With very sparse communication, the algorithm achieves high parallel efficiency. For the 7 models, the average improvement in validation accuracy over SGD is 0.72%, and the average parallel speedup is 2.5 with 6 GPUs.
UR - http://www.scopus.com/inward/record.url?scp=85063039699&partnerID=8YFLogxK
U2 - 10.1109/MLHPC.2018.8638641
DO - 10.1109/MLHPC.2018.8638641
M3 - Conference contribution
AN - SCOPUS:85063039699
T3 - Proceedings of MLHPC 2018: Machine Learning in HPC Environments, held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 85
EP - 94
BT - Proceedings of MLHPC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Y2 - 12 November 2018
ER -