On Adam Trained Models and a Parallel Method to Improve the Generalization Performance

Guojing Cong, Luca Buratti

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

Adam is a popular stochastic optimizer that uses adaptive estimates of lower-order moments to update weights and requires little hyper-parameter tuning. Some recent studies have called the generalization and out-of-sample behavior of such adaptive gradient methods into question, and argued that such methods are of only marginal value. Notably, for many well-known image classification tasks such as CIFAR-10 and ImageNet-1K, the models with the best validation performance are still trained with SGD and a manual schedule of learning rate reduction. We analyze Adam- and SGD-trained models for 7 popular neural network architectures on image classification tasks using the CIFAR-10 dataset. Visualization shows that, for classification, Adam-trained models frequently 'focus' on areas of the images not occupied by the objects to be classified. Weight statistics reveal that Adam-trained models have larger weights and L2 norms than SGD-trained ones. Our experiments show that weight decay and a reduced initial learning rate improve the generalization performance of Adam, but a gap still remains between Adam- and SGD-trained models. To bridge the generalization gap, we adopt a K-step model averaging parallel algorithm with the Adam optimizer. With very sparse communication, the algorithm achieves high parallel efficiency. For the 7 models, the average improvement in validation accuracy over SGD is 0.72%, and the average parallel speedup is 2.5 with 6 GPUs.
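The abstract does not spell out the K-step averaging procedure, so the following is only a minimal sketch of the general idea under stated assumptions: each worker runs K local Adam steps on its own data shard, and the workers then average their weights once per round. The toy linear-regression problem, the sequential simulation of workers, the function names (`adam_step`, `grad`, `make_shards`, `k_step_average`), the hyper-parameter values, and the choice to keep Adam moments local to each worker are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update on a list of floats; returns new (w, m, v)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]  # bias-corrected first moment
    v_hat = [vi / (1 - b2 ** t) for vi in v]  # bias-corrected second moment
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


def grad(w, batch):
    """Mean-squared-error gradient for the toy model y = w[0] * x + w[1]."""
    g0 = g1 = 0.0
    for x, y in batch:
        err = w[0] * x + w[1] - y
        g0 += 2 * err * x
        g1 += 2 * err
    return [g0 / len(batch), g1 / len(batch)]


def make_shards(workers=4, n=64, seed=1):
    """Hypothetical data: one shard per worker, drawn from y = 3x - 1 + noise."""
    rng = random.Random(seed)
    return [
        [(x, 3.0 * x - 1.0 + rng.gauss(0, 0.01))
         for x in (rng.uniform(-1, 1) for _ in range(n))]
        for _ in range(workers)
    ]


def k_step_average(shards, rounds=100, k=8, batch_size=16, seed=0):
    """Simulate K-step model averaging: K local Adam steps, then one average."""
    rng = random.Random(seed)
    p = len(shards)
    w_global = [0.0, 0.0]
    # Assumption: Adam moments stay local to each worker; only weights are averaged.
    states = [([0.0, 0.0], [0.0, 0.0]) for _ in range(p)]
    t = 0
    for _ in range(rounds):
        local_weights = []
        for i, shard in enumerate(shards):
            w = list(w_global)  # every worker starts the round from the average
            m, v = states[i]
            for s in range(1, k + 1):
                batch = rng.sample(shard, batch_size)
                w, m, v = adam_step(w, grad(w, batch), m, v, t + s)
            states[i] = (m, v)
            local_weights.append(w)
        t += k
        # The only communication point: one weight average every K steps.
        w_global = [sum(ws) / p for ws in zip(*local_weights)]
    return w_global


if __name__ == "__main__":
    print(k_step_average(make_shards()))
```

In this sketch the workers exchange weights once per round rather than once per step, which is the sense in which communication is sparse. A real multi-GPU version would replace the inner loop over workers with concurrent processes and the final average with a collective operation.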

Original language: English
Title of host publication: Proceedings of MLHPC 2018
Subtitle of host publication: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 85-94
Number of pages: 10
ISBN (Electronic): 9781728101804
State: Published - Jul 2 2018
Externally published: Yes
Event: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 - Dallas, United States
Duration: Nov 12 2018 → …

Publication series

Name: Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Country/Territory: United States
City: Dallas
Period: 11/12/18 → …
