Abstract
Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity, but fidelity can also be affected by malformed training and validation data. Yet practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that the contrast-limited adaptive histogram equalization (CLAHE) enhancement applied to previously generated digital surface models, for which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection-over-union (IoU) scores remained consistently subpar, which suggested further problems with the training data and DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed for a more timely convergence on an ultimately viable solution.
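For readers unfamiliar with two of the quantities the abstract mentions, the following is a minimal illustrative sketch, not the authors' code. It assumes that "global normalization" means rescaling all tiles with a single shared minimum and maximum (in contrast to the locally adaptive behavior of CLAHE), and that intersection over union is computed on binary masks. The function names `global_normalize` and `iou` are hypothetical, and nested Python lists stand in for raster tiles.

```python
def global_normalize(tiles):
    """Rescale every tile to [0, 1] using one min and max shared by all tiles.

    Assumption: this is what "global normalization" refers to; the source
    does not spell out the exact formula.
    """
    values = [v for tile in tiles for row in tile for v in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    return [[[(v - lo) / span for v in row] for row in tile] for tile in tiles]


def iou(pred, target):
    """Intersection over union of two binary masks with the same shape."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here (an assumption): two empty masks agree perfectly.
    return inter / union if union else 1.0


# Two tiny 1x2 "tiles" with very different elevation ranges.
tiles = [[[0.0, 50.0]], [[100.0, 200.0]]]
normalized = global_normalize(tiles)

# A predicted error mask and a reference mask on a 2x3 grid.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = iou(pred, target)
```

Because the minimum and maximum are shared, relative elevation differences between tiles survive the rescaling, whereas a locally adaptive method redistributes values within each neighborhood.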
Original language | English |
---|---|
Article number | 8935167 |
Journal | IBM Journal of Research and Development |
Volume | 64 |
Issue number | 3-4 |
State | Published - May 1 2020 |
Funding
We would like to thank the OLCF for their generous allocation of Summit node-hours, which made this article possible, and J. Morrison and B. Smith of the OLCF for their extraordinary efforts to resolve Summit-related problems during our research. We would also like to thank Dr. J. Bassett and E. Scott of George Mason University (GMU) for their assistance in developing the supporting EA toolkit, Library for Evolutionary Algorithms in Python (LEAP). This work was supported in part by an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education, in part by the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility, under Contract DE-AC05-00OR22725, and in part by UT-Battelle, LLC under Contract DE-AC05-00OR22725 with the U.S. Department of Energy.