Abstract
The objective of reinforcement learning (RL) is to find an optimal strategy for solving a dynamical control problem. Evolution strategy (ES) has been shown great promise in many challenging reinforcement learning (RL) tasks, where the underlying dynamical system is only accessible as a black box such that adjoint methods cannot be used. However, existing ES methods have two limitations that hinder its applicability in RL. First, most existing methods rely on Monte Carlo based gradient estimators to generate search directions. Due to low accuracy of Monte Carlo estimators, the RL training suffers from slow convergence and requires more iterations to reach the optimal solution. Second, the landscape of the reward function can be deceptive and may contain many local maxima, causing ES algorithms to prematurely converge and be unable to explore other parts of the parameter space with potentially greater rewards. In this work, we employ a Directional Gaussian Smoothing Evolutionary Strategy (DGS-ES) to accelerate RL training, which is well-suited to address these two challenges with its ability to (i) provide gradient estimates with high accuracy, and (ii) find nonlocal search direction which lays stress on large-scale variation of the reward function and disregards local fluctuation.
Original language | English |
---|---|
Pages (from-to) | 4119-4135 |
Number of pages | 17 |
Journal | Electronic Research Archive |
Volume | 29 |
Issue number | 6 |
DOIs | |
State | Published - Dec 2021 |
Funding
Acknowledgments. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract and award numbers ERKJ369, by the DOE SciDac FastMath project, and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC., for the U.S. Department of Energy under Contract DE-AC05-00OR22725. Key words and phrases. Reinforcement learning, Gaussian smoothing, non-convex optimization, stochastic control, high-dimensional optimization, Guass-Hermite quadrature. ∗ Corresponding author: Guannan Zhang ([email protected]). Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- Gaussian smoothing
- Guass-Hermite quadrature
- Reinforcement learning
- high-dimensional optimization
- non-convex optimization
- stochastic control