Abstract:
Following a brief overview of online deep reinforcement learning (DRL), this paper surveys exploratory policy generation methods in online single-agent DRL algorithms, addressing the exploration-exploitation dilemma during training and organizing the methods according to the relationship between the exploratory policy and the task policy. First, exploratory policy generation in the reward space and in the parameter space of the task policy is discussed. For exploration in the reward space, methods that add intrinsic rewards are classified, and research progress is analyzed in terms of their respective advantages and disadvantages. For exploration in the parameter space, representations of the individual fitness function in neuroevolution algorithms are examined, taking both task performance and diversity into account. Next, exploration methods that combine the traditional action space with the parameter space are analyzed. Subsequently, exploratory policy generation in the high-level task goal space and task-independent exploratory policy generation methods are briefly introduced. Finally, methods for handling safety constraints on exploratory policies are discussed, and the remaining challenges and future research directions for exploratory policies are outlined.