An End-to-End Mapless Navigation Method Based on TS-TD3 for Dynamic Environments
Abstract: To address the problems that arise when map-based mobile robot navigation frameworks are deployed in dynamic, complex environments, a mapless navigation method based on the time-series twin delayed deep deterministic policy gradient (TS-TD3) is proposed. First, the navigation task in dynamic scenes, where the environment is only partially observable, is formulated as a partially observable Markov decision process (POMDP). Second, historical information processed by long short-term memory (LSTM) components is introduced as the model input: a historical-information baseline is added to the deterministic policy gradient of the actor network to handle the state information hidden in the set of environmental observations, and an evaluation criterion that attends to the temporal correlation of navigation actions is introduced into the critic network. Third, an expert experience network guides the output of the actor network in the early stage of training to regularize the navigation actions. Finally, an end-to-end deep reinforcement learning (DRL) model under the actor-critic framework is established, which outputs control actions directly from sensor perception. In comparative simulation experiments against mainstream DRL methods, the proposed method produces natural, stable, and continuous motion trajectories, handles the intersection of multiple dynamic obstacles, and achieves the best overall navigation performance. In tests in a real dynamic environment, the model is deployed in an unknown environment without any adjustment, verifying its navigation performance and generalization.
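For intuition only, the following is a minimal sketch, not the authors' implementation, of the core idea the abstract describes: a TD3-style deterministic actor and a Q-value critic that both consume a sliding window of past observations through an LSTM so that the hidden state of the POMDP can be inferred from history. The class names, observation and action dimensions, and layer sizes are assumptions; the twin critics, target networks, delayed updates, and expert-experience guidance of the full method are omitted.

```python
# Minimal recurrent actor-critic sketch (illustrative only, not the paper's code).
# Assumes each observation step concatenates a laser scan with goal/velocity features,
# and the action is 2-D (linear and angular velocity) scaled to [-1, 1].
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """Deterministic policy conditioned on an observation history via an LSTM."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded actions in [-1, 1]
        )

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, obs_dim) -- the window of past observations
        feats, _ = self.lstm(obs_seq)
        return self.head(feats[:, -1])  # act from the most recent hidden state


class RecurrentCritic(nn.Module):
    """Q-network that scores an action given the same observation history."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_seq: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        feats, _ = self.lstm(obs_seq)
        return self.head(torch.cat([feats[:, -1], action], dim=-1))


if __name__ == "__main__":
    # Example: 4 histories, each 8 steps of a 24-beam scan plus 4 goal/velocity features.
    actor, critic = RecurrentActor(28, 2), RecurrentCritic(28, 2)
    obs_hist = torch.randn(4, 8, 28)
    act = actor(obs_hist)       # (4, 2) deterministic actions
    q = critic(obs_hist, act)   # (4, 1) action values
    print(act.shape, q.shape)
```

In a TD3-style training loop, two such critics would be maintained and the actor updated less frequently than the critics; the history window replaces the single-step state used in the fully observable setting.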