MA Lu, LIU Chengju, LIN Limin, XU Binchen, CHEN Qijun. AM-RPPO Based Control Method for Biped Adaptive Locomotion[J]. ROBOT, 2019, 41(6): 731-741. DOI: 10.13973/j.cnki.robot.180785

AM-RPPO Based Control Method for Biped Adaptive Locomotion


    Abstract: A deep reinforcement learning (DRL) method with attention mechanism and recurrent proximal policy optimization (AM-RPPO) is proposed and applied to the adaptive locomotion control of biped robots. Firstly, the walking control problem in joint space for biped robots in unknown environments is modeled as a partially observable Markov decision process (POMDP), and it is pointed out that the DRL algorithm proximal policy optimization (PPO) gives biased estimates of the true state. Secondly, the recurrent neural network (RNN) architecture is introduced, its forward propagation over temporally ordered observation states, which differs from that of multi-layer perceptrons, is analyzed to demonstrate its advantages over traditional neural networks, and the RNN is embedded into both the action generation network and the value function generation network. Thirdly, the attention mechanism (AM), widely used across many fields of deep learning, is introduced: based on the hidden states at different time steps, the AM builds a model that assigns differentiated weights to those states when computing the final value function. Finally, the effectiveness of the proposed AM-RPPO algorithm on biped robot control problems with high-dimensional state inputs is verified through simulation experiments.
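The critic architecture summarized above, an RNN whose per-step hidden states are pooled by attention weights into a single value estimate, can be illustrated with a toy NumPy forward pass. This is a minimal sketch under assumed shapes and weight names (none of which come from the paper), using a vanilla RNN cell and a simple dot-product attention scorer rather than the authors' exact networks:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

class RecurrentAttentionCritic:
    """Toy recurrent critic: an RNN processes a sequence of partial
    observations, and attention over the T hidden states yields the
    weighted context used for the scalar value estimate."""

    def __init__(self, obs_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.standard_normal((hid_dim, obs_dim)) * 0.1
        self.W_h = rng.standard_normal((hid_dim, hid_dim)) * 0.1
        self.w_att = rng.standard_normal(hid_dim) * 0.1   # attention scorer
        self.w_val = rng.standard_normal(hid_dim) * 0.1   # value head

    def value(self, obs_seq):
        """obs_seq: (T, obs_dim) observations; returns (V, attention weights)."""
        h = np.zeros(self.W_h.shape[0])
        hidden_states = []
        for o in obs_seq:                       # vanilla RNN forward pass
            h = np.tanh(self.W_in @ o + self.W_h @ h)
            hidden_states.append(h)
        H = np.stack(hidden_states)             # (T, hid_dim)
        alpha = softmax(H @ self.w_att)         # per-time-step weights, sum to 1
        context = alpha @ H                     # attention-weighted hidden state
        return float(self.w_val @ context), alpha
```

In this sketch the attention weights `alpha` play the role the abstract describes: hidden states from different time steps contribute unequally to the final value function, instead of only the last hidden state being used.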

     
