Abstract: A deep reinforcement learning (DRL) method based on AM-RPPO (attention mechanism-recurrent proximal policy optimization) is proposed and applied to the adaptive locomotion control of biped robots. Firstly, the joint-space walking control problem of biped robots in unknown environments is modeled as a partially observable Markov decision process (POMDP), and the bias that arises when DRL algorithms such as proximal policy optimization (PPO) estimate the true state from partial observations is illustrated. Next, the architecture of the recurrent neural network (RNN) is introduced, and its forward propagation of observations over time sequences, which differs from that of multi-layer perceptrons, is analyzed. The RNN is embedded in both the action generation network and the value function generation network, and its advantages over conventional feedforward networks are demonstrated. Thirdly, the attention mechanism (AM), widely used in many fields of deep learning, is introduced to aggregate the states at different time steps and to build a model that weights their contributions to the final value function. Finally, the effectiveness of the proposed AM-RPPO algorithm for the locomotion control of biped robots with high-dimensional states is verified through simulation experiments.
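For illustration only, the following minimal sketch shows one plausible realization of the actor-critic structure described above; it is not the authors' implementation. It assumes PyTorch, a GRU as the recurrent cell, a single linear layer as the attention scoring function, and arbitrary layer sizes and observation/action dimensions; the PPO update itself is omitted.

import torch
import torch.nn as nn

class AttentionRecurrentActorCritic(nn.Module):
    # Hypothetical AM-RPPO-style network: a GRU encodes the observation
    # history, the actor reads the latest hidden state, and the critic
    # pools all hidden states with learned attention weights.
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.actor = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # Gaussian policy spread
        self.attn = nn.Linear(hidden_dim, 1)    # one attention score per time step
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq):
        # obs_seq: (batch, T, obs_dim) history of partial observations
        h_seq, _ = self.gru(obs_seq)                      # (batch, T, hidden_dim)
        action_mean = self.actor(h_seq[:, -1])            # action from latest step
        weights = torch.softmax(self.attn(h_seq), dim=1)  # weights over time steps
        context = (weights * h_seq).sum(dim=1)            # attention-weighted summary
        value = self.critic(context).squeeze(-1)          # state-value estimate
        return action_mean, self.log_std.exp(), value

if __name__ == "__main__":
    net = AttentionRecurrentActorCritic(obs_dim=44, act_dim=17)  # dimensions are arbitrary
    obs = torch.randn(8, 16, 44)      # 8 rollouts, 16-step observation histories
    mean, std, value = net(obs)
    print(mean.shape, value.shape)    # torch.Size([8, 17]) torch.Size([8])

Under these assumptions, the action is produced from the most recent hidden state, while the value estimate pools the whole observation history through learned attention weights, matching the weighted value-function idea outlined in the abstract.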