DUO Nanxun, LÜ Qiang, LIN Huican, WEI Heng. Step into High-Dimensional and Continuous Action Space: A Survey on Applications of Deep Reinforcement Learning to Robotics. ROBOT, 2019, 41(2): 276-288. DOI: 10.13973/j.cnki.robot.180336.
Abstract: Firstly, the emergence and development of deep reinforcement learning (DRL) are reviewed. Secondly, DRL algorithms for high-dimensional and continuous action spaces are classified into value-function-approximation-based algorithms, policy-approximation-based algorithms, and algorithms based on other structures. Then, typical DRL algorithms are introduced, with emphasis on their underlying ideas, advantages, and disadvantages. Finally, future trends in applying DRL to robotics are forecast according to the development directions of DRL algorithms.
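As a concrete illustration of the value-function-approximation family mentioned in the abstract (the DQN line of work, refs. [11]-[12]), the minimal Python sketch below trains a linear Q-function toward the DQN-style TD target r + gamma * max_a' Q_target(s', a'), using a periodically synced target network on a toy chain MDP. The environment, one-hot features, and hyperparameters are illustrative assumptions and are not taken from the survey.

# Minimal value-function-approximation sketch (illustrative; not from the survey).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 4, 2, 0.9, 0.1

# Linear approximator: Q(s, .) = W @ phi(s), with one-hot state features phi(s).
W = rng.normal(scale=0.1, size=(n_actions, n_states))   # online network
W_target = W.copy()                                      # frozen target network

def phi(s):
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

def step(s, a):
    # Toy chain MDP: action 1 moves right (reward 1 whenever the agent ends in the
    # last state), action 0 resets to the first state.
    s2 = min(s + 1, n_states - 1) if a == 1 else 0
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

s = 0
for t in range(5000):
    # epsilon-greedy behaviour policy
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(W @ phi(s)))
    s2, r = step(s, a)
    # DQN-style TD target computed with the frozen target network
    target = r + gamma * np.max(W_target @ phi(s2))
    td_error = target - (W @ phi(s))[a]
    W[a] += lr * td_error * phi(s)        # semi-gradient Q-learning update
    if t % 200 == 0:
        W_target = W.copy()               # periodic target-network synchronization
    s = s2

print(np.round(W, 2))                     # learned Q(s, a) values (columns = states)

Replacing the linear map with a deep network and adding experience replay recovers the DQN setup of [12]; the policy-approximation family surveyed alongside it instead parameterizes the policy directly and updates it with policy gradients, e.g. TRPO [46], DPG [47], DDPG [49], and PPO [51].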
[1] Caloud P, Choi W, Latombe J C, et al. Indoor automation with many mobile robots[C]//IEEE International Workshop on Intelligent Robots and Systems. Piscataway, USA:IEEE, 1990:67-72.
[2] Burgard W, Moors M, Stachniss C, et al. Coordinated multi-robot exploration[J]. IEEE Transactions on Robotics, 2005, 21(3):376-386.
[3] Qian S H, Ge S R, Wang Y S, et al. Research status of the disaster rescue robot and its applications to the mine rescue[J]. Robot, 2006, 28(3):350-354.
[4] Roberts R, Ta D N, Straub J, et al. Saliency detection and model-based tracking:A two part vision system for small robot navigation in forested environment[M]//Proceedings of SPIE, Vol.8387. Bellingham, USA:SPIE, 2012:No.838705.
[5] Jiang X Z. Servo control of joint driven by two pneumatic muscles in opposing pair configuration for rehabilitation robot[D]. Wuhan:Huazhong University of Science and Technology, 2011.
[6] Tesauro G. TD-gammon, a self-teaching backgammon program, achieves master-level play[J]. Neural Computation, 1994, 6(2):215-219.
[7] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.
[8] Tian Y D, Zhu Y. Better computer Go player with neural network and long-term prediction[EB/OL]. (2016-02-29)[2018-05-01]. https://arxiv.org/pdf/1511.06410.pdf.
[9] Kocsis L, Szepesvári C. Bandit based Monte-Carlo planning[M]//Lecture Notes in Computer Science, Vol.4212. Berlin, Germany:Springer, 2006:282-293.
[10] Zhao T T, Hachiya H, Niu G, et al. Analysis and improvement of policy gradient estimation[J]. Neural Networks, 2012, 26(2):118-129.
[11] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning[EB/OL]. (2013-12-19)[2018-05-01]. https://arxiv.org/pdf/1312.5602.pdf.
[12] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540):529-533.
[13] Watkins C J C H, Dayan P. Technical note:Q-learning[J]. Machine Learning, 1992, 8(3-4):279-292.
[14] Riedmiller M. Neural fitted Q iteration-First experiences with a data efficient neural reinforcement learning method[C]//16th European Conference on Machine Learning. Berlin, Germany:Springer-Verlag, 2005:317-328.
[15] Lange S, Riedmiller M. Deep auto-encoder neural networks in reinforcement learning[C]//International Joint Conference on Neural Networks. Piscataway, USA:IEEE, 2010.
[16] Farahmand A M, Nabi S, Nikovski D N. Deep reinforcement learning for partial differential equation control[C]//American Control Conference. Piscataway, USA:IEEE, 2017:3120-3127.
[17] Kober J, Bagnell J A, Peters J. Reinforcement learning in robotics:A survey[J]. International Journal of Robotics Research, 2013, 32(11):1238-1274.
[18] Barto A G, Mahadevan S. Recent advances in hierarchical reinforcement learning[J]. Discrete Event Dynamic Systems, 2003, 13(1-2):41-77.
[19] Wen N, Liu Z H, Zhu L P, et al. Deep reinforcement learning and its application on autonomous shape optimization for morphing aircrafts[J]. Journal of Astronautics, 2017, 38(11):1153-1159.
[20] Parisi S, Ramstedt S, Peters J. Goal-driven dimensionality reduction for reinforcement learning[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, USA:IEEE, 2017:4634-4639.
[21] Duan Y, Chen X, Houthooft R, et al. Benchmarking deep reinforcement learning for continuous control[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:2001-2014.
[22] Laskey M, Chuck C, Lee J, et al. Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA:IEEE, 2017:358-365.
[23] Thananjeyan B, Garg A, Krishnan S, et al. Multilateral surgical pattern cutting in 2D orthotropic gauze with deep reinforcement learning policies for tensioning[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA:IEEE, 2017:2371-2378.
[24] Chebotar Y, Hausman K, Zhang M, et al. Combining model-based and model-free updates for trajectory-centric reinforcement learning[C]//34th International Conference on Machine Learning. USA:International Machine Learning Society, 2017:1173-1185.
[25] Shi G Q. Implementation of omni-directional walking and high-level decision for humanoid robots in RoboCup3D simulation system[D]. Hefei:Hefei University of Technology, 2010.
[26] Popov I, Heess N, Lillicrap T, et al. Data-efficient deep reinforcement learning for dexterous manipulation[EB/OL]. (2017-04-10)[2018-05-01]. https://arxiv.org/pdf/1704.03073.pdf.
[27] Inoue T, de Magistris G, Munawar A, et al. Deep reinforcement learning for high precision assembly tasks[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, USA:IEEE, 2017:819-825.
[28] Kirchner F. Q-learning of complex behaviours on a six-legged walking machine[J]. Robotics and Autonomous Systems, 1998, 25(3-4):253-262.
[29] Hart S, Grupen R. Learning generalizable control programs[J]. IEEE Transactions on Autonomous Mental Development, 2011, 3(3):216-231.
[30] Lin L J. Self-improving reactive agents based on reinforcement learning, planning and teaching[J]. Machine Learning, 1992, 8(3-4):293-321.
[31] Zhang Q C, Lin M, Yang L T, et al. Energy-efficient scheduling for real-time systems based on deep Q-learning model[J]. IEEE Transactions on Sustainable Computing, 2017. DOI:10.1109/TSUSC.2017.2743704.
[32] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay[EB/OL]. (2016-02-25)[2018-05-01]. https://arxiv.org/pdf/1511.05952.pdf.
[33] Thrun S, Schwartz A. Issues in using function approximation for reinforcement learning[C]//Proceedings of the 1993 Connectionist Models Summer School. Mahwah, USA:Lawrence Erlbaum Associates, 1994:255-263.
[34] van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning[EB/OL]. (2015-12-08)[2018-05-01]. https://arxiv.org/pdf/1509.06461.pdf.
[35] Ngai D C K, Yung N H C. Double action Q-learning for obstacle avoidance in a dynamically changing environment[C]//IEEE Intelligent Vehicles Symposium. Piscataway, USA:IEEE, 2005:211-216.
[36] van Hasselt H. Double Q-learning[C]//24th Annual Conference on Neural Information Processing Systems. USA:Curran Associates Inc., 2010:2613-2621.
[37] Wang Z Y, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:2939-2947.
[38] Tai L, Li S H, Liu M. A deep-network solution towards model-less obstacle avoidance[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, USA:IEEE, 2016:2759-2764.
[39] Tai L, Liu M. A robot exploration strategy based on Q-learning network[C]//IEEE International Conference on Real-Time Computing and Robotics. Piscataway, USA:IEEE, 2016:57-62.
[40] Sasaki H, Horiuchi T, Kato S. Experimental study on behavior acquisition of mobile robot by deep Q-network[J]. Journal of Advanced Computational Intelligence and Intelligent Informatics, 2017, 21(5):840-848.
[41] Miyazaki K, Kimura H, Kobayashi S, et al. Theory and applications of reinforcement learning based on profit sharing[J]. Journal of Japanese Society for Artificial Intelligence, 1999, 14(5):800-807.
[42] Bai T Z, Yang J N, Chen J, et al. Double-task deep Q-Learning with multiple views[C]//2017 IEEE International Conference on Computer Vision Workshops. Piscataway, USA:IEEE, 2018:1050-1058.
[43] Gu S X, Lillicrap T, Sutskever I, et al. Continuous deep Q-learning with model-based acceleration[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:4135-4148.
[44] Arulkumaran K, Deisenroth M, Brundage M, et al. A brief survey of deep reinforcement learning[EB/OL]. (2017-09-28)[2018-05-01]. https://arxiv.org/pdf/1708.05866.pdf.
[45] Kakade S, Langford J. Approximately optimal approximate reinforcement learning[C]//19th International Conference on Machine Learning. San Francisco, USA:Morgan Kaufmann Publishers Inc., 2002:267-274.
[46] Schulman J, Levine S, Moritz P, et al. Trust region policy optimization[C]//32nd International Conference on Machine Learning. USA:International Machine Learning Society, 2015:1889-1897.
[47] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms[C]//31st International Conference on Machine Learning. USA:International Machine Learning Society, 2014:605-619.
[48] Rosenstein M T, Barto A G. Supervised actor-critic reinforcement learning[M]//Handbook of Learning and Approximate Dynamic Programming. Piscataway, USA:Wiley-IEEE Press, 2004:359-380.
[49] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[EB/OL]. (2016-02-29)[2018-05-01]. https://arxiv.org/pdf/1509.02971.pdf.
[50] Heess N, Tb D, Sriram S, et al. Emergence of locomotion behaviours in rich environments[EB/OL]. (2017-07-10)[2018-05-01]. https://arxiv.org/pdf/1707.02286.pdf.
[51] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[EB/OL]. (2017-08-28)[2018-05-01]. https://arxiv.org/pdf/1707.06347.pdf.
[52] Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[EB/OL]. (2016-09-09)[2018-05-01]. https://arxiv.org/pdf/1506.02438.pdf.
[53] Huang P H, Hasegawa O. Learning quadcopter maneuvers with concurrent methods of policy optimization[J]. Journal of Advanced Computational Intelligence and Intelligent Informatics, 2017, 21(4):639-649.
[54] Tai L, Paolo G, Liu M. Virtual-to-real deep reinforcement learning:Continuous control of mobile robots for mapless navigation[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, USA:IEEE, 2017:31-36.
[55] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:2850-2869.
[56] Williams R J, Peng J. Function optimization using connectionist reinforcement learning algorithms[J]. Connection Science, 1991, 3(3):241-268.
[57] Babaeizadeh M, Frosio I, Tyree S, et al. Reinforcement learning through asynchronous advantage actor-critic on a GPU[EB/OL]. (2017-03-02)[2018-05-01]. https://arxiv.org/pdf/1611.06256.pdf.
[58] Gu S X, Holly E, Lillicrap T, et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates[C]//IEEE International Conference on Robotics and Automation. Piscataway, USA:IEEE, 2017:3389-3396.
[59] Ghadirzadeh A, Maki A, Kragic D, et al. Deep predictive policy training using reinforcement learning[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, USA:IEEE, 2017:2351-2358.
[60] Espeholt L, Soyer H, Munos R, et al. IMPALA:Scalable distributed deep-RL with importance weighted actor-learner architectures[EB/OL]. (2018-06-28)[2018-08-01]. https://arxiv.org/pdf/1802.01561.pdf.
[61] Jaderberg M, Mnih V, Czarnecki W M, et al. Reinforcement learning with unsupervised auxiliary tasks[EB/OL]. (2016-11-16)[2018-05-01]. https://arxiv.org/pdf/1611.05397.pdf.
[62] Mirowski P, Pascanu R, Viola F, et al. Learning to navigate in complex environments[EB/OL]. (2017-01-13)[2018-05-01]. https://arxiv.org/pdf/1611.03673.pdf.
[63] Deisenroth M P, Rasmussen C E. PILCO:A model-based and data-efficient approach to policy search[C]//28th International Conference on Machine Learning. New York, USA:ACM, 2011:465-472.
[64] Gal Y, Ghahramani Z. Dropout as a Bayesian approximation:Representing model uncertainty in deep learning[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:1651-1660.
[65] Xia C, El Kamel A. Neural inverse reinforcement learning in autonomous navigation[J]. Robotics and Autonomous Systems, 2016, 84:1-14.
[66] Abbeel P, Ng A Y. Apprenticeship learning via inverse reinforcement learning[C]//21st International Conference on Machine Learning. New York, USA:ACM, 2004:1-8.
[67] Ho J, Gupta J K, Ermon S. Model-free imitation learning with policy optimization[C]//33rd International Conference on Machine Learning. USA:International Machine Learning Society, 2016:4036-4046.
[68] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2018-05-01]. https://arxiv.org/pdf/1406.2661.pdf.
[69] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks[EB/OL]. (2016-01-07)[2018-05-01]. https://arxiv.org/pdf/1511.06434.pdf.
[70] Ho J, Ermon S. Generative adversarial imitation learning[EB/OL]. (2016-06-10)[2018-05-01]. https://arxiv.org/pdf/1606.03476.pdf.
[71] Merel J, Tassa Y, Tb D, et al. Learning human behaviors from motion capture by adversarial imitation[EB/OL]. (2017-06-07)[2018-05-01]. https://arxiv.org/pdf/1707.02201.pdf.
[72] Tai L, Zhang J W, Liu M, et al. Socially compliant navigation through raw depth inputs with generative adversarial imitation learning[EB/OL]. (2018-02-26)[2018-05-01]. https://arxiv.org/pdf/1710.02543.pdf.
[73] Hecht-Nielsen R. Theory of the backpropagation neural network[C]//International Joint Conference on Neural Networks. Piscataway, USA:IEEE, 1989:593-605.