Semantic Map Construction Based on Deep Convolutional Neural Network
HU Meiyu1, ZHANG Yunzhou1,2, QIN Cao1, LIU Tongbo3
1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China;
2. Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China;
3. College of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Abstract: Semantic segmentation of images is combined with simultaneous localization and mapping (SLAM) to create a three-dimensional semantic map. The input image sequence is first screened by ORB-SLAM to obtain key frames. An improved semantic segmentation method based on the DeepLab algorithm is then proposed. An up-sampling convolutional network is added after the last layer of the original convolutional network to refine the coarse up-sampling produced by bilinear interpolation. The depth of the key frame is used as a gating signal to select among different convolution operations, so that fine details are preserved for distant objects while larger receptive fields are retained for nearby objects. The segmented image is aligned with the depth map, and a dense three-dimensional semantic map of the scene is then built using the spatial correspondence between adjacent key frames. Experimental results show that, for both indoor and outdoor scenes, the proposed algorithm achieves accurate semantic segmentation and creates a satisfactory semantic map by back-projection into three-dimensional space. Compared with existing methods based on DeepLab and deconvolution algorithms, the proposed algorithm produces a better semantic map.
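The back-projection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard pinhole camera model, a depth map already aligned pixel-for-pixel with the segmentation result, zero depth marking invalid pixels, and a camera-to-world pose for each key frame (for example, one estimated by the SLAM front end). The function name `backproject_labels` and its parameter names are hypothetical.

```python
import numpy as np

def backproject_labels(depth, labels, K, T_wc):
    """Lift one segmented key frame into labelled 3D points in the world frame.

    depth  : (H, W) depth values; 0 marks invalid pixels (assumed convention)
    labels : (H, W) integer class id per pixel, aligned with `depth`
    K      : (3, 3) pinhole intrinsic matrix (fx, fy, cx, cy)
    T_wc   : (4, 4) camera-to-world pose of the key frame (assumed given)
    """
    h, w = depth.shape
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    # Homogeneous camera-frame points, shape (4, N).
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)
    # Transform into the world frame and drop the homogeneous row.
    pts_world = (T_wc @ pts_cam)[:3].T
    return pts_world, labels[valid]
```

Under these assumptions, the labelled points returned for successive key frames could be concatenated to form a dense semantic point cloud; how the paper fuses overlapping observations between adjacent key frames is not specified in the abstract.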