Abstract: A monocular visual odometry method based on a generative adversarial network (GAN) and a self-attention mechanism is proposed, named SAGANVO (SAGAN visual odometry). It applies a generative adversarial learning framework to the tasks of depth estimation and visual odometry: the GAN generates realistic target frames, from which the scene depth map and the 6-DoF (degree-of-freedom) camera pose are accurately recovered. Meanwhile, a self-attention mechanism is incorporated into the network to improve its ability to learn scene details and edge contours. Finally, the proposed model is evaluated on the public KITTI dataset and compared with existing methods, and the results show that SAGANVO outperforms state-of-the-art methods in both depth estimation and visual odometry.
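To make the self-attention component concrete, below is a minimal NumPy sketch of a SAGAN-style self-attention block of the kind the abstract describes: features are projected by three 1x1 convolutions (f, g, h), an attention map over all spatial positions is computed, and the attended output is added back through a learned scale gamma. The weights here are random stand-ins for learned parameters, and all names (`self_attention`, `c_reduced`, `gamma`) are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(s, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, c_reduced=2, gamma=0.5, seed=0):
    """SAGAN-style self-attention over a (C, H, W) feature map.

    f and g project to a reduced channel dimension; h keeps C channels.
    The (N, N) attention map lets every position attend to every other,
    which is what helps the network capture long-range structure such as
    edge contours. Weights are random stand-ins for learned 1x1 convs.
    """
    rng = np.random.default_rng(seed)
    C, H, W = x.shape
    xf = x.reshape(C, H * W)                   # flatten spatial positions
    Wf = rng.standard_normal((c_reduced, C)) * 0.1
    Wg = rng.standard_normal((c_reduced, C)) * 0.1
    Wh = rng.standard_normal((C, C)) * 0.1
    f, g, h = Wf @ xf, Wg @ xf, Wh @ xf        # 1x1-conv projections
    attn = softmax(f.T @ g, axis=0)            # column j: weights over all i
    o = h @ attn                               # attended features, (C, N)
    y = gamma * o + xf                         # residual with learned scale
    return y.reshape(C, H, W)

x = np.random.default_rng(1).standard_normal((8, 4, 4))
y = self_attention(x)
print(y.shape)  # (8, 4, 4): output keeps the input feature-map shape
```

In the actual network this block would sit inside the generator (and discriminator), with the projections implemented as trainable 1x1 convolutions and gamma initialized to zero so attention is blended in gradually during training.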