Abstract: A monocular visual odometry method based on a convolutional long short-term memory (LSTM) network and a convolutional neural network (CNN) is proposed, named LSTM visual odometry (LSTMVO). LSTMVO uses an unsupervised end-to-end deep learning framework to simultaneously estimate the 6-DoF (degree-of-freedom) pose of a monocular camera and the scene depth. The overall framework comprises a pose estimation network and a depth estimation network. The pose estimation network is a deep recurrent convolutional neural network (RCNN) that performs monocular pose estimation end to end, consisting of feature extraction based on convolutional neural networks and temporal modeling based on recurrent neural networks (RNN). The depth estimation network generates dense depth maps based primarily on an encoder-decoder architecture. In addition, a new loss function for network training is proposed, composed of a temporal (time-series) loss, a depth smoothness loss, and a forward-backward consistency loss over the image sequence. Experimental results on the KITTI dataset show that, when trained on raw monocular RGB images, LSTMVO outperforms existing mainstream monocular visual odometry methods in both pose estimation accuracy and depth estimation accuracy, verifying the effectiveness of the proposed deep learning framework.
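To make the architecture description concrete, the sketch below shows one possible PyTorch-style realisation of the two branches and of the smoothness term of the loss. It follows only what the abstract states (CNN feature extraction plus a convolutional LSTM for pose, an encoder-decoder for depth, and a loss combining temporal, smoothness, and consistency terms); the class names (PoseNet, DepthNet, ConvLSTMCell), layer widths, and loss weights are illustrative assumptions, not the authors' implementation, and the temporal and consistency terms are left as inputs because they depend on view-synthesis details the abstract does not give.

```python
# Hypothetical PyTorch-style sketch; layer widths, class names, and loss
# weights are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: the four gates are computed by 2-D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class PoseNet(nn.Module):
    """CNN feature extraction + ConvLSTM temporal modelling; regresses a
    6-DoF relative pose (3 translation + 3 rotation) per stacked frame pair."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # stacked image pair -> features
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.clstm = ConvLSTMCell(64, 64)
        self.head = nn.Conv2d(64, 6, 1)               # per-location pose, pooled below

    def forward(self, pairs):                         # pairs: (B, T, 6, H, W)
        feats = [self.encoder(pairs[:, i]) for i in range(pairs.shape[1])]
        state = (torch.zeros_like(feats[0]), torch.zeros_like(feats[0]))
        poses = []
        for f in feats:                               # recurrent pass over the sequence
            state = self.clstm(f, state)
            poses.append(self.head(state[0]).mean(dim=(2, 3)))  # (B, 6)
        return torch.stack(poses, dim=1)              # (B, T, 6)


class DepthNet(nn.Module):
    """Encoder-decoder producing a dense positive depth map for a single frame."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),
        )

    def forward(self, img):                           # img: (B, 3, H, W)
        return self.dec(self.enc(img))                # (B, 1, H, W)


def smoothness_loss(depth, img):
    """Edge-aware smoothness: penalise depth gradients, down-weighted at image edges."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()


def total_loss(temporal, depth, img, consistency, w_smooth=0.1, w_cons=0.2):
    """Weighted sum of the three terms named in the abstract; the temporal
    (photometric warping) and forward-backward consistency terms are assumed
    to be computed elsewhere from the pose/depth outputs. Weights are placeholders."""
    return temporal + w_smooth * smoothness_loss(depth, img) + w_cons * consistency
```

As a usage example under these assumptions, a forward pass such as `PoseNet()(torch.randn(1, 3, 6, 128, 416))` returns a (1, 3, 6) tensor of relative poses for a three-pair KITTI-sized sequence, and `DepthNet()(torch.randn(1, 3, 128, 416))` returns the corresponding dense depth map.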