Abstract: Sparse and dense methods are the two main approaches to vision-based simultaneous localization and mapping (VSLAM). This paper reviews the key technologies and latest research progress of both approaches in detail, and discusses their comparative advantages and disadvantages along with the implementation difficulties of the different methods. It also reviews research progress in applying deep learning techniques to VSLAM and discusses how the two approaches can be combined to improve performance. Finally, future research directions for real-time VSLAM are explored.