Abstract: This paper reviews the research status and recent progress of monocular camera-based visual relocalization and introduces several key methods. In contrast to the existing vertical classification frameworks for relocalization methods, an intuitive and unified horizontal classification framework is proposed, organized around three aspects: scene model construction, scene information matching, and camera pose solving. Within this framework, deep-learning-based and geometric-structure-based methods are treated uniformly for the first time. Based on in-depth performance analysis and visualization results, the factors that lead to performance bottlenecks and the remaining challenges of camera pose estimation are identified, and state-of-the-art pose estimation methods are analyzed and summarized. Finally, future development trends of visual relocalization methods are discussed.
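To make the three aspects of the framework concrete, the following is a minimal sketch (not taken from the paper) of a structure-based relocalization step: it assumes a precomputed scene model (3D points from SfM, each with an ORB descriptor), a calibrated pinhole camera, and OpenCV; all names other than the OpenCV calls are illustrative.

```python
# Minimal sketch of a three-stage visual relocalization pipeline:
# (1) scene model: precomputed 3D points + descriptors (assumed given),
# (2) scene information matching: 2D-3D descriptor matching,
# (3) camera pose solving: PnP inside a RANSAC loop.
import cv2
import numpy as np

def relocalize(query_img, map_points_3d, map_descriptors, K, dist_coeffs=None):
    """Estimate the 6-DoF pose of a grayscale query image against a sparse 3D map.

    map_points_3d   : (N, 3) float32 array of scene points from SfM
    map_descriptors : (N, 32) uint8 array of ORB descriptors, one per 3D point
    K               : (3, 3) camera intrinsic matrix
    """
    # 1) Extract local features from the query image.
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, query_desc = orb.detectAndCompute(query_img, None)
    if query_desc is None:
        return None

    # 2) Match query descriptors against the scene model (Hamming distance,
    #    Lowe's ratio test) to obtain 2D-3D correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(query_desc, map_descriptors, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 6:
        return None  # too few correspondences for a reliable PnP solution

    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in good])
    pts_3d = np.float32([map_points_3d[m.trainIdx] for m in good])

    # 3) Solve the camera pose from the 2D-3D correspondences with PnP + RANSAC.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, dist_coeffs, reprojectionError=4.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix, world -> camera
    return R, tvec, inliers
```

The deep-learning-based alternatives surveyed in the paper replace individual stages of this sketch (e.g., learned descriptors or scene coordinate regression in place of ORB matching, or direct pose regression in place of PnP), but the same construction-matching-solving decomposition applies.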