An End-to-End Weakly Supervised Network Architecture for Event-based Visual Place Recognition
KONG Delei1,2, FANG Zheng1, LI Haojia1, HOU Kuanxu1, JIANG Junjie1
1. Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110169, China; 2. Science and Technology on Near-Surface Detection Laboratory, Wuxi 214000, China
Abstract: Traditional visual place recognition (VPR) methods generally rely on frame-based cameras, which often causes VPR to fail under dramatic illumination changes or fast motion. To overcome this, an end-to-end VPR network using event cameras is proposed, which achieves good VPR performance in challenging environments. The key idea of the proposed algorithm is to first characterize the event streams with the event spike tensor (EST) voxel grid, then extract features using a deep residual network, and finally aggregate the features with an improved VLAD (vector of locally aggregated descriptors) network, thereby realizing end-to-end VPR directly on event streams. Comparison experiments between the proposed method and classical VPR methods are carried out on event-based driving datasets (MVSEC, DDD17) and a synthetic event-stream dataset (Oxford RobotCar). The results show that the proposed method outperforms frame-based VPR methods in challenging scenarios (such as night scenes), with an improvement of approximately 6.61% in Recall@1. To our knowledge, this is the first end-to-end weakly supervised deep network architecture that directly processes event-stream data for the visual place recognition task.
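The first stage of the pipeline described above converts an asynchronous event stream into an EST voxel grid so that a conventional convolutional network can consume it. The following is a minimal NumPy sketch of that voxelization step, assuming events are given as (x, y, t, p) tuples and using a triangular (bilinear-in-time) kernel to distribute each event's polarity across temporal bins; the function name, bin count, and kernel choice here are illustrative, not the paper's exact implementation.

```python
import numpy as np

def est_voxel_grid(events, height, width, num_bins=9):
    """Accumulate events (x, y, t, p) into a (num_bins, H, W) voxel grid.

    Each event's polarity is spread over neighboring temporal bins with a
    triangular kernel max(0, 1 - |t_norm - b|), a common choice for EST-style
    representations. `events` is an (N, 4) array; polarity may be {0,1} or {-1,+1}.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2].astype(float)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)  # map polarity to {-1, +1}

    # Normalize timestamps to the range [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)

    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    for b in range(num_bins):
        # Triangular temporal kernel: weight decays linearly with distance to bin b.
        w = np.maximum(0.0, 1.0 - np.abs(t_norm - b))
        # np.add.at handles repeated (y, x) indices correctly (unbuffered add).
        np.add.at(grid[b], (y, x), p * w)
    return grid
```

The resulting (num_bins, H, W) tensor can then be fed to a residual network as a multi-channel image, which is what makes the remainder of the pipeline end-to-end trainable.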