A Monocular 3D Target Detection Network with Perspective Projection
ZHANG Junning1, SU Qunxing1,2, LIU Pengyuan1, GU Hongqiang1, WANG Wei3
1. Army Engineering University, Shijiazhuang 050003, China; 2. Army Command College, Nanjing 210016, China; 3. Military Representative Office of Military Equipment Department in Nanjing Area in Wuxi Area, Wuxi 214035, China
Abstract: To address the small number of training constraints and the low prediction accuracy of existing monocular 3D target detection networks, a monocular 3D target detection network with perspective projection is proposed through improvements to the network structure, the establishment of perspective projection constraints, and the optimization of the loss function. Firstly, based on the perspective projection mechanism, a 3D target bounding box model based on the vanishing point (VP) is established using the transformation relationships among the world, camera, and target coordinate frames. Secondly, by combining the spatial geometric relationships with prior size information, this model is simplified into a constraint relationship among the yaw angle, the target size, and the 3D bounding box. Finally, a learning-based azimuth-size loss function built on this constraint relationship is proposed, which takes full advantage of the single-peaked, easy-to-regress nature of the size constraint and thereby improves the learning efficiency and prediction accuracy of the network. In view of the lack of 3D center constraints in monocular 3D target detection networks, a training strategy that jointly constrains the azimuth, size, and 3D center during model training is proposed based on the spatial geometry of the 3D and 2D bounding boxes. Experiments on the KITTI and SUN RGB-D datasets show that the proposed model achieves better results and is more effective than other algorithms in 3D target detection.
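The perspective-projection constraint underlying the abstract can be sketched as follows: given a hypothesized yaw angle, target size, and 3D center, the eight corners of the 3D bounding box are projected through the camera intrinsics, and the resulting 2D extent must agree with the detected 2D bounding box. The snippet below is a minimal illustrative sketch of that geometry, not the paper's implementation; the intrinsic matrix values and the box parameters are assumed KITTI-like examples.

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners (3x8) of a 3D bounding box in camera coordinates.

    center: (x, y, z) box center; dims: (h, w, l); yaw: rotation about the
    camera's vertical (y) axis. Axis conventions here are illustrative.
    """
    h, w, l = dims
    # Corner offsets in the object frame (x: length, y: height, z: width).
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ h/2,  h/2,  h/2,  h/2, -h/2, -h/2, -h/2, -h/2])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])  # rotation about the vertical axis (yaw)
    return R @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

def project_to_2d_box(corners, K):
    """Project 3D corners with intrinsics K; return the tight 2D box (u1, v1, u2, v2)."""
    uvw = K @ corners          # perspective projection onto the image plane
    uv = uvw[:2] / uvw[2]      # divide by depth to get pixel coordinates
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()

# Example with KITTI-like intrinsics (illustrative values, not calibration data).
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
corners = box3d_corners(center=(2.0, 1.5, 20.0), dims=(1.5, 1.6, 3.9), yaw=0.3)
print(project_to_2d_box(corners, K))
```

During training, the discrepancy between this projected extent and the ground-truth 2D box can serve as the joint azimuth-size-center constraint described above, since all three quantities enter the projection.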