Abstract: For the problem of object detection from 3D point clouds, a high-precision, real-time single-stage deep neural network is proposed, which includes new solutions in three aspects: network feature extraction, loss function design, and data augmentation. First, the point clouds are directly voxelized to build a bird's-eye-view (BEV) representation. In the feature extraction step, residual structures are used to extract high-level semantic features, and multi-level features are combined to output a dense feature map. When regressing object bounding boxes from the BEV, a quadratic offset term is included in the loss function to achieve convergence with higher precision. During training, data augmentation is performed by mixing 3D point clouds from different frames to improve the generalization of the network. Experimental results on the KITTI BEV object detection dataset show that the proposed network, using only the position information of the lidar point cloud, not only outperforms state-of-the-art BEV object detection networks but also surpasses methods that fuse images and point clouds. Moreover, the entire network runs at 20 frames/s, which meets the real-time requirement.
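The first step described above, direct voxelization of the point cloud into a BEV grid, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detection ranges, the 0.1 m resolution, and the number of vertical slices are assumed values chosen for the example, and each voxel simply records binary occupancy from point positions (consistent with the abstract's claim of using only lidar position information).

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                      z_range=(-2.5, 1.0), resolution=0.1, z_slices=35):
    """Voxelize an (N, 3) lidar point cloud into a BEV occupancy grid
    of shape (H, W, z_slices). All range/resolution values here are
    illustrative assumptions, not the paper's actual configuration."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only points inside the detection volume.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[mask], y[mask], z[mask]

    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    dz = (z_range[1] - z_range[0]) / z_slices

    # Map metric coordinates to integer voxel indices.
    xi = ((x - x_range[0]) / resolution).astype(np.int64)
    yi = ((y - y_range[0]) / resolution).astype(np.int64)
    zi = np.clip(((z - z_range[0]) / dz).astype(np.int64), 0, z_slices - 1)

    # Binary occupancy: each vertical slice becomes one BEV channel.
    bev = np.zeros((h, w, z_slices), dtype=np.float32)
    bev[xi, yi, zi] = 1.0
    return bev
```

The resulting (H, W, z_slices) tensor can be fed to a 2D convolutional backbone, treating the vertical slices as input channels, which is the standard way BEV detectors convert an unordered point set into a dense grid.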