Structured Deep Learning Based Depth Estimation from a Monocular Image
LI Yaoyu1,2, WANG Hongmin1, ZHANG Yifan2, LU Hanqing2
1. School of Automation, Harbin University of Science and Technology, Harbin 150080, China;
2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
LI Yaoyu, WANG Hongmin, ZHANG Yifan, LU Hanqing. Structured Deep Learning Based Depth Estimation from a Monocular Image[J]. ROBOT, 2017, 39(6): 812-819. DOI: 10.13973/j.cnki.robot.2017.0812.
Abstract: To extract rich 3D structural features from a monocular image and infer the depth of the scene, a structured deep learning model is proposed for monocular depth estimation. The model combines a novel multi-scale convolutional neural network (CNN) and a continuous conditional random field (CCRF) in a unified deep learning framework. The CNN learns relevant feature representations from the image, and the CCRF refines the CNN output according to the position and color information of the image pixels. Jointly learning the parameters of the CCRF and the CNN can improve the generalization ability of the model. Experiments on the NYU Depth dataset demonstrate the effectiveness and superiority of the model: the average relative error of its predictions is 0.187, the root mean squared error is 0.074, and the average log10 error is 0.671.
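The abstract reports three error measures without defining them. The sketch below computes them under the definitions commonly used for depth-estimation benchmarks, which is an assumption here, since the abstract does not state the exact formulas. The function name `depth_errors` and the toy depth values are hypothetical and are not taken from the paper.

```python
import math

def depth_errors(pred, gt):
    """Return (mean relative error, RMSE, mean log10 error) for paired depths.

    Assumed definitions (not stated in the abstract):
      rel   = mean(|d - g| / g)
      rmse  = sqrt(mean((d - g)^2))
      log10 = mean(|log10(d) - log10(g)|)
    where d is a predicted depth and g the ground-truth depth, both > 0.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in zip(pred, gt)) / n
    return rel, rmse, log10

# Toy depths in metres (hypothetical values for illustration only).
pred = [1.0, 2.0, 4.0]
gt = [1.0, 4.0, 2.0]
rel, rmse, log10_err = depth_errors(pred, gt)
print(rel, rmse, log10_err)
```

In practice these measures would be averaged over all valid pixels of every test image; the flat lists above stand in for such a pixel set.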