An Indoor Scene Recognition Method Combining Global and Saliency Region Features
NIU Jie1,2, BU Xiongzhu1, QIAN Kun3, LI Zhong2
1. School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China;
2. School of Electrical and Electronic Engineering, Changzhou College of Information Technology, Changzhou 213164, China;
3. School of Automation, Southeast University, Nanjing 210096, China
NIU Jie, BU Xiongzhu, QIAN Kun, LI Zhong. An Indoor Scene Recognition Method Combining Global and Saliency Region Features. ROBOT, 2015, 37(1): 122-128. DOI: 10.13973/j.cnki.robot.2015.122.
Abstract
Conventional scene recognition methods perform markedly worse in indoor environments. To address this, an indoor scene recognition method for mobile robots is proposed that combines global and saliency region features. An improved BoW (bag-of-words) model is used to classify the indoor scene. In parallel, a visual attention method extracts the most salient and second most salient regions of the scene image, which are fed into an improved BDBN (bilinear deep belief network) model that learns image features automatically and predicts the scene category. A piecewise discriminant strategy then fuses the results of the two models and outputs the final scene recognition result. The method is applied to a real robot platform and to the MIT indoor scene dataset, which contains 67 categories. Experimental results show that, compared with the conventional BoW model, the method improves recognition accuracy by more than 10%. In addition, the method achieves an average accuracy of 44.3% on the MIT dataset, outperforming the algorithms in the related literature.
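The abstract does not give the form of the piecewise discriminant strategy, so the following Python sketch only illustrates the general idea of fusing two classifiers by confidence bands. The function names (`average_regions`, `fuse_piecewise`), the thresholds `high` and `low`, and the averaging rule in the middle band are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def average_regions(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Average the class scores of the two salient regions (assumed rule)."""
    if len(first) != len(second):
        raise ValueError("score vectors must have the same length")
    return [(a + b) / 2.0 for a, b in zip(first, second)]


def fuse_piecewise(
    bow_scores: Sequence[float],
    region_scores: Sequence[float],
    high: float = 0.6,  # hypothetical threshold: trust the BoW model alone
    low: float = 0.3,   # hypothetical threshold: trust the region model alone
) -> int:
    """Hypothetical piecewise fusion of two per-class score vectors.

    bow_scores:    class scores from the global (BoW-style) model
    region_scores: class scores from the salient-region (BDBN-style) model
    Returns the index of the predicted class.
    """
    if len(bow_scores) != len(region_scores):
        raise ValueError("score vectors must have the same length")

    bow_best = argmax(bow_scores)
    bow_conf = bow_scores[bow_best]

    if bow_conf >= high:
        # Band 1: the global model is confident, so keep its decision.
        return bow_best
    if bow_conf < low:
        # Band 2: the global model is unsure, so defer to the region model.
        return argmax(region_scores)
    # Band 3: intermediate confidence, so average both score vectors.
    combined = [(b + r) / 2.0 for b, r in zip(bow_scores, region_scores)]
    return argmax(combined)


if __name__ == "__main__":
    regions = average_regions([0.2, 0.7, 0.1], [0.0, 0.9, 0.1])
    print(fuse_piecewise([0.5, 0.25, 0.25], regions))
```

In this sketch the score vectors are assumed to be normalized per-class confidences, and real thresholds would have to be tuned on validation data.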