Abstract:
To fully exploit the emotional correlation between facial expression and body language, and to suppress the redundant information carried by non-key frames, an AADPS (asymmetric attention and dual pooling screening) network for emotion recognition is proposed. The network consists of an intra-frame spatial fusion subnet and an inter-frame spatio-temporal fusion subnet. In the intra-frame spatial fusion subnet, an asymmetric attention mechanism is constructed to capture the latent spatial correlation between facial expression and body language, hierarchically extracting the geometric semantics of body language under the guidance of gesture cues and the intra-frame emotional correlation semantics under the guidance of facial cues. In the inter-frame spatio-temporal fusion subnet, average pooling and max pooling operations generate the video-level emotional semantics, and the similarity between these video-level semantics and the frame-level semantics is measured; on this basis, a dual pooling key-frame screening strategy is designed to suppress the influence of non-peak frames. Experimental results demonstrate that the AADPS network achieves emotion recognition accuracies of 94.95% and 88.69% on the FABO and CAER datasets, respectively, improvements of 12.22% and 13.49% over the baseline network. Compared with single-modal methods based on facial expression or body language alone and multi-modal methods that fuse the two emotional cues, the proposed method delivers superior emotion recognition performance in complex scenarios.
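
For illustration, the dual pooling key-frame screening step described above can be pictured as in the following minimal PyTorch sketch. This is not the paper's implementation: the function name dual_pooling_screening, the choice of cosine similarity as the similarity measure, and the temperature-scaled softmax weighting are assumptions introduced here; the abstract only states that average and max pooling produce video-level semantics whose similarity to frame-level semantics drives the screening.

    import torch
    import torch.nn.functional as F

    def dual_pooling_screening(frame_feats: torch.Tensor, temperature: float = 0.1):
        """Hypothetical sketch of dual pooling key-frame screening.

        frame_feats: (T, D) frame-level emotional semantics for one video.
        Returns video-level semantics weighted toward peak (key) frames,
        plus the per-frame weights.
        """
        # Video-level emotional semantics from the two pooling branches.
        avg_sem = frame_feats.mean(dim=0)          # (D,)
        max_sem = frame_feats.max(dim=0).values    # (D,)

        # Similarity of each frame to both video-level descriptors
        # (cosine similarity is an assumption; the paper may use another measure).
        sim_avg = F.cosine_similarity(frame_feats, avg_sem.unsqueeze(0), dim=1)  # (T,)
        sim_max = F.cosine_similarity(frame_feats, max_sem.unsqueeze(0), dim=1)  # (T,)
        sim = 0.5 * (sim_avg + sim_max)

        # Softmax weighting suppresses frames dissimilar to the video-level
        # semantics (non-peak frames) instead of hard-dropping them.
        weights = torch.softmax(sim / temperature, dim=0)                        # (T,)
        video_sem = (weights.unsqueeze(1) * frame_feats).sum(dim=0)              # (D,)
        return video_sem, weights

    # Usage: 16 frames with 512-D frame-level semantics.
    feats = torch.randn(16, 512)
    video_sem, w = dual_pooling_screening(feats)
    print(video_sem.shape, w.shape)  # torch.Size([512]) torch.Size([16])

The soft weighting shown here is one plausible reading of "suppressing non-peak frames"; a hard top-k selection over the same similarity scores would be an equally valid interpretation of the screening strategy.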