Abstract:
To fully exploit the emotional correlation between facial expression and body language, and to suppress the redundant information carried by non-key frames, an AADPS (asymmetric attention and dual pooling screening) network for emotion recognition is proposed. The network consists of an intra-frame spatial fusion subnet and an inter-frame spatio-temporal fusion subnet. In the intra-frame spatial fusion subnet, an asymmetric attention mechanism is constructed to capture the latent spatial correlation between facial expression and body language, hierarchically extracting the geometric semantics of body language under the guidance of gesture cues and the intra-frame emotional correlation semantics under the guidance of facial cues. In the inter-frame spatio-temporal fusion subnet, average pooling and max pooling operations generate the video-level emotional semantics, and the similarity between these video-level semantics and the frame-level semantics is measured; on this basis, a dual pooling key-frame screening strategy is designed to suppress the influence of non-peak frames. Experimental results demonstrate that the AADPS network achieves emotion recognition accuracies of 94.95% and 88.69% on the FABO and CAER datasets, respectively, improvements of 12.22% and 13.49% over the baseline network. Compared with single-modal methods based on facial expression or body language alone and multi-modal methods that fuse the two emotional cues, the proposed method delivers superior emotion recognition performance in complex scenarios.
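
For illustration, the dual pooling key-frame screening step described above can be pictured as in the following minimal PyTorch sketch. This is not the paper's implementation: the function name dual_pooling_screening, the choice of cosine similarity as the similarity measure, and the temperature-scaled softmax weighting are assumptions introduced here; the abstract only states that average and max pooling produce video-level semantics whose similarity to frame-level semantics drives the screening.

    import torch
    import torch.nn.functional as F

    def dual_pooling_screening(frame_feats: torch.Tensor, temperature: float = 0.1):
        """Hypothetical sketch of dual pooling key-frame screening.

        frame_feats: (T, D) frame-level emotional semantics for one video.
        Returns video-level semantics weighted toward peak (key) frames,
        plus the per-frame weights.
        """
        # Video-level emotional semantics from the two pooling branches.
        avg_sem = frame_feats.mean(dim=0)          # (D,)
        max_sem = frame_feats.max(dim=0).values    # (D,)

        # Similarity of each frame to both video-level descriptors
        # (cosine similarity is an assumption; the paper may use another measure).
        sim_avg = F.cosine_similarity(frame_feats, avg_sem.unsqueeze(0), dim=1)  # (T,)
        sim_max = F.cosine_similarity(frame_feats, max_sem.unsqueeze(0), dim=1)  # (T,)
        sim = 0.5 * (sim_avg + sim_max)

        # Softmax weighting suppresses frames dissimilar to the video-level
        # semantics (non-peak frames) instead of hard-dropping them.
        weights = torch.softmax(sim / temperature, dim=0)                        # (T,)
        video_sem = (weights.unsqueeze(1) * frame_feats).sum(dim=0)              # (D,)
        return video_sem, weights

    # Usage: 16 frames with 512-D frame-level semantics.
    feats = torch.randn(16, 512)
    video_sem, w = dual_pooling_screening(feats)
    print(video_sem.shape, w.shape)  # torch.Size([512]) torch.Size([16])

The soft weighting shown here is one plausible reading of "suppressing non-peak frames"; a hard top-k selection over the same similarity scores would be an equally valid interpretation of the screening strategy.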