• School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China;
ZHAO Dechun, Email: zhaodc@cqupt.edu.cn

In audiovisual emotion recognition, representation learning is an actively studied direction, and its key lies in constructing effective affective representations that capture both cross-modal consistency and modality-specific variability. However, learning accurate affective representations remains challenging. To address this, this paper proposed a cross-modal audiovisual emotion recognition model based on a multi-head cross-attention mechanism. The model achieved feature fusion and modality alignment through a multi-head cross-attention architecture, and adopted a segmented training strategy to cope with missing modalities. In addition, a unimodal auxiliary loss task was designed, and parameters were shared across modalities to preserve the independent information of each modality. The model achieved macro and micro F1 scores of 84.5% and 88.2%, respectively, on the crowd-sourced emotional multimodal actors dataset (CREMA-D). The proposed model effectively captures intra- and inter-modal feature representations of the audio and video modalities, and unifies unimodal and multimodal emotion recognition within a single framework, providing a new solution for audiovisual emotion recognition.
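The abstract does not include code. The following is a minimal PyTorch sketch, under stated assumptions, of how a multi-head cross-attention fusion block with unimodal auxiliary losses might be organized; all module names, dimensions, pooling choices, and the auxiliary loss weight are illustrative and are not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumes audio/video features were already extracted by unimodal encoders.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Audio-visual fusion via multi-head cross-attention (hypothetical layout)."""

    def __init__(self, dim=256, heads=8, num_classes=6):
        super().__init__()
        # Each modality attends to the other to align and fuse features.
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Classification head over the concatenated fused representation.
        self.classifier = nn.Linear(2 * dim, num_classes)
        # Unimodal auxiliary heads (assumed) preserve modality-specific information.
        self.audio_head = nn.Linear(dim, num_classes)
        self.video_head = nn.Linear(dim, num_classes)

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        a_att, _ = self.audio_to_video(query=audio, key=video, value=video)
        v_att, _ = self.video_to_audio(query=video, key=audio, value=audio)
        a_pool, v_pool = a_att.mean(dim=1), v_att.mean(dim=1)  # temporal pooling
        fused_logits = self.classifier(torch.cat([a_pool, v_pool], dim=-1))
        return fused_logits, self.audio_head(a_pool), self.video_head(v_pool)


def total_loss(fused_logits, a_logits, v_logits, labels, aux_weight=0.3):
    """Fused loss plus unimodal auxiliary losses; the weighting is an assumption."""
    ce = nn.functional.cross_entropy
    return ce(fused_logits, labels) + aux_weight * (
        ce(a_logits, labels) + ce(v_logits, labels)
    )
```

In such a layout, a segmented training strategy could, for example, first optimize the unimodal heads and then the fused head, so that a missing modality at inference time can still be handled by the corresponding unimodal branch; the exact schedule used by the authors is not specified in the abstract.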
