Multimodal Sentiment Analysis Based on Cross-Modal Cross-Attention Network

doi:10.16088/j.issn.1001-6600.2023052701

Abstract

Abstract: Exploiting intra-modal and inter-modal information helps improve the performance of multimodal sentiment analysis. To this end, this paper proposes a multimodal sentiment analysis method based on a cross-modal cross-attention network. First, a VGG-16 network maps the multimodal data into a global feature space while a Swin Transformer network maps the same data into a local feature space. Second, intra-modal self-attention features and inter-modal cross-attention features are constructed. Third, a cross-modal cross-attention fusion module deeply fuses the intra-modal and inter-modal features, making the multimodal feature representation more reliable. Finally, a softmax function produces the sentiment prediction. In experiments on two open-source datasets, CMU-MOSI and CMU-MOSEI, the proposed model achieves accuracies of 45.9% and 54.1%, respectively, on the seven-class task. Compared with the existing MCGMF model, accuracy improves by 0.66% and 2.46%, respectively, a significant gain in overall performance.

Key words: sentiment analysis, multimodal, cross-modal cross-attention, self-attention, local and global features
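
The attention steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' fusion module: the VGG-16 and Swin Transformer extractors, all learned query/key/value projections, and training are omitted, and the feature sizes, the random inputs, the concatenate-then-pool fusion, and the untrained classifier weights are all assumptions made purely for illustration. Self-attention draws queries, keys, and values from one modality; cross-attention draws queries from one modality and keys and values from another.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each output row is a weighted mix of value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def random_features(rng, steps, dim):
    """Hypothetical stand-in for extracted features (steps x dim)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(steps)]


rng = random.Random(0)
DIM, NUM_CLASSES = 8, 7                 # assumed sizes; 7 matches the seven-class task
text = random_features(rng, 5, DIM)     # 5 text steps (assumed)
audio = random_features(rng, 9, DIM)    # 9 audio steps (assumed)

# Intra-modal self-attention: text attends to itself.
intra_text = attention(text, text, text)
# Inter-modal cross-attention: text queries attend to audio keys and values.
text_from_audio = attention(text, audio, audio)

# Assumed fusion: concatenate both feature rows per step, then mean-pool over steps.
fused = [a + b for a, b in zip(intra_text, text_from_audio)]  # list concat -> 2 * DIM
pooled = [sum(col) / len(fused) for col in zip(*fused)]

# Hypothetical untrained linear classifier followed by softmax.
classifier = random_features(rng, NUM_CLASSES, 2 * DIM)
logits = [sum(w * x for w, x in zip(row, pooled)) for row in classifier]
probs = softmax(logits)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the cross-attention output keeps the length of the query sequence, the two feature sets align step by step even though the text and audio sequences differ in length, which is what makes the simple concatenation possible here.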

CLC number: TP391.41

WANG Xuyang, WANG Changrui, ZHANG Jinfeng, XING Mengyi. Multimodal Sentiment Analysis Based on Cross-Modal Cross-Attention Network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2024, 42(2): 84-93.

