Journal of Guangxi Normal University (Natural Science Edition) ›› 2025, Vol. 43 ›› Issue (3): 57-71. DOI: 10.16088/j.issn.1001-6600.2024071702

• Intelligent Information Processing •

Dissimilarity Feature-Driven Decoupled Multimodal Sentiment Analysis

LI Zhixin1,2*, LIU Mingqi1,2

  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin, Guangxi 541004, China;
  2. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2024-07-17  Revised: 2024-11-07  Online: 2025-05-05  Published: 2025-05-14
  • Corresponding author: LI Zhixin (1971—), male, born in Guilin, Guangxi, is a professor and doctoral supervisor at Guangxi Normal University. E-mail: lizx@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (62276073, 61966004); Natural Science Foundation of Guangxi (2019GXNSFDA245018); Innovation Project of Guangxi Graduate Education (YCBZ2024115); Special Fund of the Guangxi "Bagui Scholar" Program

Abstract: Feature decoupling separates the features of different modalities into similarity features and dissimilarity features, thereby easing the imbalance in how much each modality contributes. However, because dissimilarity features carry not only complementary information but also consistent information, they exhibit marked distribution discrepancies. Conventional feature-decoupling methods overlook this inherent conflict within the dissimilarity features, which leads to inaccurate predictions. To address this problem, this paper proposes a dissimilarity feature-driven decomposition network (DFDDN) for multimodal sentiment analysis, which draws on representation learning and contrastive learning to extract more effective features and to enlarge the differences between dissimilarity features. First, a feature extraction module applies a dedicated extraction method to each of the three modalities to obtain more effective features, which also suppresses visual and acoustic noise and helps capture complementary information across modalities. Second, a shared encoder and modality-specific encoders decouple the features of the three modalities, and a multimodal Transformer fuses them. Finally, loss functions are designed to enlarge the differences between the dissimilarity features during optimization. Experiments on two large-scale benchmark datasets, with comparisons against several state-of-the-art methods, show that the proposed model surpasses them on most metrics, confirming its effectiveness and robustness.

Keywords: multimodal sentiment analysis, feature decoupling, pretrained BERT, contrastive learning, representation learning

CLC number: TP391
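
The abstract above outlines a decoupling-and-fusion pipeline: per-modality feature extraction, a shared encoder plus modality-specific encoders, Transformer-based fusion, and losses that push the dissimilarity features apart. The following minimal Python (PyTorch-style) sketch illustrates that general idea only; it is not the authors' DFDDN implementation, and all module names, dimensions, the margin value, and the loss weighting are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledFusionSketch(nn.Module):
    """Shared/private encoders decouple each modality into similarity and
    dissimilarity parts; a small Transformer fuses all parts for prediction."""

    def __init__(self, dim: int = 128, n_modalities: int = 3):
        super().__init__()
        # one shared encoder produces the similarity (modality-invariant) features
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # one private encoder per modality produces the dissimilarity features
        self.private = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(n_modalities)]
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)  # regression head for sentiment intensity

    def forward(self, feats):
        # feats: list of (batch, dim) utterance-level features, e.g. [text, audio, visual]
        sim = [self.shared(x) for x in feats]
        dis = [enc(x) for enc, x in zip(self.private, feats)]
        tokens = torch.stack(sim + dis, dim=1)      # (batch, 2 * n_modalities, dim)
        fused = self.fusion(tokens).mean(dim=1)     # pool the fused tokens
        return self.head(fused), sim, dis


def dissimilarity_margin_loss(dis, margin: float = 0.5):
    # Penalize pairwise cosine similarity above a margin, i.e. push the
    # modality-specific (dissimilarity) features away from each other.
    loss, pairs = torch.zeros(()), 0
    for i in range(len(dis)):
        for j in range(i + 1, len(dis)):
            cos = F.cosine_similarity(dis[i], dis[j], dim=-1)
            loss = loss + F.relu(cos - margin).mean()
            pairs += 1
    return loss / pairs


# Usage sketch: random tensors stand in for extracted text/audio/visual features.
model = DecoupledFusionSketch()
text, audio, visual = (torch.randn(8, 128) for _ in range(3))
pred, sim, dis = model([text, audio, visual])
labels = torch.zeros(8)  # placeholder sentiment scores
total = F.l1_loss(pred.squeeze(-1), labels) + 0.1 * dissimilarity_margin_loss(dis)
total.backward()

A margin on pairwise cosine similarity is only one way to enlarge the gap between modality-specific features; the paper's actual loss design may differ.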
