Journal of Guangxi Normal University (Natural Science Edition) ›› 2026, Vol. 44 ›› Issue (1): 91-101. DOI: 10.16088/j.issn.1001-6600.2025040903
WANG Xuyang*, MA Jin
Abstract: In multimodal sentiment analysis, sentiment information is difficult to fuse effectively across modalities because non-verbal modality information is underused, cross-modal interaction lacks fine-grained association modeling, and hierarchical semantic fusion mechanisms remain incomplete. To address this, this paper proposes a multimodal sentiment analysis method combining cross-modal feature enhancement with hierarchical MLP communication. The method builds a progressive fusion architecture: it first enhances the non-verbal modalities through a cross-modal attention mechanism, capturing many-to-many fine-grained cross-modal interactions; it then applies a hierarchical MLP communication module, in which parallel and stacked MLP blocks are designed along the modality-fusion and temporal-modeling dimensions respectively, achieving hierarchical feature interaction in both the horizontal and vertical directions and effectively improving the accuracy and expressiveness of sentiment understanding. Experimental results show that on CMU-MOSI the proposed model improves Acc2 and F1 over the next-best model by 0.89 and 0.77 percentage points respectively, and that on CMU-MOSEI it outperforms the baseline models on all metrics in the comparative experiments, reaching an Acc2 of 86.34% and an F1 of 86.25%.
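To make the described architecture concrete, below is a minimal PyTorch sketch of the two stages the abstract summarizes: cross-modal attention that enhances a non-verbal stream with text, followed by MLP communication blocks that mix features horizontally (across the fused feature dimension) and vertically (across time), in the spirit of MLP-Mixer-style models. All class names, dimensions, and the exact mixing order are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalEnhance(nn.Module):
    """Enhance a non-verbal stream (query) with text features (key/value)."""
    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, nonverbal: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Many-to-many interaction: every non-verbal step attends to every text step.
        enhanced, _ = self.attn(query=nonverbal, key=text, value=text)
        return self.norm(nonverbal + enhanced)  # residual keeps the original signal

class HierarchicalMLP(nn.Module):
    """Horizontal mixing over the feature (modality-fusion) dimension,
    then vertical mixing over the temporal dimension."""
    def __init__(self, d: int, t: int):
        super().__init__()
        # Horizontal: channel-mixing MLP applied position-wise.
        self.channel_mlp = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
        # Vertical: time-step-mixing MLP applied channel-wise.
        self.time_mlp = nn.Sequential(nn.Linear(t, 2 * t), nn.GELU(), nn.Linear(2 * t, t))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, t, d)
        x = x + self.channel_mlp(self.norm1(x))           # horizontal interaction
        y = self.norm2(x).transpose(1, 2)                 # (batch, d, t)
        return x + self.time_mlp(y).transpose(1, 2)       # vertical interaction

# Toy usage with hypothetical shapes: 8 clips, 20 time steps, 128-dim features.
text, audio, vision = (torch.randn(8, 20, 128) for _ in range(3))
enhance = CrossModalEnhance(d=128)
mixer = HierarchicalMLP(d=128, t=20)
fused = mixer(enhance(audio, text) + enhance(vision, text))
print(fused.shape)  # torch.Size([8, 20, 128])

In this sketch the channel-mixing MLP stands in for the parallel blocks on the modality-fusion dimension and the time-mixing MLP for the stacked blocks on the temporal-modeling dimension; the actual layer counts, fusion order, and prediction head of the proposed model are not specified by the abstract alone.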
CLC number: TP391.1