广西师范大学学报(自然科学版) ›› 2026, Vol. 44 ›› Issue (1): 91-101. DOI: 10.16088/j.issn.1001-6600.2025040903

• 智能信息处理 •

跨模态特征增强与层次化MLP通信的多模态情感分析

王旭阳*, 马瑾   

  1. 兰州理工大学 计算机与通信学院,甘肃 兰州 730050
  • 收稿日期:2025-04-09 修回日期:2025-08-25 出版日期:2026-01-05 发布日期:2026-01-26
  • 通讯作者: 王旭阳(1974—), 女, 甘肃兰州人, 兰州理工大学教授。E-mail: wxuyang126@126.com
  • 基金资助:
    国家自然科学基金(62161019)

Cross-modal Feature Enhancement and Hierarchical MLP Communication for Multimodal Sentiment Analysis

WANG Xuyang*, MA Jin   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou Gansu 730050, China
  • Received: 2025-04-09 Revised: 2025-08-25 Online: 2026-01-05 Published: 2026-01-26

摘要: 在多模态情感分析任务中,由于非语言模态信息利用不充分、跨模态交互缺乏细粒度关联建模以及层次化语义融合机制不完善,导致不同模态之间的情感信息难以实现有效融合。为此,本文提出一种跨模态特征增强与层次化MLP通信的多模态情感分析方法。该方法构建渐进式融合架构,首先通过跨模态注意力机制增强非语言模态信息,捕捉多对多的跨模态细粒度交互;继而使用层次化MLP通信模块,在模态融合维度与时间建模维度上分别设计并行与堆叠的MLP模块,实现水平与垂直方向的层次化特征交互,有效提升情感理解的准确性与表达能力。实验结果表明,本文模型在CMU-MOSI上,Acc2和F1值较次优模型分别提升0.89和0.77个百分点,在CMU-MOSEI上对比实验各项指标均优于基准模型,Acc2、F1值分别达到86.34%、86.25%。

关键词: 多模态, 情感分析, 跨模态注意力, 层次化MLP通信, 门控单元

Abstract: In multimodal sentiment analysis tasks, effective fusion of sentiment information across modalities is hindered by three challenges: nonverbal modal information is insufficiently utilized, cross-modal interactions lack fine-grained associative modeling, and hierarchical semantic fusion mechanisms remain imperfect. To address these issues, a multimodal sentiment analysis method with cross-modal feature enhancement and hierarchical MLP communication is proposed in this paper. A progressive fusion architecture is constructed: nonverbal modal information is first enhanced through a cross-modal attention mechanism, so that many-to-many fine-grained cross-modal interactions are captured; a lightweight hierarchical MLP communication module then applies parallel and stacked MLP blocks along the modality-fusion and temporal-modeling dimensions respectively, realizing hierarchical feature interaction in both horizontal and vertical directions and achieving deep cross-modal semantic fusion. Experimental results show that on CMU-MOSI the Acc2 and F1 values of the proposed model exceed those of the suboptimal model by 0.89 and 0.77 percentage points respectively, and that on CMU-MOSEI all metrics in the comparative experiments surpass those of the baseline models, with Acc2 and F1 reaching 86.34% and 86.25%.

Key words: multimodality, sentiment analysis, cross-modal attention, hierarchical MLP communication, gating units

中图分类号:  TP391.1
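
说明/Note: The abstract above names two components: cross-modal attention that enhances the nonverbal (audio/visual) streams with text, and a hierarchical MLP communication module whose parallel and stacked MLPs mix features horizontally (across the modality-fusion dimension) and vertically (along time), with gating units controlling how much enhancement is kept. The paper's implementation is not reproduced here; the PyTorch sketch below is only a minimal, hypothetical illustration of those two ideas, and all module names, dimensions, and the gating formulation are the editor's assumptions rather than the authors' code.

```python
# Illustrative sketch only (not the authors' released code): a minimal PyTorch
# approximation of cross-modal enhancement and hierarchical MLP communication,
# with hypothetical dimensions and module names chosen for clarity.
from typing import List

import torch
import torch.nn as nn


class CrossModalEnhance(nn.Module):
    """Enhance a nonverbal modality (audio/visual) with text via cross-modal attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # gating unit: how much enhancement to keep

    def forward(self, nonverbal: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # queries come from the nonverbal stream, keys/values from text,
        # giving many-to-many (token-level) cross-modal interactions
        enhanced, _ = self.attn(nonverbal, text, text)
        g = torch.sigmoid(self.gate(torch.cat([nonverbal, enhanced], dim=-1)))
        return g * enhanced + (1 - g) * nonverbal


class HierarchicalMLPCommunication(nn.Module):
    """Parallel MLPs mix features per modality (horizontal direction);
    stacked MLPs mix information along the temporal axis (vertical direction)."""

    def __init__(self, dim: int, seq_len: int, n_modalities: int = 3, depth: int = 2):
        super().__init__()
        # one "horizontal" MLP per modality, applied in parallel over features
        self.parallel = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_modalities)
        )
        # stacked "vertical" MLPs that mix across time steps
        self.stacked = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(seq_len), nn.Linear(seq_len, seq_len), nn.GELU())
            for _ in range(depth)
        )
        self.head = nn.Linear(n_modalities * dim, 1)  # regression head for sentiment score

    def forward(self, streams: List[torch.Tensor]) -> torch.Tensor:
        # horizontal: per-modality feature mixing with residual connections
        streams = [s + mlp(s) for s, mlp in zip(streams, self.parallel)]
        fused = torch.cat(streams, dim=-1)      # (batch, seq, n_modalities * dim)
        # vertical: temporal mixing over the sequence dimension with residuals
        x = fused.transpose(1, 2)               # (batch, n_modalities * dim, seq)
        for mlp in self.stacked:
            x = x + mlp(x)
        fused = x.transpose(1, 2).mean(dim=1)   # pool over time
        return self.head(fused)


# toy usage: random tensors stand in for aligned text/audio/visual sequences
batch, seq, dim = 2, 20, 64
text, audio, visual = (torch.randn(batch, seq, dim) for _ in range(3))
enhance = CrossModalEnhance(dim)
comm = HierarchicalMLPCommunication(dim, seq)
score = comm([text, enhance(audio, text), enhance(visual, text)])
print(score.shape)  # torch.Size([2, 1])
```

The gated residual in CrossModalEnhance keeps the original nonverbal features when the attention-enhanced signal is unreliable, loosely mirroring the gating units listed in the keywords; the real model's layer counts, fusion order, and output head may differ.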
