Journal of Guangxi Normal University (Natural Science Edition) ›› 2026, Vol. 44 ›› Issue (1): 91-101. DOI: 10.16088/j.issn.1001-6600.2025040903

• Intelligence Information Processing •

Cross-modal Feature Enhancement and Hierarchical MLP Communication for Multimodal Sentiment Analysis

WANG Xuyang*, MA Jin   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu 730050, China
  • Received: 2025-04-09  Revised: 2025-08-25  Online: 2026-01-05  Published: 2026-01-26

Abstract: In multimodal sentiment analysis, effective fusion of sentiment information across modalities is hindered by three challenges: nonverbal modal information is insufficiently utilized, fine-grained associative modeling of cross-modal interactions is inadequately established, and hierarchical semantic fusion mechanisms are imperfectly designed. To address these issues, this paper proposes a multimodal sentiment analysis method with cross-modal feature enhancement and hierarchical MLP communication. A progressive fusion architecture is constructed in which nonverbal modal information is first enhanced through a cross-modal attention mechanism, capturing many-to-many fine-grained cross-modal interactions. A lightweight hierarchical MLP communication module is then designed to perform hierarchical feature interactions along both horizontal and vertical dimensions, achieving deep cross-modal semantic fusion. Experimental results show that on CMU-MOSI, the Acc2 and F1 scores increase by 0.89 and 0.77 percentage points, respectively, over the second-best model; on CMU-MOSEI, all metrics surpass those of the baseline models, with Acc2 and F1 reaching 86.34% and 86.25%.
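The abstract names two concrete components, so a compact sketch may help fix the ideas. Below is a minimal, illustrative PyTorch reading of them: a cross-modal attention block that enhances a nonverbal stream (audio or visual) with text context, and an MLP-Mixer-style communication block that mixes features "horizontally" (across modality tokens) and "vertically" (across feature channels) behind a sigmoid gate, matching the "gating units" keyword. All class names, dimensions, and wiring are assumptions made for illustration; the authors' exact implementation is not given on this page, and this sketch should not be read as their method.

# Hedged sketch of the two components named in the abstract.
# Everything here (names, shapes, wiring) is an illustrative assumption.
import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Enhance a nonverbal stream with text context via multi-head
    cross-attention (nonverbal as query, text as key/value)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nonverbal: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        enhanced, _ = self.attn(query=nonverbal, key=text, value=text)
        return self.norm(nonverbal + enhanced)  # residual keeps the original signal

class HierarchicalMLPBlock(nn.Module):
    """MLP-Mixer-style communication: a 'horizontal' MLP mixes across the
    modality-token axis, a 'vertical' MLP mixes across feature channels,
    and a sigmoid gate controls how much of the mixed signal passes."""
    def __init__(self, n_tokens: int, dim: int, hidden: int = 128):
        super().__init__()
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, hidden), nn.GELU(), nn.Linear(hidden, n_tokens))
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim); each token stands for one modality's features
        h = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        mixed = self.channel_mlp(self.norm2(h))
        return h + self.gate(h) * mixed  # gated residual fusion

# Toy shapes only: batch of 2, 50-step sequences, 64-dim features
text = torch.randn(2, 50, 64)
audio = torch.randn(2, 50, 64)
fused_a = CrossModalEnhancer(64)(audio, text)           # (2, 50, 64)
tokens = torch.stack([text.mean(1), audio.mean(1), fused_a.mean(1)], dim=1)
out = HierarchicalMLPBlock(n_tokens=3, dim=64)(tokens)  # (2, 3, 64)

The gated residual on the last line is one plausible way to realize "gating units": the gate learns, per channel, how much of the mixed cross-modal signal to admit, so an uninformative modality can be attenuated rather than fused wholesale.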

Key words: multimodality, sentiment analysis, cross-modal attention, hierarchical MLP communication, gating units

CLC Number:  TP391.1