Semantic Enhancement-Based Multimodal Sentiment Analysis

doi:10.16088/j.issn.1001-6600.2023022302

Abstract

Abstract: Multimodal sentiment analysis is an important task in natural language processing, and modality fusion is its core problem. Previous studies treat all modalities equally: they neither distinguish primary from secondary modalities nor account for the differences in quality and performance among them. Existing research shows that the text modality usually dominates sentiment analysis, yet non-text modalities carry key features that are indispensable for identifying the correct sentiment. This paper therefore proposes a text-centered modality fusion strategy. An encoder-decoder network with an attention mechanism separates the shared semantics from the private semantics of the different modalities, and the two kinds of semantic enhancement that the non-text modalities provide relative to the text modality are used to supplement the text features, yielding a robust joint multimodal representation from which sentiment is predicted. Experiments on the CMU-MOSI and CMU-MOSEI video sentiment analysis datasets show that the method reaches accuracies of 87.3% and 86.2%, respectively, outperforming many existing state-of-the-art methods.

Key words: sentiment analysis, modality fusion, attention mechanism, shared semantics, private semantics, enhancement and complementation
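
The text-centered fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' network: the abstract does not specify the architecture, so the code below assumes (1) all modalities have already been projected to a common feature dimension, (2) plain scaled dot-product attention stands in for the learned attention-based encoder-decoder, and (3) an orthogonal projection stands in for the learned split into shared and private semantics. All function names and the weights `w_shared` and `w_private` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text, other):
    """Each text token (query) attends over the non-text frames (keys/values).

    Returns one attention-weighted non-text summary per text token.
    """
    scale = math.sqrt(len(text[0]))
    attended = []
    for q in text:
        weights = softmax([dot(q, k) / scale for k in other])
        attended.append([
            sum(w * frame[d] for w, frame in zip(weights, other))
            for d in range(len(q))
        ])
    return attended


def split_shared_private(t, a):
    """Assumed stand-in for the learned split: the component of `a` parallel
    to the text vector `t` is treated as shared, the orthogonal rest as private."""
    norm_sq = dot(t, t)
    coef = dot(a, t) / norm_sq if norm_sq > 0 else 0.0
    shared = [coef * x for x in t]
    private = [x - s for x, s in zip(a, shared)]
    return shared, private


def enhance_text(text, other, w_shared=0.5, w_private=0.5):
    """Add both non-text components to each text token (text stays central)."""
    enhanced = []
    for t, a in zip(text, cross_attend(text, other)):
        shared, private = split_shared_private(t, a)
        enhanced.append([
            x + w_shared * s + w_private * p
            for x, s, p in zip(t, shared, private)
        ])
    return enhanced


# Toy inputs: 2 text tokens and 3 audio frames, all 3-dimensional.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.9], [1.0, 0.0, 0.1]]
fused = enhance_text(text, audio)
```

The output keeps the shape of the text sequence, which reflects the text-centered design choice: non-text modalities only adjust the text representation and never replace it. In a trained model the attention projections, the shared/private separation, and the mixing weights would all be learned rather than fixed as they are here.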

CLC number: TP391.1

GUO Jialiang, JIN Ting. Semantic Enhancement-Based Multimodal Sentiment Analysis[J]. Journal of Guangxi Normal University (Natural Science Edition), 2023, 41(5): 14-25.

