Microblog Opinion Summarization Method Based on Transformer and TextRank

doi:10.16088/j.issn.1001-6600.2022103101

Journal of Guangxi Normal University (Natural Science Edition), 2023, Vol. 41, Issue 4: 96-108.

SUN Xu¹, SHEN Bin¹, YAN Xin¹,²*, ZHANG Jinpeng³,⁴, XU Guangyi⁵

1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China;
2. Key Laboratory of Artificial Intelligence in Yunnan Province (Kunming University of Science and Technology), Kunming, Yunnan 650500, China;
3. School of Computer Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China;
4. School of Information, Yunnan University of Finance and Economics, Kunming, Yunnan 650221, China;
5. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming, Yunnan 650040, China

Received: 2022-10-31 Revised: 2023-03-16 Online: 2023-07-25 Published: 2023-09-06
Corresponding author: YAN Xin (born 1969), female, native of Chongqing, associate professor at Kunming University of Science and Technology. E-mail: kg_yanxin@sina.com
Funding:
National Natural Science Foundation of China (U21B2027, 61972186); Yunnan Provincial Key Research and Development Program (202103AA080015); Yunnan Provincial Basic Research Program (202001AS070014); Yunnan Provincial Science and Technology Talent and Platform Program (202105AC160018)

Abstract

Existing studies do not consider the sentiment associations among microblog texts. To address this, this paper proposes a microblog opinion summarization method based on Transformer and TextRank. First, the character vectors of the texts are encoded and quantized by the encoder and the quantization space of a Transformer. Second, semantic clustering is performed on the quantization results to divide the microblog text set into opinion categories, and the important categories are selected for summary extraction. Third, the sentiment feature vector is concatenated with the feature vector of each microblog text. Fourth, a TextRank algorithm that incorporates the sentiment features is applied within each category, and the microblog text with the highest weight is extracted as the summary text. Finally, the most representative summary texts from all categories are combined to form the final microblog opinion summary. Experimental results show that, after the sentiment polarity influence factor is added, all ROUGE values of the proposed method improve markedly over the baseline methods. The highest F-measure values of ROUGE-1, ROUGE-2 and ROUGE-SU4 reach 0.4937, 0.2555 and 0.2706, respectively, which demonstrates that the proposed method is effective for microblog opinion summary extraction.

Key words: sentiment feature, opinion summarization, semantic clustering, summary extraction, Transformer, TextRank
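The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the damping factor, or how "important" categories are chosen, so the sketch assumes cosine similarity, a damping factor of 0.85, and cluster size as the importance criterion. The codebook is passed in as a plain list, whereas in the paper it would come from the quantized Transformer space. All function and parameter names (`assign_clusters`, `textrank`, `summarize`, `top_clusters`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_clusters(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))
            for vec in vectors]


def textrank(features, damping=0.85, iters=50):
    """Weighted PageRank over a cosine-similarity graph; one score per node."""
    n = len(features)
    if n == 1:
        return [1.0]
    # Negative similarities are clipped to 0 so edge weights stay non-negative.
    w = [[max(cosine(features[i], features[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight of each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n + damping * sum(
                      w[j][i] / out[j] * scores[j]
                      for j in range(n) if out[j] > 0)
                  for i in range(n)]
    return scores


def summarize(texts, text_vecs, sentiment_vecs, codebook, top_clusters=2):
    """Pick the top-ranked text from each of the largest clusters."""
    labels = assign_clusters(text_vecs, codebook)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    # Assumption: "important" categories are the most populated ones.
    chosen = sorted(clusters.values(), key=len, reverse=True)[:top_clusters]
    summary = []
    for members in chosen:
        # Concatenate semantic and sentiment features before ranking.
        feats = [list(text_vecs[i]) + list(sentiment_vecs[i]) for i in members]
        scores = textrank(feats)
        best = members[max(range(len(members)), key=scores.__getitem__)]
        summary.append(texts[best])
    return summary
```

Calling `summarize(texts, text_vecs, sentiment_vecs, codebook)` with one feature vector and one sentiment vector per microblog returns one representative text per selected category. Because sentiment features are concatenated before the similarity graph is built, texts that share both topic and polarity with many neighbors are ranked higher than texts that match on topic alone.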

CLC number: TP391.1

SUN Xu, SHEN Bin, YAN Xin, ZHANG Jinpeng, XU Guangyi. Microblog Opinion Summarization Method Based on Transformer and TextRank[J]. Journal of Guangxi Normal University (Natural Science Edition), 2023, 41(4): 96-108.

