Journal of Guangxi Normal University (Natural Science Edition), 2023, Vol. 41, Issue (4): 96-108. doi: 10.16088/j.issn.1001-6600.2022103101

• Research Article •

Microblog Opinion Summarization Method Based on Transformer and TextRank

SUN Xu1, SHEN Bin1, YAN Xin1,2*, ZHANG Jinpeng3,4, XU Guangyi5

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China;
  2. Key Laboratory of Artificial Intelligence in Yunnan Province (Kunming University of Science and Technology), Kunming, Yunnan 650500, China;
  3. School of Computer Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China;
  4. School of Information, Yunnan University of Finance and Economics, Kunming, Yunnan 650221, China;
  5. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming, Yunnan 650040, China
  • Received: 2022-10-31 Revised: 2023-03-16 Online: 2023-07-25 Published: 2023-09-06
  • Corresponding author: YAN Xin (born 1969), female, native of Chongqing, associate professor at Kunming University of Science and Technology. E-mail: kg_yanxin@sina.com
  • Funding:
    National Natural Science Foundation of China (U21B2027, 61972186); Yunnan Provincial Key Research and Development Program (202103AA080015); Yunnan Provincial Basic Research Program (202001AS070014); Yunnan Provincial Science and Technology Talent and Platform Program (202105AC160018)

Abstract: Previous research on microblog opinion summarization has not considered the sentiment associations among microblog texts. To address this problem, this paper proposes a microblog opinion summarization method based on Transformer and TextRank. First, the character vectors of the texts are encoded and quantized by the encoder and the quantization space of a Transformer. Second, semantic clustering on the quantization results divides the microblog text set into opinion categories, and the important categories are selected for summary extraction. Third, the sentiment feature vector is concatenated with the feature vector of each microblog text. Fourth, a TextRank algorithm that incorporates sentiment features is applied within each category, and the microblog text with the highest weight is extracted as the summary text. Finally, the most representative summary texts from all categories are combined to obtain the final microblog opinion summary. Experimental results show that, after the sentiment polarity influence factor is added, all ROUGE values of the proposed method improve markedly over the baseline method: the highest F-measure values for ROUGE-1, ROUGE-2 and ROUGE-SU4 reach 0.4937, 0.2555 and 0.2706, respectively. This indicates that the proposed method is effective for the task of extracting microblog opinion summaries.

Key words: sentiment feature, opinion summarization, semantic clustering, summary extraction, Transformer, TextRank

CLC number: TP391.1
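The pipeline outlined in the abstract (quantize, cluster, select categories, concatenate sentiment features, rank with TextRank) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`quantize`, `cosine_matrix`, `textrank`, `summarize`) are hypothetical, the codebook is a fixed toy array rather than one learned with a Transformer encoder, cluster "importance" is assumed to mean cluster size, edge weights are assumed to be cosine similarities, and the sentiment vectors are assumed to be one-hot polarity indicators.

```python
import numpy as np

def quantize(vectors, codebook):
    """Assign each text vector to its nearest codebook entry (used as cluster id)."""
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def cosine_matrix(x):
    """Pairwise cosine similarity with negatives clipped and a zero diagonal."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    sim = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)
    return sim

def textrank(sim, damping=0.85, iters=100, tol=1e-8):
    """Weighted PageRank-style power iteration over a similarity graph."""
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-normalise; a node with no edges spreads its score uniformly.
    trans = np.where(col_sums > 0, sim / safe, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (trans @ scores)
        converged = np.abs(new - scores).sum() < tol
        scores = new
        if converged:
            break
    return scores

def summarize(texts, text_vecs, sentiment_vecs, codebook, top_k=2):
    """Pick the top-ranked text from each of the top_k largest clusters."""
    labels = quantize(text_vecs, codebook)
    ids, counts = np.unique(labels, return_counts=True)
    # Assumption: "important" categories are the largest ones.
    chosen = ids[np.argsort(-counts, kind="stable")][:top_k]
    # Concatenate sentiment features onto the text features before ranking.
    feats = np.concatenate([text_vecs, sentiment_vecs], axis=1)
    summary = []
    for c in chosen:
        members = np.flatnonzero(labels == c)
        scores = textrank(cosine_matrix(feats[members]))
        summary.append(texts[members[scores.argmax()]])
    return summary

# Toy data (entirely made up for illustration).
texts = ["battery is great", "battery lasts long", "battery drains fast",
         "screen is dim", "screen too dark"]
text_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
sentiment_vecs = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])

summary = summarize(texts, text_vecs, sentiment_vecs, codebook, top_k=2)
print(summary)
```

In this sketch the sentiment dimensions raise the similarity between texts that share a polarity, so a text that agrees with the dominant sentiment of its category tends to receive a higher TextRank score than one that only shares the topic.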
