Journal of Guangxi Normal University (Natural Science Edition) ›› 2019, Vol. 37 ›› Issue (2): 60-74. doi: 10.16088/j.issn.1001-6600.2019.02.008


Research on Short Summary Generation of Multi-Document

ZHANG Suiyuan1,2,3, XUE Yuanhai1,2*, YU Xiaoming1,2, LIU Yue1,2, CHENG Xueqi1,2

  1. Key Laboratory of Network Data Science and Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2018-11-02  Online: 2019-04-25  Published: 2019-04-28
  • Corresponding author: XUE Yuanhai (b. 1987), male, from Yuxi, Yunnan; Ph.D., senior engineer at the Institute of Computing Technology, Chinese Academy of Sciences. E-mail: xueyuanhai@ict.ac.cn
  • Supported by:
    National Key Research and Development Program of China (2017YFB0803302)

Abstract: Automatic summarization compresses a long article into a short text that captures the central content of the original. The high redundancy of multiple documents and the limited display space of electronic devices pose challenges for summarization. This paper proposes a coarse-grained sentence ranking method that incorporates graph-convolution features. First, the sentence-to-sentence similarity matrix is treated as a topological graph, and graph convolution is applied to it to obtain graph-convolution features. Then, a ranking model fuses these features with mainstream extractive multi-document summarization techniques to rank sentences by importance, and the top four sentences are selected as the summary. Finally, a short-summary generation model based on the Seq2seq framework is proposed: (1) the encoder is built on a convolutional neural network (CNN); (2) an attention-based pointer mechanism is introduced, into which a topic vector is incorporated. Experimental results show that, in this setting, a CNN-based encoder parallelizes better than a recurrent neural network (RNN), significantly improving efficiency while keeping effectiveness essentially unchanged. In addition, compared with traditional extraction-and-compression models, the proposed model achieves significant improvements in ROUGE scores and in readability (informativeness and fluency).

Key words: multi-document, short summary generation, Seq2seq
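The ranking stage described in the abstract treats the sentence similarity matrix as a graph and derives graph-convolution features from it. The following is an illustrative sketch only, not the authors' implementation: the feature matrix X, the weight W, the symmetric normalization, and the ReLU activation are all assumptions; it shows the generic form of one graph-convolution layer, H = ReLU(D^(-1/2)(A+I)D^(-1/2) X W), applied to a toy similarity matrix.

```python
import numpy as np

def graph_conv_features(sim, feats, weight):
    """One generic graph-convolution layer over a sentence similarity matrix.

    sim    : (n, n) symmetric similarity matrix, treated as graph adjacency
    feats  : (n, d) per-sentence feature matrix (an assumption of this sketch)
    weight : (d, k) trainable projection (here just a fixed random matrix)
    """
    a_hat = sim + np.eye(sim.shape[0])            # add self-loops: A + I
    deg = a_hat.sum(axis=1)                       # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^(-1/2)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(a_norm @ feats @ weight, 0.0)  # ReLU activation

# Toy example: 4 sentences, 3-dim input features, 2-dim output features.
rng = np.random.default_rng(0)
sim = np.array([[0.0, 0.8, 0.1, 0.0],
                [0.8, 0.0, 0.2, 0.1],
                [0.1, 0.2, 0.0, 0.7],
                [0.0, 0.1, 0.7, 0.0]])
X = rng.random((4, 3))
W = rng.random((3, 2))
H = graph_conv_features(sim, X, W)
print(H.shape)  # (4, 2): one graph-convolution feature vector per sentence
```

In the paper's pipeline these per-sentence features would then be fed, together with other extractive-summarization signals, into the ranking model that selects the top four sentences.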

CLC number:

  • TP391
[1] China Internet Network Information Center. The 41st statistical report on the development of the Internet in China[EB/OL]. (2018-03-05) [2018-11-02]. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201803/t20180305_70249.htm.
[2] RADEV D R, JING H, TAM D. Centroid-based summarization of multiple documents[J]. Information Processing & Management, 2004, 40(6): 919-938.
[3] MULLON C, SHIN Y, CURY P. NEATS: A network economics approach to trophic systems[J]. Ecological Modelling, 2009, 220(21): 3033-3045.
[4] EVANS D K, KLAVANS J L, MCKEOWN K R. Columbia Newsblaster: multilingual news summarization on the web[C]//Demonstration Papers at HLT-NAACL. Stroudsburg, PA: ACL, 2008: 1-4.
[5] OUYANG Y, LI W, LI S, et al. Applying regression models to query-focused multi-document summarization[J]. Information Processing & Management, 2011, 47(2): 227-237.
[6] MIHALCEA R, TARAU P. TextRank: Bringing order into texts[C]//Proceedings of 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2004: 404-411.
[7] ERKAN G, RADEV D R. LexRank: graph-based lexical centrality as salience in text summarization[J]. Journal of Artificial Intelligence Research, 2004, 22: 457-479.
[8] WAN X, XIAO J. Graph-based multi-modality learning for topic-focused multi-document summarization[C]//Proceedings of the International Joint Conference on Artificial Intelligence IJCAI 2009. California: AAAI, 2009: 1586-1591.
[9] WAN X, YANG J. Improved affinity graph based multi-document summarization[C]//Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Stroudsburg, PA: ACL, 2006: 181-184.
[10] WAN X, YANG J. Multi-document summarization using cluster-based link analysis[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2008: 299-306.
[11] YAN R, YUAN Z, WAN X, et al. Hierarchical graph summarization: Leveraging hybrid information through visible and invisible linkage[M]//Advances in Knowledge Discovery and Data Mining. Berlin Heidelberg: Springer, 2012: 97-108.
[12] WAN X. TimedTextRank: adding the temporal dimension to multi-document summarization[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2007: 867-868.
[13] SWAN R, ALLAN J. Automatic generation of overview timelines[C]//Proceedings of International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM, 2000: 49-56.
[14] CHIEU H L, LEE Y K. Query based event extraction along a timeline[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2004: 425-432.
[15] WHITE M, KORELSKY T, CARDIE C, et al. Multidocument summarization via information extraction[C]//International Conference on Human Language Technology Research. Stroudsburg, PA: ACL, 2001: 1-7.
[16] LI L, WANG D, SHEN C, et al. Ontology-enriched multi-document summarization in disaster management[J]. Information Sciences, 2013, 224(2): 118-129.
[17] DORR B, ZAJIC D, SCHWARTZ R. Hedge Trimmer: a parse-and-trim approach to headline generation[C]//HLT-NAACL 03 Workshop on Text Summarization. Stroudsburg, PA: ACL, 2003: 1-8.
[18] ZAJIC D, DORR B, SCHWARTZ R. Headline generation for written and broadcast news[EB/OL]. (2003-07-01) [2018-11-02]. https://www.researchgate.net/publication/228509374_Headline_generation_for_written_and_broadcast_news.
[19] ALFONSECA E, PIGHIN D, GARRIDO G. HEADY: News headline abstraction through event pattern clustering[C]//Proceedings of the Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2013: 1243-1253.
[20] COLMENARES C A, LITVAK M, MANTRACH A, et al. HEADS: Headline generation as sequence prediction using an abstract feature-rich space[C]//Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2015: 133-142.
[21] BANKO M, MITTAL V O, WITBROCK M J. Headline generation based on statistical translation[C]//Proceedings of the Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2000: 318-325.
[22] SORICUT R, MARCU D. Abstractive headline generation using WIDL-expressions[J]. Information Processing & Management, 2007, 43(6): 1536-1548.
[23] WOODSEND K, FENG Y, LAPATA M. Title generation with quasi-synchronous grammar[C]//Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2010: 513-523.
[24] SUN R, ZHANG Y, ZHANG M, et al. Event-driven headline generation[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2015: 462-472.
[25] XU L, WANG Z, LIU Z, et al. Topic-sensitive neural headline generation[EB/OL]. (2016-08-20) [2018-11-02]. https://arXiv.org/abs/1608.05777v1.
[26] TAN J, WAN X, XIAO J, et al. From neural sentence summarization to headline generation: A coarse-to-fine approach[C]//Twenty-Sixth International Joint Conference on Artificial Intelligence. California: AAAI, 2017: 4109-4115.
[27] GEHRING J, AULI M, GRANGIER D, et al. Convolutional sequence to sequence learning[EB/OL]. (2017-07-25) [2018-11-02]. https://arXiv.org/abs/1705.03122v3.
[28] MOODY C E. Mixing Dirichlet topic models and word embeddings to make lda2vec[EB/OL]. (2016-05-06) [2018-11-02]. http://cn.arXiv.org/abs/1605.02019v1.
Copyright © Editorial Office of Journal of Guangxi Normal University (Natural Science Edition)
Address: 15 Yucai Road, Sanlidian, Guilin, Guangxi, 541004, China
Tel: 0773-5857325  E-mail: gxsdzkb@mailbox.gxnu.edu.cn