Journal of Guangxi Normal University (Natural Science Edition) ›› 2019, Vol. 37 ›› Issue (2): 60-74. doi: 10.16088/j.issn.1001-6600.2019.02.008


Research on Short Summary Generation of Multi-Document

ZHANG Suiyuan1,2,3, XUE Yuanhai1,2*, YU Xiaoming1,2, LIU Yue1,2, CHENG Xueqi1,2

  1. Key Laboratory of Network Data Science and Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2018-11-02  Online: 2019-04-25  Published: 2019-04-28
  • Corresponding author: XUE Yuanhai (b. 1987), male, from Yuxi, Yunnan; Ph.D., senior engineer at the Institute of Computing Technology, Chinese Academy of Sciences. E-mail: xueyuanhai@ict.ac.cn
  • Supported by:
    National Key Research and Development Program of China (2017YFB0803302)

Abstract: Automatic summarization compresses a long article into a short text that captures the central content of the original. The high redundancy of multiple documents and the limited display space of electronic devices pose challenges for summarization. This paper proposes a coarse-grained sentence ranking method that incorporates graph-convolution features. First, the sentence-to-sentence similarity matrix is treated as a topological graph, and graph convolution is applied to it to obtain graph-convolution features. Then, a ranking model fuses these features with mainstream extractive multi-document summarization techniques to rank sentences by importance, and the top four sentences are selected as the summary. Finally, a short-summary generation model based on the Seq2seq framework is proposed: (1) the encoder is built on a convolutional neural network (CNN); (2) an attention-based pointer mechanism is introduced, into which a topic vector is incorporated. Experimental results show that, in this setting, a CNN-based encoder parallelizes better than a recurrent neural network (RNN), significantly improving efficiency while keeping effectiveness essentially unchanged. In addition, compared with traditional extraction-and-compression models, the proposed model achieves significant improvements in ROUGE scores and in readability (informativeness and fluency).

Key words: multi-document, short summary generation, Seq2seq
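The ranking stage described in the abstract treats the sentence similarity matrix as a graph and derives graph-convolution features from it. The following is an illustrative sketch only, not the authors' implementation: the feature matrix X, the weight W, the symmetric normalization, and the ReLU activation are all assumptions; it shows the generic form of one graph-convolution layer, H = ReLU(D^(-1/2)(A+I)D^(-1/2) X W), applied to a toy similarity matrix.

```python
import numpy as np

def graph_conv_features(sim, feats, weight):
    """One generic graph-convolution layer over a sentence similarity matrix.

    sim    : (n, n) symmetric similarity matrix, treated as graph adjacency
    feats  : (n, d) per-sentence feature matrix (an assumption of this sketch)
    weight : (d, k) trainable projection (here just a fixed random matrix)
    """
    a_hat = sim + np.eye(sim.shape[0])            # add self-loops: A + I
    deg = a_hat.sum(axis=1)                       # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^(-1/2)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(a_norm @ feats @ weight, 0.0)  # ReLU activation

# Toy example: 4 sentences, 3-dim input features, 2-dim output features.
rng = np.random.default_rng(0)
sim = np.array([[0.0, 0.8, 0.1, 0.0],
                [0.8, 0.0, 0.2, 0.1],
                [0.1, 0.2, 0.0, 0.7],
                [0.0, 0.1, 0.7, 0.0]])
X = rng.random((4, 3))
W = rng.random((3, 2))
H = graph_conv_features(sim, X, W)
print(H.shape)  # (4, 2): one graph-convolution feature vector per sentence
```

In the paper's pipeline these per-sentence features would then be fed, together with other extractive-summarization signals, into the ranking model that selects the top four sentences.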

CLC number:

  • TP391
[1] China Internet Network Information Center. The 41st statistical report on the development of the Internet in China[EB/OL]. (2018-03-05) [2018-11-02]. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201803/t20180305_70249.htm.
[2] RADEV D R, JING H, TAM D. Centroid-based summarization of multiple documents[J]. Information Processing & Management, 2004, 40(6): 919-938.
[3] MULLON C, SHIN Y, CURY P. NEATS: A network economics approach to trophic systems[J]. Ecological Modelling, 2009, 220(21): 3033-3045.
[4] EVANS D K, KLAVANS J L, MCKEOWN K R. Columbia Newsblaster: multilingual news summarization on the web[C]//Demonstration Papers at HLT-NAACL. Stroudsburg, PA: ACL, 2008: 1-4.
[5] OUYANG Y, LI W, LI S, et al. Applying regression models to query-focused multi-document summarization[J]. Information Processing & Management, 2011, 47(2): 227-237.
[6] MIHALCEA R, TARAU P. TextRank: Bringing order into texts[C]//Proceedings of 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2004: 404-411.
[7] ERKAN G, RADEV D R. LexRank: graph-based lexical centrality as salience in text summarization[J]. Journal of Artificial Intelligence Research, 2004, 22: 457-479.
[8] WAN X, XIAO J. Graph-based multi-modality learning for topic-focused multi-document summarization[C]//Proceedings of the International Joint Conference on Artificial Intelligence IJCAI 2009. California: AAAI, 2009: 1586-1591.
[9] WAN X, YANG J. Improved affinity graph based multi-document summarization[C]//Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Stroudsburg, PA: ACL, 2006: 181-184.
[10] WAN X, YANG J. Multi-document summarization using cluster-based link analysis[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2008: 299-306.
[11] YAN R, YUAN Z, WAN X, et al. Hierarchical graph summarization: Leveraging hybrid information through visible and invisible linkage[M]//Advances in Knowledge Discovery and Data Mining. Berlin Heidelberg: Springer, 2012: 97-108.
[12] WAN X. TimedTextRank: adding the temporal dimension to multi-document summarization[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2007: 867-868.
[13] SWAN R, ALLAN J. Automatic generation of overview timelines[C]//Proceedings of International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM, 2000: 49-56.
[14] CHIEU H L, LEE Y K. Query based event extraction along a timeline[C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2004: 425-432.
[15] WHITE M, KORELSKY T, CARDIE C, et al. Multidocument summarization via information extraction[C]//International Conference on Human Language Technology Research. Stroudsburg, PA: ACL, 2001: 1-7.
[16] LI L, WANG D, SHEN C, et al. Ontology-enriched multi-document summarization in disaster management[J]. Information Sciences, 2013, 224(2): 118-129.
[17] DORR B, ZAJIC D, SCHWARTZ R. Hedge Trimmer: a parse-and-trim approach to headline generation[C]//HLT-NAACL 03 Workshop on Text Summarization. Stroudsburg, PA: ACL, 2003: 1-8.
[18] ZAJIC D, DORR B, SCHWARTZ R. Headline generation for written and broadcast news[EB/OL]. (2003-07-01) [2018-11-02]. https://www.researchgate.net/publication/228509374_Headline_generation_for_written_and_broadcast_news.
[19] ALFONSECA E, PIGHIN D, GARRIDO G. HEADY: News headline abstraction through event pattern clustering[C]//Proceedings of the Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2013: 1243-1253.
[20] COLMENARES C A, LITVAK M, MANTRACH A, et al. HEADS: Headline generation as sequence prediction using an abstract feature-rich space[C]//Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2015: 133-142.
[21] BANKO M, MITTAL V O, WITBROCK M J. Headline generation based on statistical translation[C]//Proceedings of the Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2000: 318-325.
[22] SORICUT R, MARCU D. Abstractive headline generation using WIDL-expressions[J]. Information Processing & Management, 2007, 43(6): 1536-1548.
[23] WOODSEND K, FENG Y, LAPATA M. Title generation with quasi-synchronous grammar[C]//Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2010: 513-523.
[24] SUN R, ZHANG Y, ZHANG M, et al. Event-driven headline generation[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsburg, PA: ACL, 2015: 462-472.
[25] XU L, WANG Z, LIU Z, et al. Topic-sensitive neural headline generation[EB/OL]. (2016-08-20) [2018-11-02]. https://arXiv.org/abs/1608.05777v1.
[26] TAN J, WAN X, XIAO J, et al. From neural sentence summarization to headline generation: A coarse-to-fine approach[C]//Twenty-Sixth International Joint Conference on Artificial Intelligence. California: AAAI, 2017: 4109-4115.
[27] GEHRING J, AULI M, GRANGIER D, et al. Convolutional sequence to sequence learning[EB/OL]. (2017-07-25) [2018-11-02]. https://arXiv.org/abs/1705.03122v3.
[28] MOODY C E. Mixing Dirichlet topic models and word embeddings to make lda2vec[EB/OL]. (2016-05-06) [2018-11-02]. http://cn.arXiv.org/abs/1605.02019v1.
Copyright © Editorial Office of Journal of Guangxi Normal University (Natural Science Edition)
Address: 15 Yucai Road, Sanlidian, Guilin, Guangxi, 541004, China
Tel: 0773-5857325  E-mail: gxsdzkb@mailbox.gxnu.edu.cn