Journal of Guangxi Normal University (Natural Science Edition) ›› 2020, Vol. 38 ›› Issue (2): 51-63. DOI: 10.16088/j.issn.1001-6600.2020.02.006

• CTCIS2019 •

An Automatic Summarization Model Based on Deep Learning for Chinese

LI Weiyong1, LIU Bin2, ZHANG Wei2, CHEN Yunfang2*   

  1. School of Computer and Software, Nanjing Vocational College of Information Technology, Nanjing, Jiangsu 210023, China;
  2. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China
  • Received: 2019-10-08  Published: 2020-04-02
  • Corresponding author: CHEN Yunfang (1976—), male, a native of Zhenjiang, Jiangsu; associate professor and Ph.D. at Nanjing University of Posts and Telecommunications. E-mail: chenyf@njupt.edu.cn
  • Supported by the National Natural Science Foundation of China (61672297) and the 2019 "Qinglan Project" Excellent Teaching Team Program for Jiangsu Higher Education Institutions (Su Jiao Shi [2019] No. 3)


Abstract: Targeting the pictographic and structural characteristics of Chinese, this paper proposes a new abstractive summarization solution that comprises a stroke-based text vector technique and an abstractive summarization model. The stroke-based method encodes strokes, the smallest units from which Chinese characters are composed, and thereby enriches the semantic information of the corresponding word vectors obtained with the Skip-Gram model. The Seq2Seq model is then optimized: a Bi-LSTM mitigates the loss of information in long sequences and supplements reverse-direction information; an attention mechanism on the encoder side computes the influence weights of the different input words on the decoder; and a Beam Search algorithm on the decoder side improves the fluency of the generated sequence. Experiments on the LCSTS dataset show that the proposed model improves both the quality and the readability of generated Chinese text summaries.

Key words: deep learning, abstractive summarization, stroke embedding, Seq2Seq, attention mechanism
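
A short sketch can make the stroke-based text vector technique concrete. The code below is a minimal illustration, not the authors' implementation: it assumes that each word is decomposed into stroke n-grams (subword features in the spirit of [12]), that a word vector is the sum of its stroke n-gram vectors, and that a Skip-Gram objective with negative sampling trains those vectors. The STROKES table and every identifier here are hypothetical placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical stroke table: each character maps to a stroke-class string
    # (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling/dot, 5 turning).
    # A real system would load a full stroke-order dictionary.
    STROKES = {"大": "134", "人": "34", "木": "1234"}

    def stroke_ngrams(word, n_min=3, n_max=5):
        # Concatenate the strokes of all characters, then slice out n-grams.
        s = "".join(STROKES.get(ch, "") for ch in word)
        return [s[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(s) - n + 1)]

    class StrokeSkipGram(nn.Module):
        def __init__(self, ngram_vocab, context_vocab_size, dim=100):
            super().__init__()
            self.ngram2id = {g: i for i, g in enumerate(ngram_vocab)}
            self.in_emb = nn.Embedding(len(ngram_vocab), dim)     # stroke n-gram vectors
            self.out_emb = nn.Embedding(context_vocab_size, dim)  # context word vectors

        def word_vec(self, word):
            # A word vector is the sum of its stroke n-gram vectors.
            ids = torch.tensor([self.ngram2id[g] for g in stroke_ngrams(word)
                                if g in self.ngram2id], dtype=torch.long)
            return self.in_emb(ids).sum(dim=0)

        def loss(self, word, pos_ctx, neg_ctx):
            # Skip-Gram negative-sampling loss for one target word, given
            # observed context ids (pos_ctx) and sampled noise ids (neg_ctx).
            v = self.word_vec(word)
            pos = F.logsigmoid(self.out_emb(pos_ctx) @ v).sum()
            neg = F.logsigmoid(-(self.out_emb(neg_ctx) @ v)).sum()
            return -(pos + neg)

Because parameters live in stroke n-grams rather than opaque word ids, structurally related words share parameters, which is the semantic-enrichment effect the abstract attributes to the stroke encoding.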

CLC number: TP391
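
The optimized Seq2Seq structure described in the abstract follows a standard pattern that can be sketched compactly. The code below is an illustrative single-layer PyTorch version under assumed dimensions, not the paper's implementation: a Bi-LSTM encoder, and a decoder that recomputes attention weights over all encoder states at every step.

    import torch
    import torch.nn as nn

    class Seq2SeqSummarizer(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid=256):
            super().__init__()
            self.hid = hid
            self.emb = nn.Embedding(vocab_size, emb_dim)
            # Bidirectional encoder: each state summarizes both the left and
            # the right context, supplementing reverse-direction information.
            self.encoder = nn.LSTM(emb_dim, hid, bidirectional=True, batch_first=True)
            self.decoder = nn.LSTMCell(emb_dim + 2 * hid, hid)
            self.score = nn.Linear(hid, 2 * hid, bias=False)  # attention scorer
            self.out = nn.Linear(hid + 2 * hid, vocab_size)

        def forward(self, src, tgt):
            # Teacher-forced training pass; returns per-step vocabulary logits.
            enc, _ = self.encoder(self.emb(src))              # (B, S, 2*hid)
            B = src.size(0)
            h = enc.new_zeros(B, self.hid)
            c = enc.new_zeros(B, self.hid)
            ctx = enc.new_zeros(B, 2 * self.hid)
            logits = []
            for t in range(tgt.size(1)):
                x = torch.cat([self.emb(tgt[:, t]), ctx], dim=-1)
                h, c = self.decoder(x, (h, c))
                # Attention: weight every encoder state by its relevance to
                # the current decoder state h, then mix a context vector.
                a = torch.softmax(torch.bmm(enc, self.score(h).unsqueeze(2)), dim=1)
                ctx = (a * enc).sum(dim=1)                    # (B, 2*hid)
                logits.append(self.out(torch.cat([h, ctx], dim=-1)))
            return torch.stack(logits, dim=1)                 # (B, T, vocab)

The Bi-LSTM addresses the long-sequence information loss mentioned in the abstract because every encoder state sees the input from both directions, and the per-step softmax over encoder states is the attention weighting of input words against the current decoder state.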
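Beam Search at the decoding end can be sketched independently of any particular model. Here step, bos, eos, and the beam width are assumed interfaces for illustration: step(prefix) is taken to return candidate (token, log-probability) pairs for the next position, e.g. the top-k entries of the model's log-softmax.

    def beam_search(step, bos, eos, beam=5, max_len=30):
        # Each hypothesis is (token sequence, cumulative log-probability).
        beams = [([bos], 0.0)]
        done = []
        for _ in range(max_len):
            cand = []
            for seq, lp in beams:
                for tok, tok_lp in step(seq):
                    cand.append((seq + [tok], lp + tok_lp))
            cand.sort(key=lambda x: x[1], reverse=True)
            beams = []
            for seq, lp in cand[:beam]:
                (done if seq[-1] == eos else beams).append((seq, lp))
            if not beams:
                break
        done.extend(beams)
        # Length normalization avoids a bias toward overly short summaries.
        return max(done, key=lambda x: x[1] / len(x[0]))[0]

Greedy decoding commits to one token per step; keeping several partial hypotheses and re-ranking them with a length-normalized score is what improves the fluency of the generated sequence.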
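Summary quality on LCSTS is conventionally measured with ROUGE [15], which counts n-gram co-occurrences between a generated summary and a reference. A minimal ROUGE-1 recall, assuming characters as the matching unit (common practice for Chinese), might look like this.

    from collections import Counter

    def rouge_1_recall(candidate: str, reference: str) -> float:
        # Fraction of reference unigrams (characters) covered by the candidate,
        # with per-gram counts clipped to avoid rewarding repetition.
        cand, ref = Counter(candidate), Counter(reference)
        overlap = sum(min(n, cand[g]) for g, n in ref.items())
        return overlap / max(sum(ref.values()), 1)

    print(rouge_1_recall("模型生成摘要", "参考摘要"))  # toy example: 0.5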
[1] LUHN H P. The automatic creation of literature abstracts [J]. IBM Journal of Research and Development, 1958, 2(2): 159-165. DOI: 10.1147/rd.22.0159.
[2] ZHANG Suiyuan, XUE Yuanhai, YU Xiaoming, et al. Research on multi-document short summary generation techniques [J]. Journal of Guangxi Normal University (Natural Science Edition), 2019, 37(2): 60-74. DOI: 10.16088/j.issn.1001-6600.2019.02.008.
[3] LOPYREV K. Generating news headlines with recurrent neural networks [EB/OL]. (2015-12-05) [2019-10-08]. https://arxiv.org/abs/1512.01712.
[4] SONG Jun, HAN Xiaoyu, HUANG Yu, et al. An entity-oriented evolutionary multi-document summarization method [J]. Journal of Guangxi Normal University (Natural Science Edition), 2015, 33(2): 36-41. DOI: 10.16088/j.issn.1001-6600.2015.02.006.
[5] CHO K, van MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1724-1734. DOI: 10.3115/v1/D14-1179.
[6] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [EB/OL]. (2016-05-19) [2019-10-08]. https://arxiv.org/abs/1409.0473v7.
[7] ZHANG Yangsen, CAO Yuanda, YU Shiwen. A model and algorithm for automatic error detection in Chinese text based on combined rules and statistics [J]. Journal of Chinese Information Processing, 2006, 20(4): 1-7, 55. DOI: 10.3969/j.issn.1003-0077.2006.04.001.
[8] HU Baotian, CHEN Qingcai, ZHU Fangze. LCSTS: a large scale Chinese short text summarization dataset [EB/OL]. (2015-06-19) [2019-10-08]. https://arxiv.org/abs/1506.05865.
[9] RUSH A M, CHOPRA S, WESTON J. A neural attention model for abstractive sentence summarization [C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2015: 379-389. DOI: 10.18653/v1/D15-1044.
[10] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model [J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.
[11] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems: Volume 2. Cambridge, MA: MIT Press, 2014: 2672-2680.
[12] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information [J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146. DOI: 10.1162/tacl_a_00051.
[13] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems: Volume 2. Cambridge, MA: MIT Press, 2014: 3104-3112.
[14] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [EB/OL]. (2014-09-01) [2019-10-08]. https://arxiv.org/abs/1409.0473v7.
[15] LIN C Y, HOVY E. Automatic evaluation of summaries using N-gram co-occurrence statistics [C]// Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Volume 1. Stroudsburg, PA: Association for Computational Linguistics, 2003: 71-78. DOI: 10.3115/1073445.1073465.
[16] YU Jinxing, JIAN Xun, XIN Hao, et al. Joint embeddings of Chinese words, characters, and fine-grained subcharacter components [C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2017: 286-291. DOI: 10.18653/v1/D17-1027.
[17] LUONG T, PHAM H, MANNING C D. Effective approaches to attention-based neural machine translation [C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2015: 1412-1421. DOI: 10.18653/v1/D15-1166.
[18] Term frequency by inverse document frequency [M]// LIU Ling, ÖZSU M T. Encyclopedia of Database Systems. Boston, MA: Springer, 2009. DOI: 10.1007/978-0-387-39940-9_3784.