Journal of Guangxi Normal University (Natural Science Edition), 2019, Vol. 37, Issue (1): 89-97. doi: 10.16088/j.issn.1001-6600.2019.01.010

• Special Column: The 24th National Academic Conference on Information Retrieval •

Study on the Automatic Alignment of Mandarin-Indonesian Bilingual Texts

ZHENG Kengtao1, LIN Nankai1, FU Yingwen1, WANG Lianxi2, JIANG Shengyi1,2*

  1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China;
  2. Guangzhou Key Laboratory of Intelligent Processing for Non-Common Languages (Eastern Language Processing Center), Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China
  • Received: 2018-09-27 Published: 2019-01-08
  • Corresponding author: JIANG Shengyi (born 1963), male, from Longhui, Hunan; professor at Guangdong University of Foreign Studies. E-mail: jiangshengyi@163.com
  • Funding:
    National Natural Science Foundation of China (61572145); National Social Science Fund of China, Youth Project (17CTQ045); Major Basic Research Project and Major Applied Research Project of the Guangdong Provincial Department of Education (2017KZDXM031); 2018 Guangdong Special Fund for Cultivating Science and Technology Innovation among University Students (pdjhb0177)

Abstract: Bilingual parallel corpora are an important resource for multilingual natural language processing and are widely used in machine translation, machine-aided human translation, translation knowledge extraction, and cross-language information retrieval. This paper addresses the automatic alignment of Chinese-Indonesian parallel corpora and the automatic extraction of comparable corpora. First, a paragraph alignment method that combines anchor points with a dictionary is proposed. On this basis, a length model based on confidence intervals is used to perform sentence alignment. In addition, to speed up construction of the Chinese-Indonesian parallel corpus, a comparable corpus extraction method based on cross-language document similarity is proposed. Experimental results show that the proposed alignment and extraction methods achieve significantly higher accuracy than traditional methods, indicating that they are effective and feasible.

Key words: parallel corpus, corpus construction, comparable corpus, paragraph alignment, sentence alignment
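
As a rough illustration of the length-based idea mentioned in the abstract, the sketch below accepts a candidate sentence pair only when the Indonesian sentence length falls inside a confidence interval around the length expected from the Chinese sentence. This is not the authors' implementation: the normal-distribution assumption, the function names, and the values of `ratio`, `variance`, and `z` are all hypothetical placeholders that would need to be estimated from real aligned data.

```python
import math


def length_interval(src_len, ratio, variance, z=1.96):
    """Return (low, high) bounds on the plausible target length.

    Assumption (not from the paper): the target length is roughly
    Normal(ratio * src_len, variance * src_len), as in classic
    length-based sentence aligners.
    """
    mean = ratio * src_len
    half_width = z * math.sqrt(variance * src_len)
    return mean - half_width, mean + half_width


def is_plausible_pair(src_len, tgt_len, ratio=3.0, variance=9.0, z=1.96):
    """Check whether tgt_len lies inside the interval for src_len.

    ratio and variance are made-up illustrative values (target characters
    per source character and per-character variance), not measured ones.
    """
    low, high = length_interval(src_len, ratio, variance, z)
    return low <= tgt_len <= high


if __name__ == "__main__":
    # A 10-character Chinese sentence against three candidate lengths.
    for candidate_len in (5, 30, 60):
        print(candidate_len, is_plausible_pair(10, candidate_len))
```

In a full aligner, a check like this would typically serve as a filter or a scoring term inside a dynamic-programming search over sentence pairs, rather than as a standalone decision.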

CLC Number: TP391.1