Journal of Guangxi Normal University (Natural Science Edition), 2019, Vol. 37, Issue (1): 89-97. doi: 10.16088/j.issn.1001-6600.2019.01.010

Study on the Automatic Alignment of Mandarin-Indonesian Bilingual Texts

ZHENG Kengtao1, LIN Nankai1, FU Yingwen1, WANG Lianxi2, JIANG Shengyi1, 2*   

  1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China;
  2. Eastern Language Processing Center, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China
  • Received: 2018-09-27 Published: 2019-01-08

Abstract: Bilingual parallel corpora are an important resource for multilingual natural language processing and are widely used in machine translation, machine-assisted translation, translation knowledge extraction, and cross-language information retrieval. This paper proposes methods for automatically aligning a Chinese-Indonesian parallel corpus and for automatically extracting a comparable corpus. First, a paragraph alignment method that combines anchor points with a dictionary is proposed. On this basis, a length-based alignment model using confidence intervals is applied to achieve sentence alignment. In addition, to speed up construction of the Chinese-Indonesian parallel corpus, a comparable corpus extraction method based on cross-language document similarity is proposed. The experimental results show that the accuracy of both the parallel corpus alignment method and the comparable corpus extraction method is significantly higher than that of traditional methods, which indicates that the proposed methods are effective and feasible.

Key words: parallel corpus, corpus construction, comparable corpus, paragraph alignment, sentence alignment

CLC Number: 

  • TP391.1
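
The abstract mentions a length-based sentence alignment model that uses a confidence interval, but it does not give the model's form or parameters. The following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: sentence length is measured in characters, the interval is the mean of the target/source length ratio plus or minus `z` standard deviations, and the names `fit_ratio_interval` and `is_plausible_pair` are hypothetical.

```python
from statistics import mean, stdev


def fit_ratio_interval(aligned_pairs, z=1.96):
    """Estimate an acceptance interval for the target/source length ratio.

    aligned_pairs: (source, target) sentence pairs already known to align.
    z: interval half-width in standard deviations (1.96 is an assumed default).
    Needs at least two usable pairs, because stdev() requires two data points.
    """
    # Character-length ratio of each known pair; empty sources are skipped.
    ratios = [len(tgt) / len(src) for src, tgt in aligned_pairs if src]
    mu, sigma = mean(ratios), stdev(ratios)
    return mu - z * sigma, mu + z * sigma


def is_plausible_pair(src, tgt, interval):
    """Return True when the pair's length ratio falls inside the interval."""
    if not src:
        return False
    low, high = interval
    return low <= len(tgt) / len(src) <= high


# Toy seed pairs (illustrative only, not data from the paper).
seed = [
    ("我喜欢读书。", "Saya suka membaca buku."),
    ("今天天气很好。", "Cuaca hari ini sangat bagus."),
    ("他是老师。", "Dia adalah seorang guru."),
]
interval = fit_ratio_interval(seed)
print(is_plausible_pair("她是学生。", "Dia adalah seorang siswa.", interval))
```

A filter like this only rejects candidate pairs whose lengths are implausible; it would presumably be applied within paragraphs that have already been aligned, which is the order the abstract describes.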