Acquisition of Comparable Corpora and Its Application in CLIR

Journal of Guangxi Normal University (Natural Science Edition), 2010, Vol. 28, Issue 3: 126-130.

FANG Lu¹, GE Yun-dong¹, HONG Yu¹, YAO Jian-ming¹,²

1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China;
2. Office of Science and Technology of Suzhou, Suzhou, Jiangsu 215006, China

Received: 2010-06-05 Published: 2010-09-20 Released online: 2023-02-06
Corresponding author: YAO Jian-ming (born 1971), male, native of Laoting, Hebei; associate professor at Soochow University, Ph.D. E-mail: jyao@suda.edu.cn
Funding:
Supported by the National Natural Science Foundation of China (60970057)

Abstract

This paper studies the construction of web-based comparable corpora and their application in cross-language information retrieval (CLIR). First, news articles are collected from news websites and aligned using Lucene to build a comparable corpus. Second, translation knowledge is extracted from the aligned articles using contextual information. Finally, the extracted translation knowledge is applied in CLIR experiments on the TDT4 corpus. The experiments show that the extracted translation knowledge improves CLIR performance: the system achieves a MAP of 0.2728, 35.44% higher than the method based on a local dictionary.

Key words: comparable corpora, translation knowledge extraction, context vector, cross-language information retrieval, query translation
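
The abstract does not spell out how translation knowledge is extracted from context, so the following is only a minimal sketch of the general context-vector idea, not the authors' implementation. It assumes a small seed dictionary, a fixed co-occurrence window, and cosine similarity as the ranking score; the function names and the toy sentences are hypothetical and exist purely for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Map a source-language context vector into the target vocabulary
    using a (hypothetical) seed dictionary; unknown words are dropped."""
    projected = Counter()
    for word, count in vector.items():
        for translation in seed_dict.get(word, ()):
            projected[translation] += count
    return projected


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_translations(word, src_vectors, tgt_vectors, seed_dict, top_n=3):
    """Rank target-language words by similarity to the projected context of `word`."""
    projected = project(src_vectors[word], seed_dict)
    scored = [(cosine(projected, vec), cand) for cand, vec in tgt_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for score, cand in scored[:top_n] if score > 0]


# Toy "aligned" articles (invented data, already tokenized).
src_sentences = [["银行", "提高", "利率"], ["银行", "发放", "贷款"]]
tgt_sentences = [["bank", "raises", "rate"], ["bank", "issues", "loan"],
                 ["team", "wins", "match"]]
seed_dict = {"提高": ["raises"], "利率": ["rate"], "发放": ["issues"], "贷款": ["loan"]}

src_vectors = context_vectors(src_sentences)
tgt_vectors = context_vectors(tgt_sentences)
best = rank_translations("银行", src_vectors, tgt_vectors, seed_dict, top_n=1)
print(best)  # ['bank']
```

In a query-translation setting, the top-ranked candidates for each out-of-dictionary query term would be added to the translated query; how candidates are weighted or filtered is a design choice this sketch leaves open.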

CLC number: TP391

FANG Lu, GE Yun-dong, HONG Yu, YAO Jian-ming. Acquisition of Comparable Corpora and Its Application in CLIR[J]. Journal of Guangxi Normal University (Natural Science Edition), 2010, 28(3): 126-130.
