Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (6): 69-81. DOI: 10.16088/j.issn.1001-6600.2022022201

• Research Article •

Research on Cross-Language Information Retrieval Method Based on Multi-task Learning

DAI Jiayang, ZHOU Dong*

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan Hunan 411201, China
  • Received: 2022-02-22  Revised: 2022-03-07  Online: 2022-11-25  Published: 2023-01-17
  • Corresponding author: ZHOU Dong (1979—), male, from Changsha, Hunan; professor and doctoral supervisor at Hunan University of Science and Technology. E-mail: dongzhou1979@hotmail.com
  • Supported by: National Natural Science Foundation of China (61876062)

Abstract: Cross-language information retrieval is one of the important tasks in the field of information retrieval. Existing cross-lingual neural retrieval methods usually rely on single-task learning, and this single feature-capture pattern limits the performance of neural retrieval models. Therefore, a cross-language retrieval method based on multi-task learning is proposed. It uses text classification as an auxiliary task and captures the feature information of both tasks simultaneously through a shared text feature extraction layer, so that the layer learns the feature patterns of different tasks; the resulting feature vectors are then fed into the neural retrieval model and the text classification model respectively to complete the two tasks. In addition, the external corpus introduced by the text classification task also serves as a form of data augmentation, further enriching the feature information. Experiments on four language pairs from the CLEF 2000-2003 dataset show that the proposed method markedly improves text feature extraction and thus enhances neural retrieval performance, raising the MAP of the neural retrieval models by 0.012-0.188 and accelerating model convergence by 24.3% on average.

Key words: information retrieval, multi-task learning, cross-language information retrieval, neural retrieval model, external corpus

CLC number: TP391.3
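
The abstract describes a hard-parameter-sharing setup: one shared text feature extraction layer feeds both a neural retrieval model and a text classification model, and the two tasks are trained jointly. The sketch below is a minimal illustration of that idea in PyTorch; the encoder architecture (a bidirectional GRU), all layer sizes, the pairwise scoring head, and the equal weighting of the two losses are assumptions chosen for illustration only and are not taken from the paper.

# Minimal sketch (assumed PyTorch implementation) of a shared text encoder
# with a retrieval scoring head and a text classification head.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Shared text feature extraction layer used by both tasks."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> fixed-size text vector (batch, 2 * hidden_dim)
        embedded = self.embedding(token_ids)
        _, hidden = self.gru(embedded)                 # hidden: (2, batch, hidden_dim)
        return torch.cat([hidden[0], hidden[1]], dim=-1)


class MultiTaskCLIRModel(nn.Module):
    """Retrieval head scores query-document pairs; classification head labels documents."""

    def __init__(self, vocab_size: int, num_classes: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = SharedEncoder(vocab_size, hidden_dim=hidden_dim)
        feat_dim = 2 * hidden_dim
        self.retrieval_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )
        self.classification_head = nn.Linear(feat_dim, num_classes)

    def retrieval_score(self, query_ids: torch.Tensor, doc_ids: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([self.encoder(query_ids), self.encoder(doc_ids)], dim=-1)
        return self.retrieval_head(pair).squeeze(-1)   # (batch,) relevance scores

    def classify(self, doc_ids: torch.Tensor) -> torch.Tensor:
        return self.classification_head(self.encoder(doc_ids))  # (batch, num_classes) logits


# One joint training step on toy data; the two losses are simply summed here,
# since the abstract does not specify the weighting scheme actually used.
model = MultiTaskCLIRModel(vocab_size=30000, num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

query = torch.randint(1, 30000, (8, 12))       # toy batch of query token ids
doc = torch.randint(1, 30000, (8, 200))        # toy batch of document token ids
relevance = torch.rand(8)                      # toy graded relevance labels
doc_labels = torch.randint(0, 4, (8,))         # toy topic labels from an external corpus

optimizer.zero_grad()
retrieval_loss = nn.functional.mse_loss(model.retrieval_score(query, doc), relevance)
classification_loss = nn.functional.cross_entropy(model.classify(doc), doc_labels)
(retrieval_loss + classification_loss).backward()
optimizer.step()

In this shared-encoder arrangement, gradients from the classification loss (computed on an external labeled corpus) update the same encoder parameters used for retrieval, which is the mechanism the abstract credits for the data-augmentation effect.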