Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (3): 1-12. DOI: 10.16088/j.issn.1001-6600.2021071302

• Review •


Progress of Cross-modal Retrieval Methods Based on Representation Learning

DU Jinfeng, WANG Hairong*, LIANG Huan, WANG Dong   

  1. Department of Computer Science and Engineering, North Minzu University, Yinchuan, Ningxia 750021, China
  • Received: 2021-07-13  Revised: 2021-10-09  Online: 2022-05-25  Published: 2022-05-27
  • Corresponding author: WANG Hairong (1977—), female, a native of Shizuishan, Ningxia; Ph.D., associate professor at North Minzu University. E-mail: bmdwhr@163.com
  • Supported by: Natural Science Foundation of Ningxia (2020AAC03218); Ningxia Provincial Cultivation Project (PY1906); Ningxia Talent Project (KJT2019002)


Abstract: The rapid growth of multi-modal data has created demand for cross-modal retrieval applications and spurred research on cross-modal retrieval methods. This paper traces the latest progress in the field, surveying representation-learning-based cross-modal retrieval methods at home and abroad; it defines the cross-modal retrieval problem and reviews the field's common techniques, mainstream models, widely used datasets, evaluation methods, and main challenges. Representation-learning-based methods are introduced from three perspectives: statistical correlation analysis, graph regularization, and metric learning, and their strengths and weaknesses are analyzed. To compare them, 14 methods are reproduced and evaluated on four datasets. The experimental results show that methods based on statistical correlation analysis are efficient to train and easy to implement; methods based on graph regularization achieve semantic association by mining intra-modal and inter-modal similarities; and methods based on metric learning preserve, as far as possible, the semantic similarity/dissimilarity relations of the data in a common subspace. This review of the state of representation-learning-based cross-modal retrieval provides a reference for further research on cross-modal retrieval methods.
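
To make the surveyed families concrete, a hedged illustration in our own notation (not taken from the paper): the statistical-correlation-analysis line builds on canonical correlation analysis (CCA), which learns projection vectors $w_x$ and $w_y$ for paired image features $x$ and text features $y$ by maximizing their correlation in the common subspace:

$$\max_{w_x,\,w_y}\ \rho=\frac{w_x^{\top}\Sigma_{xy}w_y}{\sqrt{w_x^{\top}\Sigma_{xx}w_x}\,\sqrt{w_y^{\top}\Sigma_{yy}w_y}}$$

Here $\Sigma_{xx}$ and $\Sigma_{yy}$ are the within-modality covariance matrices and $\Sigma_{xy}$ is the cross-modality covariance. Graph-regularization methods typically augment such a projection objective with a Laplacian penalty, for example

$$\min_{U_x,\,U_y}\ \|XU_x-YU_y\|_F^2+\lambda\,\operatorname{tr}(F^{\top}LF)$$

where $F$ stacks the projected representations of both modalities and $L=D-W$ is the Laplacian of similarity graphs built within and across modalities, so that semantically related items stay close in the common space.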

Key words: multi-modal data, cross-modal retrieval, statistical correlation analysis, graph regularization, metric learning
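
For the metric-learning family, the following minimal PyTorch-style sketch (our illustration under assumed feature dimensions, not the authors' code) shows the core idea: project both modalities into one subspace and train with a hinge-based triplet loss so that matched image-text pairs score higher than any mismatched in-batch pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSubspace(nn.Module):
    """Two linear heads mapping image/text features into a common space."""
    def __init__(self, img_dim=4096, txt_dim=300, dim=256):  # dims are assumptions
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def forward(self, img, txt):
        # L2-normalize so that inner products are cosine similarities
        return (F.normalize(self.img_proj(img), dim=-1),
                F.normalize(self.txt_proj(txt), dim=-1))

def triplet_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge triplet loss over in-batch negatives."""
    sim = img_emb @ txt_emb.t()        # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)      # scores of the matched image-text pairs
    cost_txt = (margin + sim - pos).clamp(min=0)      # wrong text ranked too high
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # wrong image ranked too high
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_txt.masked_fill(mask, 0).mean()
            + cost_img.masked_fill(mask, 0).mean())
```

Averaging over all in-batch negatives is one design choice; VSE++-style variants instead keep only the hardest negative per query.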

CLC number: TP391.3
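
The comparative evaluation presupposes a retrieval measure; mean average precision (MAP) is the measure most commonly reported in this literature, although whether the paper uses exactly this variant (full-ranking MAP rather than, say, MAP@50) is an assumption. A minimal sketch:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance labels in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(sim, query_labels, gallery_labels):
    """sim[i, j]: similarity of query i (one modality) to gallery item j (the other)."""
    aps = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)  # most similar gallery items first
        relevant = (gallery_labels[order] == query_labels[i]).astype(int)
        aps.append(average_precision(relevant))
    return float(np.mean(aps))
```

Cross-modal retrieval papers usually report MAP in both directions (image-to-text and text-to-image) along with their average.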