Journal of Guangxi Normal University (Natural Science Edition), 2022, Vol. 40, Issue 3: 1-12. DOI: 10.16088/j.issn.1001-6600.2021071302
• Review •
杜锦丰, 王海荣*, 梁焕, 王栋
DU Jinfeng, WANG Hairong*, LIANG Huan, WANG Dong
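Abstract: The rapid growth of multimodal data has created a demand for cross-modal retrieval and spurred research on cross-modal retrieval methods. This paper traces recent progress in the field and examines representation-learning-based cross-modal retrieval methods published in China and abroad. It defines the cross-modal retrieval problem and reviews the field's common techniques, mainstream models, widely used datasets, evaluation methods, and main challenges. Representation-learning-based methods are presented in three groups: statistical correlation analysis, graph regularization, and metric learning, and the strengths and weaknesses of each group are analyzed. To compare these methods, 14 of them are reproduced and evaluated on 4 datasets. The experimental results show that methods based on statistical correlation analysis train efficiently and are easy to implement; methods based on graph regularization establish semantic associations by mining intra-modal and inter-modal similarity; and methods based on metric learning preserve, as far as possible, the semantic similarity and dissimilarity of the data in a common subspace. By presenting the current state of research on representation-learning-based cross-modal retrieval, this paper provides a reference for further work on cross-modal retrieval methods.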
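As an illustration of the statistical correlation analysis family mentioned in the abstract, the following is a minimal sketch of cross-modal retrieval with linear canonical correlation analysis (CCA). The abstract names only the family, so the choice of CCA is an assumption made here for illustration. The function names (`fit_cca`, `rank_by_cosine`, `demo_recall_at_1`), the ridge term `reg`, and the synthetic "image" and "text" features are likewise hypothetical and are not taken from the paper or its experiments.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T


def fit_cca(X, Y, k, reg=1e-3):
    """Fit linear CCA on paired rows of X (n, dx) and Y (n, dy).

    Returns the training means and projection matrices Wx (dx, k), Wy (dy, k).
    `reg` is a small ridge term (an assumption of this sketch) that keeps
    the covariance matrices invertible.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular vectors of the whitened cross-covariance give the canonical directions.
    U, _, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt.T[:, :k]


def rank_by_cosine(queries, gallery):
    """For each query row, return gallery indices sorted from most to least similar."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)


def demo_recall_at_1(seed=0):
    """Image-to-text retrieval on synthetic features that share a latent code."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(200, 4))  # shared semantics behind both modalities
    X = z @ rng.normal(size=(4, 10)) + 0.01 * rng.normal(size=(200, 10))
    Y = z @ rng.normal(size=(4, 8)) + 0.01 * rng.normal(size=(200, 8))
    mx, my, Wx, Wy = fit_cca(X[:150], Y[:150], k=4)  # fit on 150 training pairs
    img = (X[150:] - mx) @ Wx  # project the 50 held-out pairs
    txt = (Y[150:] - my) @ Wy
    ranks = rank_by_cosine(img, txt)
    # Fraction of image queries whose paired text is ranked first.
    return float(np.mean(ranks[:, 0] == np.arange(len(img))))
```

Calling `demo_recall_at_1()` returns the fraction of held-out image queries whose paired text is ranked first in the common subspace. The closed-form fit (two eigendecompositions and one SVD, with no iterative training) is consistent with the abstract's remark that this family trains efficiently and is easy to implement.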