Journal of Guangxi Normal University (Natural Science Edition), 2022, Vol. 40, Issue 3: 1-12. doi: 10.16088/j.issn.1001-6600.2021071302

Progress of Cross-modal Retrieval Methods Based on Representation Learning

DU Jinfeng, WANG Hairong*, LIANG Huan, WANG Dong   

  1. Department of Computer Science and Engineering, North Minzu University, Yinchuan Ningxia 750021, China
  • Received: 2021-07-13; Revised: 2021-10-09; Online: 2022-05-25; Published: 2022-05-27

Abstract: The rapid growth of multi-modal data has created demand for cross-modal retrieval and motivated research on cross-modal retrieval methods. This paper traces recent progress in the field, reviewing cross-modal retrieval methods based on representation learning in China and abroad. It defines the cross-modal retrieval problem and surveys the common techniques, mainstream models, common datasets, evaluation methods, and main challenges in the field. Methods based on representation learning are introduced from three perspectives: statistical correlation analysis, graph regularization, and metric learning. To analyze their advantages and disadvantages, 14 methods are reproduced on four datasets for comparative evaluation. The experimental results show that methods based on statistical correlation analysis are efficient to train and easy to implement; methods based on graph regularization achieve semantic association by mining inter-modal and intra-modal similarity; and methods based on metric learning preserve, as far as possible, the semantic similarity and dissimilarity information of the data in the common subspace. In summary, this paper presents the research status of cross-modal retrieval methods based on representation learning and provides a reference for further research on such methods.

Key words: multi-modal data, cross-modal retrieval, statistical correlation analysis, graph regularization, metric learning

CLC Number: TP391.3
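
As a rough illustration of the statistical correlation analysis family named in the abstract, the sketch below fits linear canonical correlation analysis (CCA) on two feature views and retrieves across them in the shared subspace. It is not one of the 14 methods reproduced in the paper. The data are synthetic stand-ins for image and text features, and the names `fit_cca`, `rank_by_cosine`, and the regularization value `reg` are assumptions made for this example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def fit_cca(X, Y, k, reg=1e-3):
    """Linear CCA: return view means and k-dimensional projections."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view, then take the SVD of the whitened cross-covariance.
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, _, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt.T[:, :k]

def rank_by_cosine(queries, gallery):
    """For each query row, gallery indices sorted by descending cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)

# Synthetic paired data (assumption): both views are noisy linear maps
# of the same latent vector, standing in for image and text features.
rng = np.random.default_rng(0)
n, k = 400, 5
Z = rng.normal(size=(n, k))
X = Z @ rng.normal(size=(k, 10)) + 0.01 * rng.normal(size=(n, 10))
Y = Z @ rng.normal(size=(k, 8)) + 0.01 * rng.normal(size=(n, 8))

# Fit on the first 300 pairs, evaluate retrieval on the remaining 100.
mx, my, Wx, Wy = fit_cca(X[:300], Y[:300], k)
Px = (X[300:] - mx) @ Wx   # queries from view X in the common subspace
Py = (Y[300:] - my) @ Wy   # gallery from view Y in the common subspace

ranks = rank_by_cosine(Px, Py)
top1 = float(np.mean(ranks[:, 0] == np.arange(100)))
print(f"top-1 accuracy on held-out pairs: {top1:.2f}")
```

The closed-form fit (two eigendecompositions and one SVD) is consistent with the abstract's remark that correlation-analysis methods are efficient and easy to implement. Graph-regularized and metric-learning methods would instead add similarity-preserving terms to a learned objective, which this sketch does not attempt.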