Journal of Guangxi Normal University (Natural Science Edition) ›› 2025, Vol. 43 ›› Issue (5): 145-157. doi: 10.16088/j.issn.1001-6600.2024112901

• Intelligent Information Processing •

Text-based Person Re-identification via Cross-modal Semantic Collaborative Learning

罗赠丽1, 张灿龙1,2*, 李志欣1,2, 王智文3, 韦春荣4   

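  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin, Guangxi 541004;
    2. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin, Guangxi 541004;
    3. School of Computer Science and Communication Engineering, Guangxi University of Science and Technology, Liuzhou, Guangxi 545006;
    4. Teachers College for Vocational and Technical Education, Guangxi Normal University, Guilin, Guangxi 541004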
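  • Received: 2024-11-29 Revised: 2025-03-07 Publication date: 2025-09-05 Release date: 2025-08-05
  • Corresponding author: ZHANG Canlong (张灿龙, born 1975), male, from Shuangfeng, Hunan; professor and doctoral supervisor at Guangxi Normal University, Ph.D. E-mail: zcltyp@163.com
  • Funding:
    Guangxi Key Research and Development Program (2024AB26006); National Natural Science Foundation of China (62266009, 62276073, 62466004); Project of the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing; Guangxi First-class Undergraduate Course Construction Project (202103); Innovation Project of Guangxi Normal University (XYCSR2024097)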

Cross-modal Semantic Collaborative Learning for Text-based Person Re-identification

LUO Zengli1, ZHANG Canlong1,2*, LI Zhixin1,2, WANG Zhiwen3, WEI Chunrong4   

  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin Guangxi 541004, China;
    2. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin Guangxi 541004, China;
    3. School of Electronic Engineering, Guangxi University of Science and Technology, Liuzhou Guangxi 545006, China;
    4. Teachers College for Vocational and Technical Education, Guangxi Normal University, Guilin Guangxi 541004, China
  • Received: 2024-11-29 Revised: 2025-03-07 Online: 2025-09-05 Published: 2025-08-05

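Abstract: Existing text-based person re-identification methods are mainly limited by feature alignment and semantic ambiguity. To address this, this paper proposes a cross-modal semantic collaboration method for person re-identification (CMSC), which learns the semantic information shared by images and text and builds correspondence constraints between local visual features and text to improve image-text matching efficiency. First, a text semantic clustering module is introduced to automatically extract the text information related to local visual semantics, and image self-supervised learning strengthens the semantic expression of local features. Next, a common semantic collaboration module is built to capture the differences and commonalities between an image and its description, establishing a semantically consistent mapping in the embedding space. Finally, a semantic constraint reasoning module is introduced to perform retrieval using the semantic consistency scores between images and text, thereby improving efficiency. Experiments on 3 benchmark datasets show that the method effectively improves model performance: Rank-1 rises by 0.75, 1.43, and 0.88 percentage points over existing methods, and precision rises by 0.64, 2.56, and 3.96 percentage points, respectively.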

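Key words: person re-identification, cross-modal retrieval, semantic clustering, large language model generation, semantic consistency, semantic collaboration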

Abstract: Existing text-based person re-identification methods are limited by feature alignment and semantic ambiguity. To address these challenges, a cross-modal semantic collaboration framework (CMSC) is proposed. Shared semantic information between images and text is learned, and local visual-text correspondence constraints are established to improve the matching efficiency between images and text. Specifically, a text semantic clustering module is introduced to automatically extract text related to local visual semantics, while image self-supervised learning is applied to enhance the learning of local features. A common semantic collaboration module is then built to capture both the differences and commonalities between the image and its description, establishing a semantic consistency mapping in the embedding space. Finally, a semantic constraint reasoning module is incorporated to perform retrieval by combining the semantic consistency scores of images and text, thereby improving retrieval efficiency. Experiments on three benchmark datasets show that the proposed method effectively improves model performance: Rank-1 accuracy rises by 0.75, 1.43, and 0.88 percentage points over existing methods, and precision rises by 0.64, 2.56, and 3.96 percentage points, respectively.

Key words: pedestrian re-identification, cross-modal retrieval, semantic clustering, large language model generation, semantic consistency, semantic collaboration

CLC number: TP391.1
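
The abstract reports its gains in terms of Rank-1. As a reading aid, the following is a minimal sketch of how Rank-k accuracy is commonly computed for text-to-image person retrieval, assuming each text query and gallery image is already embedded as a vector and ranked by cosine similarity. The function names and the toy data are hypothetical and are not taken from the paper; the semantic consistency scoring used by CMSC is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=1):
    """Fraction of text queries whose top-k ranked gallery images
    contain at least one image with the same person identity."""
    hits = 0
    for query, query_id in zip(text_feats, text_ids):
        scores = [cosine(query, img) for img in image_feats]
        # Gallery indices sorted by descending similarity to the query.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(image_ids[i] == query_id for i in order[:k]):
            hits += 1
    return hits / len(text_feats)


# Toy data (hypothetical): 3 text queries, 4 gallery images, 2-D embeddings.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_ids = [0, 1, 2]
image_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2], [0.7, 0.7]]
image_ids = [0, 1, 2, 3]

# The third query ranks a wrong identity first and its true match second,
# so 2 of 3 queries hit at rank 1 and all 3 hit within the top 2.
rank1 = rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=1)
rank2 = rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=2)
```

Under this definition, an improvement of 0.75 percentage points in Rank-1 means that 0.75% more of all text queries have a correct image in the first position.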
