Journal of Guangxi Normal University (Natural Science Edition), 2025, Vol. 43, Issue (5): 145-157. DOI: 10.16088/j.issn.1001-6600.2024112901

• Intelligence Information Processing •

Cross-modal Semantic Collaborative Learning for Text-based Person Re-identification

LUO Zengli1, ZHANG Canlong1,2*, LI Zhixin1,2, WANG Zhiwen3, WEI Chunrong4   

  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin Guangxi 541004, China;
    2. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin Guangxi 541004, China;
    3. School of Electronic Engineering, Guangxi University of Science and Technology, Liuzhou Guangxi 545006, China;
    4. Teachers College for Vocational and Technical Education, Guangxi Normal University, Guilin Guangxi 541004, China
  • Received: 2024-11-29 Revised: 2025-03-07 Online: 2025-09-05 Published: 2025-08-05

Abstract: Existing text-based person re-identification methods are limited by problems of feature alignment and semantic ambiguity. To address these challenges, a cross-modal semantic collaboration framework is proposed. The framework learns semantic information shared between images and text and establishes local visual-text correspondence constraints to improve the matching efficiency between images and text. Specifically, a text semantic clustering module is introduced to automatically extract text related to local visual semantics, while image self-supervised learning is applied to strengthen the learning of local features. A common semantic collaboration module is then built to capture both the differences and the commonalities between an image and its description, establishing a semantic consistency mapping in the embedding space. Finally, a semantic constraint reasoning module performs retrieval by combining the semantic consistency scores of images and text, thereby improving retrieval efficiency. Experiments on three benchmark datasets show that the proposed method effectively improves model performance: Rank-1 improves by 0.75%, 1.43%, and 0.88%, respectively, and precision rises by 0.64%, 2.56%, and 3.96%, respectively.
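For readers unfamiliar with the Rank-1 indicator reported above, the following is a minimal sketch of how a Rank-k score is commonly computed in text-to-image retrieval. It is not the authors' implementation: the function names (`cosine`, `rank_k_accuracy`), the use of plain cosine similarity as the matching score, and the toy embeddings are all assumptions made for illustration. The proposed method instead combines semantic consistency scores of images and text, whose exact form is not given in the abstract.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_k_accuracy(
    text_embs: List[Sequence[float]],
    text_ids: List[int],
    image_embs: List[Sequence[float]],
    image_ids: List[int],
    k: int = 1,
) -> float:
    """Fraction of text queries whose top-k retrieved images contain
    at least one image with the same person identity as the query."""
    if not text_embs:
        raise ValueError("at least one text query is required")
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        # Score every gallery image against this query (assumed: cosine).
        scores = [cosine(query, image) for image in image_embs]
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
        if any(image_ids[i] == query_id for i in order[:k]):
            hits += 1
    return hits / len(text_embs)


# Toy gallery: three images belonging to identities 7, 9, and 9.
gallery = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
gallery_ids = [7, 9, 9]
# Two text queries, both describing identity 7 (hypothetical embeddings).
queries = [[1.0, 0.0], [0.0, 1.0]]
query_ids = [7, 7]

rank1 = rank_k_accuracy(queries, query_ids, gallery, gallery_ids, k=1)
rank3 = rank_k_accuracy(queries, query_ids, gallery, gallery_ids, k=3)
```

In this toy setup the first query retrieves the correct identity at the top position and the second does not, so `rank1` is lower than `rank3`; a reported Rank-1 gain means more queries have the correct person ranked first.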

Key words: pedestrian re-identification, cross-modal retrieval, semantic clustering, large language generation, semantic consistency, semantic collaboration

CLC Number:  TP391.1