Multi-resolution Feature Grounding for Cross-Modal Person Retrieval

doi:10.16088/j.issn.1001-6600.2024121902

Abstract

Text-to-image person retrieval overcomes limitations of traditional image-based methods and is emerging as a new paradigm in smart city development. However, long-distance imaging and complex backgrounds in surveillance scenarios cause scale inconsistency and feature contamination, which degrade retrieval performance. This paper proposes a cross-modal person retrieval approach based on multi-resolution feature grounding, which addresses detail loss and background interference by integrating multi-scale image feature representations with semantic segmentation boundary information. Two key innovations are introduced: (1) a multi-resolution input scheme that processes both low-resolution global features and high-resolution local features; and (2) a semantic segmentation-based boundary grounding strategy that segments pedestrian contours to suppress background interference. The method achieves Rank-1 accuracies of 70.58%, 60.88%, and 55.24% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively. Compared with recent methods, it shows a notable performance advantage on the cross-modal text-to-image person retrieval task.
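As a reading aid for the Rank-1 figures above, the following minimal sketch shows how Rank-1 accuracy is commonly computed for text-to-image retrieval: each text query is matched against a gallery of image features, and a query counts as a hit when its top-ranked image shares its identity label. The use of cosine similarity, the function names, and the toy feature vectors are all assumptions made for illustration; the abstract does not specify the paper's similarity measure or evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank1_accuracy(text_feats, text_ids, image_feats, image_ids):
    """Fraction of text queries whose most similar gallery image has the same identity."""
    hits = 0
    for query, query_id in zip(text_feats, text_ids):
        # Index of the gallery image with the highest similarity to this query.
        best = max(range(len(image_feats)),
                   key=lambda i: cosine(query, image_feats[i]))
        if image_ids[best] == query_id:
            hits += 1
    return hits / len(text_feats)


# Hypothetical toy data: three gallery images and three text queries in 2-D.
gallery_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_ids = [0, 1, 2]
query_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
query_ids = [0, 1, 2]

score = rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids)
print(f"Rank-1 accuracy: {score:.4f}")
```

In this toy setup the third query is closest to the first gallery image rather than to the image with its own identity, so two of the three queries are hits.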

Key words: multi-resolution, boundary grounding, cross-modal, person retrieval, person re-identification

CLC Number: TP391.41

XIE Sheng, MA Haifei, ZHANG Canlong, WANG Zhiwen, WEI Chunrong. Multi-resolution Feature Grounding for Cross-Modal Person Retrieval[J]. Journal of Guangxi Normal University (Natural Science Edition), 2025, 43(6): 69-79.
