Journal of Guangxi Normal University (Natural Science Edition) ›› 2025, Vol. 43 ›› Issue (6): 69-79. doi: 10.16088/j.issn.1001-6600.2024121902

• Intelligent Information Processing •

  • Corresponding author: ZHANG Canlong (1975—), male, from Shuangfeng, Hunan, professor and doctoral supervisor at Guangxi Normal University. E-mail: zcltyp@163.com
  • Supported by:
    National Natural Science Foundation of China (62266009, 62276073, 62466004); Natural Science Foundation of Guangxi (2018GXNSFDA281009); Guangxi First-Class Undergraduate Course Construction Project (202103)

Multi-resolution Feature Grounding for Cross-Modal Person Retrieval

XIE Sheng1, MA Haifei1, ZHANG Canlong1,2*, WANG Zhiwen3, WEI Chunrong4   

  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin Guangxi 541004, China;
    2. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin Guangxi 541004, China;
    3. School of Electronic Engineering, Guangxi University of Science and Technology, Liuzhou Guangxi 545006, China;
    4. Teachers College for Vocational and Technical Education, Guangxi Normal University, Guilin Guangxi 541004, China
  • Received: 2024-12-19 Revised: 2025-04-18 Published: 2025-11-19



Abstract: Text-to-image person retrieval, which overcomes the limitations of traditional image-to-image methods, has emerged as an innovative retrieval paradigm in smart city development. However, long-distance imaging and complex background interference in surveillance scenarios cause scale inconsistency and feature contamination in pedestrian features, hindering retrieval performance. This paper proposes a cross-modal person retrieval approach based on multi-resolution feature grounding, which addresses detail loss and background interference by integrating multi-scale image feature representations with semantic segmentation boundary information. Two key innovations are introduced: 1) a multi-resolution input scheme that jointly processes low-resolution global features and high-resolution local features; 2) a semantic segmentation-based boundary grounding strategy that precisely segments pedestrian contours to suppress background interference. The proposed method achieves Rank-1 accuracies of 70.58%, 60.88%, and 55.24% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively, showing a clear performance advantage over recent methods in cross-modal text-to-image person retrieval.
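The Rank-1 metric reported in the abstract can be made concrete with a minimal sketch. Everything below is illustrative and not the paper's actual model: the embeddings are synthetic placeholders for text and image features, `rank_k_accuracy` implements the standard retrieval protocol (a query scores a hit if any of its top-k gallery images shares the query's person identity), and `suppress_background` merely illustrates the idea of zeroing out pixels outside a segmented pedestrian contour.

```python
import numpy as np


def rank_k_accuracy(text_emb, image_emb, gallery_ids, query_ids, k=1):
    """Rank-k accuracy for text-to-image retrieval.

    Each text query ranks the image gallery by cosine similarity;
    a query counts as a hit if any of its top-k images carries the
    same person identity as the query.
    """
    # L2-normalise so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ g.T                        # (num_queries, num_gallery)
    # Gallery indices of the k most similar images per query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (gallery_ids[topk] == query_ids[:, None]).any(axis=1)
    return hits.mean()


def suppress_background(image, person_mask):
    """Zero out pixels outside the person contour (illustration only).

    image: (H, W, C) array; person_mask: (H, W) binary array.
    """
    return image * person_mask[..., None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(5, 8))     # 5 gallery image embeddings
    # Two text queries: near-copies of gallery entries 0 and 3.
    queries = gallery[[0, 3]] + 0.01 * rng.normal(size=(2, 8))
    gallery_ids = np.array([10, 11, 12, 13, 14])
    query_ids = np.array([10, 13])
    print("Rank-1:", rank_k_accuracy(queries, gallery, gallery_ids, query_ids, k=1))
```

Note that this sketch evaluates a single-gallery protocol; the benchmark numbers in the abstract come from the standard CUHK-PEDES/ICFG-PEDES/RSTPReid splits, whose galleries contain many images per identity.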

Key words: multi-resolution, boundary grounding, cross-modal, person retrieval, person re-identification

CLC number: TP391.41

[1] LI X B. Research on unsupervised person re-identification methods[D]. Beijing: Beijing Jiaotong University, 2023. DOI: 10.26944/d.cnki.gbfju.2023.003631.
[2] LUO H, JIANG W, FAN X, et al. A survey on deep learning based person re-identification[J]. Acta Automatica Sinica, 2019, 45(11): 2032-2049. DOI: 10.16383/j.aas.c180154.
[3] LI S, XIAO T, LI H S, et al. Person search with natural language description[C] //2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 5187-5196. DOI: 10.1109/CVPR.2017.551.
[4] CHEN H B. Deep learning based person re-identification in cross-domain and cross-modal complex scenes[D]. Shenzhen: University of Chinese Academy of Sciences (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences), 2022. DOI: 10.27822/d.cnki.gszxj.2022.000164.
[5] ZHANG Y, LU H C. Deep cross-modal projection learning for image-text matching[C] //Computer Vision-ECCV 2018: LNCS Volume 11205. Cham: Springer Nature Switzerland AG, 2018: 707-723. DOI: 10.1007/978-3-030-01246-5_42.
[6] SARAFIANOS N, XU X, KAKADIARIS I. Adversarial representation learning for text-to-image matching[C] //2019 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2019: 5813-5823. DOI: 10.1109/iccv.2019.00591.
[7] CHENG Y, WANG H Y, LIU X K. Pose-guided neural network with hybrid representation for person re-identification[C] //2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP). Piscataway, NJ: IEEE, 2019: 1-6. DOI: 10.1109/icsidp47821.2019.9173449.
[8] LUO H. Research on deep learning based person re-identification: from non-occlusion to occlusion[D]. Hangzhou: Zhejiang University, 2020. DOI: 10.27461/d.cnki.gzjdx.2020.001378.
[9] KUANG C, CHEN Y. Person re-identification based on multi-granularity feature fusion network[J]. Acta Electronica Sinica, 2021, 49(8): 1541-1550. DOI: 10.12263/DZXB.20200974.
[10] NIU K, HUANG Y, OUYANG W L, et al. Improving description-based person re-identification by multi-granularity image-text alignments[J]. IEEE Transactions on Image Processing, 2020, 29: 5542-5556. DOI: 10.1109/TIP.2020.2984883.
[11] WANG C J, LUO Z M, LIN Y J, et al. Text-based person search via multi-granularity embedding learning[C] //Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21). Montreal: IJCAI, 2021: 1068-1074. DOI: 10.24963/ijcai.2021/148.
[12] WANG Z, FANG Z Y, WANG J, et al. ViTAA: visual-textual attributes alignment in person search by natural language[C] //Computer Vision-ECCV 2020: LNCS Volume 12357. Cham: Springer Nature Switzerland AG, 2020: 402-420. DOI: 10.1007/978-3-030-58610-2_24.
[13] GAO C Y, CAI G Y, JIANG X Y, et al. Contextual non-local alignment over full-scale representation for text-based person search[EB/OL]. (2021-01-08)[2024-12-19]. https://arxiv.org/abs/2101.03036. DOI: 10.48550/arXiv.2101.03036.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C] //Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[15] LIU Q, HE X H, TENG Q Z, et al. BDNet: a BERT-based dual-path network for text-to-image cross-modal person re-identification[J]. Pattern Recognition, 2023, 141: 109636. DOI: 10.1016/j.patcog.2023.109636.
[16] KE X, LIU H, XU P R, et al. Text-based person search via cross-modal alignment learning[J]. Pattern Recognition, 2024, 152: 110481. DOI: 10.1016/j.patcog.2024.110481.
[17] WANG Z J, ZHU A C, XUE J Y, et al. CAIBC: capturing all-round information beyond color for text-based person retrieval[C] //MM’22: Proceedings of the 30th ACM International Conference on Multimedia. New York, NY: Association for Computing Machinery, 2022: 5314-5322. DOI: 10.1145/3503161.3548057.
[18] LYU J Q. Research on key technologies of pedestrian detection and tracking in video[D]. Shanghai: Shanghai Jiao Tong University, 2013.
[19] ZHENG K C. Research on key technologies of person re-identification in open scenarios[D]. Hefei: University of Science and Technology of China, 2022. DOI: 10.27517/d.cnki.gzkju.2022.000618.
[20] LIU Z G, HUANG C, XIE D J, et al. Person re-identification method for suppressing background interference[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(4): 563-569. DOI: 10.3724/SP.J.1089.2022.18927.
[21] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C] //Proceedings of the 38th International Conference on Machine Learning: PMLR 139. Cambridge, MA: JMLR, 2021: 8748-8763.
[22] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03)[2024-12-19]. https://arxiv.org/abs/2010.11929. DOI: 10.48550/arXiv.2010.11929.
[23] OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. (2024-02-02)[2024-12-19]. https://arxiv.org/abs/2304.07193. DOI: 10.48550/arXiv.2304.07193.
[24] YANG J, ZHANG C L, LI Z X, et al. Occluded person re-identification integrating spatial attention and pose estimation[J]. Journal of Computer Research and Development, 2022, 59(7): 1522-1532. DOI: 10.7544/issn1000-1239.20200949.
[25] LI F, ZHANG H, SUN P Z, et al. Semantic-SAM: segment and recognize anything at any granularity[EB/OL]. (2023-07-10)[2024-12-19]. https://arxiv.org/abs/2307.04767. DOI: 10.48550/arXiv.2307.04767.
[26] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C] //Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. DOI: 10.18653/v1/N19-1423.
[27] DING Z F, DING C X, SHAO Z Y, et al. Semantically self-aligned network for text-to-image part-aware person re-identification[EB/OL]. (2021-08-09)[2024-12-19]. https://arxiv.org/abs/2107.12666. DOI: 10.48550/arXiv.2107.12666.
[28] ZHU A C, WANG Z J, LI Y F, et al. DSSL: deep surroundings-person separation learning for text-based person retrieval[C] //MM’21: Proceedings of the 29th ACM International Conference on Multimedia. New York, NY: Association for Computing Machinery, 2021: 209-217. DOI: 10.1145/3474085.3475369.
[29] CHEN C Q, YE M, JIANG D. Towards modality-agnostic person re-identification with descriptive query[C] //2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2023: 15128-15137. DOI: 10.1109/CVPR52729.2023.01452.
[30] WANG Z J, ZHU A C, XUE J Y, et al. SUM: serialized updating and matching for text-based person retrieval[J]. Knowledge-Based Systems, 2022, 248: 108891. DOI: 10.1016/j.knosys.2022.108891.
[31] SHAO Z Y, ZHANG X Y, DING C X, et al. Unified pre-training with pseudo texts for text-to-image person re-identification[C] //2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2023: 11140-11150. DOI: 10.1109/ICCV51070.2023.01026.
[32] YOO J, AHN N, SOHN K A. Rethinking data augmentation for image super-resolution: a comprehensive analysis and a new strategy[C] //2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2020: 8372-8381. DOI: 10.1109/CVPR42600.2020.00840.
[33] WEI Y X, GU S H, LI Y W, et al. Unsupervised real-world image super resolution via domain-distance aware training[C] //2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2021: 13380-13389. DOI: 10.1109/CVPR46437.2021.01318.
[34] CHEN Z, ZHANG Y L, GU J J, et al. Dual aggregation transformer for image super-resolution[C] //2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2023: 12278-12287. DOI: 10.1109/ICCV51070.2023.01131.