Journal of Guangxi Normal University (Natural Science Edition) ›› 2025, Vol. 43 ›› Issue (4): 69-82. DOI: 10.16088/j.issn.1001-6600.2024102502

• Intelligent Information Processing •

Fine-grained Image Classification Combining Adaptive Spatial Mutual Attention and Feature Pair Integration Discrimination

LI Zhixin1,2*, KUANG Wenlan1,2

  1. Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education (Guangxi Normal University), Guilin, Guangxi 541004, China;
  2. Guangxi Key Lab of Multi-source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2024-10-25  Revised: 2025-01-15  Online: 2025-07-05  Published: 2025-07-14
  • Corresponding author: LI Zhixin (1971—), male, from Guilin, Guangxi; professor at Guangxi Normal University, Ph.D. E-mail: lizx@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (62276073); Guangxi "Bagui Scholar" Program Special Fund

Abstract: Fine-grained images exhibit small inter-class differences and large intra-class variations, and many studies have utilized the Vision Transformer to mine critical region features and improve the accuracy of fine-grained image classification. However, two major problems remain. First, background regions are also considered when the network mines critical classification cues, introducing additional noise into the model. Second, the local embedding features of the input image lack spatial connections, so the model lacks the ability to perceive object structure, leading to inaccurate category features. To address these problems, this paper proposes two modules: an adaptive spatial mutual attention module and a feature pair integration discrimination module. The adaptive spatial mutual attention module first learns mutual attention enhancement weights across different embedding layers to select more discriminative regions, and then adaptively learns the adjacency relations among the regions through a graph convolutional network. The feature pair integration discrimination module then considers cue interactions between image pairs to reduce confusion between fine-grained images, and the final prediction is derived under a token feature enhancement strategy. The proposed method achieves accuracies of 92.5%, 93.3% and 91.8% on three benchmark datasets, namely CUB-200-2011, Stanford Dogs and NABirds, outperforming many existing state-of-the-art methods.
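
To make the first module concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above: mutual attention weights computed between two embedding layers score and select discriminative patch tokens, and a graph convolution with an adaptively learned adjacency models the spatial relations among the selected regions. All names and design details (MutualAttentionRegionSelect, AdaptiveGCN, top_k, the similarity-based adjacency) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualAttentionRegionSelect(nn.Module):
    """Score patch tokens by cross-layer agreement and keep the top-k regions."""

    def __init__(self, dim: int, top_k: int = 12):
        super().__init__()
        self.top_k = top_k
        self.proj_a = nn.Linear(dim, dim)
        self.proj_b = nn.Linear(dim, dim)

    def forward(self, layer_a: torch.Tensor, layer_b: torch.Tensor):
        # layer_a, layer_b: (B, N, D) patch tokens taken from two ViT layers.
        a = F.normalize(self.proj_a(layer_a), dim=-1)
        b = F.normalize(self.proj_b(layer_b), dim=-1)
        # Mutual attention weight per token: agreement between the two layers;
        # background patches the layers disagree on receive low weight.
        weights = torch.einsum("bnd,bnd->bn", a, b).softmax(dim=-1)  # (B, N)
        idx = weights.topk(self.top_k, dim=-1).indices               # (B, k)
        selected = torch.gather(
            layer_b, 1, idx.unsqueeze(-1).expand(-1, -1, layer_b.size(-1)))
        return selected, weights


class AdaptiveGCN(nn.Module):
    """One graph convolution whose adjacency is learned from token similarity."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, k, D) selected discriminative regions.
        adj = (tokens @ tokens.transpose(1, 2)).softmax(dim=-1)  # (B, k, k)
        # Message passing over the learned adjacency, plus a residual path,
        # injects structural relations between the regions.
        return F.relu(self.fc(adj @ tokens)) + tokens


# Quick shape check with ViT-B/16-like sizes (assumed, for illustration):
if __name__ == "__main__":
    sel = MutualAttentionRegionSelect(dim=768, top_k=12)
    gcn = AdaptiveGCN(dim=768)
    a = torch.randn(2, 196, 768)  # patch tokens from an earlier layer
    b = torch.randn(2, 196, 768)  # patch tokens from a later layer
    regions, _ = sel(a, b)
    print(gcn(regions).shape)     # torch.Size([2, 12, 768])
```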

Key words: fine-grained image classification, adaptive spatial mutual attention, feature pair integration discrimination, graph convolutional network, token feature enhancement
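
For the second module, the sketch below illustrates one plausible reading of pairwise cue interaction in the spirit of attentive pairwise interaction (reference [34]): pooled features of an image pair exchange information through a shared mutual vector and sigmoid gates before classification. The gating form and all names are assumptions for illustration only, not the authors' method.

```python
import torch
import torch.nn as nn


class PairIntegrationDiscriminator(nn.Module):
    """Exchange cues between the pooled features of an image pair via gating."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.mutual = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # x1, x2: (B, D) pooled features of two easily confused images.
        m = self.mutual(torch.cat([x1, x2], dim=-1))  # shared mutual cue vector
        g1 = torch.sigmoid(m * x1)                    # cues the pair highlights in x1
        g2 = torch.sigmoid(m * x2)
        # Each image is classified from its own gated view and from the view
        # induced by its partner, encouraging contrastive cue learning.
        views = (x1 + g1 * x1, x1 + g2 * x1, x2 + g2 * x2, x2 + g1 * x2)
        return [self.classifier(v) for v in views]
```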

CLC number: TP391.41

[1] LI Z X, ZHANG J, WU J L, et al. Image semantic segmentation based on semi-supervised adversarial learning[J]. Journal of Image and Graphics, 2022, 27(7): 2157-2170. DOI: 10.11834/jig.200600.
[2] CAO J L, LI Y L, SUN H Q, et al. A survey of visual object detection based on deep learning[J]. Journal of Image and Graphics, 2022, 27(6): 1697-1722. DOI: 10.11834/jig.220069.
[3] LIU Y, PANG Y L, ZHANG W D, et al. Image classification based on active learning: current status and future[J]. Acta Electronica Sinica, 2023, 51(10): 2960-2984. DOI: 10.12263/DZXB.20230397.
[4] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03)[2024-10-25]. https://arxiv.org/abs/2010.11929. DOI: 10.48550/arXiv.2010.11929.
[5] LI Z X, HOU C W, XIE X M. Unsupervised cross-modal hashing retrieval fusing multiple instance relations[J]. Journal of Software, 2023, 34(11): 4973-4988. DOI: 10.13328/j.cnki.jos.006742.
[6] ZHUO Y Q, WEI J H, LI Z X. Image caption generation method based on dual attention model[J]. Acta Electronica Sinica, 2022, 50(5): 1123-1130. DOI: 10.12263/DZXB.20210696.
[7] LI Z X, SU Q. Image caption generation based on knowledge assistance[J]. Journal of Guangxi Normal University (Natural Science Edition), 2022, 40(5): 418-432. DOI: 10.16088/j.issn.1001-6600.2022013101.
[8] XIANG J W, CHEN M R, YANG B B. Fine-grained image classification combining Swin and multi-scale feature fusion[J]. Computer Engineering and Applications, 2023, 59(20): 147-157. DOI: 10.3778/j.issn.1002-8331.2211-0456.
[9] HU Y Q, JIN X, ZHANG Y, et al. RAMS-trans: recurrent attention multi-scale transformer for fine-grained image recognition[C]// Proceedings of the 29th ACM International Conference on Multimedia. New York, NY: Association for Computing Machinery, 2021: 4239-4248. DOI: 10.1145/3474085.3475561.
[10] XU Q, WANG J H, JIANG B, et al. Fine-grained visual classification via internal ensemble learning transformer[J]. IEEE Transactions on Multimedia, 2023, 25: 9015-9028. DOI: 10.1109/TMM.2023.3244340.
[11] KE X, CAI Y H, CHEN B T, et al. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification[J]. Pattern Recognition, 2023, 137: 109305. DOI: 10.1016/j.patcog.2023.109305.
[12] ZHENG S J, WANG G C, YUAN Y J, et al. Fine-grained image classification based on TinyVit object location and graph convolution network[J]. Journal of Visual Communication and Image Representation, 2024, 100: 104120. DOI: 10.1016/j.jvcir.2024.104120.
[13] XIE J J, ZHONG Y J, ZHANG J G, et al. A weakly supervised spatial group attention network for fine-grained visual recognition[J]. Applied Intelligence, 2023, 53(20): 23301-23315. DOI: 10.1007/s10489-023-04627-z.
[14] HE X J, LIN J F. Fine-grained few-shot learning fusing weakly supervised object localization[J]. Journal of Image and Graphics, 2022, 27(7): 2226-2239. DOI: 10.11834/jig.200849.
[15] HUANG C, ZENG Z G, ZHU W Q, et al. Fine-grained image recognition based on weakly supervised multi-attention fusion network[J]. Modern Information Technology, 2022, 6(21): 78-82, 87. DOI: 10.19850/j.cnki.2096-4706.2022.21.019.
[16] GAO Y, HAN X T, WANG X, et al. Channel interaction networks for fine-grained image categorization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10818-10825. DOI: 10.1609/aaai.v34i07.6712.
[17] ZHU Q X, KUANG W L, LI Z X. A collaborative gated attention network for fine-grained visual classification[J]. Displays, 2023, 79: 102468. DOI: 10.1016/j.displa.2023.102468.
[18] WANG Q, WANG J J, DENG H Y, et al. AA-trans: core attention aggregating transformer with information entropy selector for fine-grained visual classification[J]. Pattern Recognition, 2023, 140: 109547. DOI: 10.1016/j.patcog.2023.109547.
[19] XU Y, WU S S, WANG B Q, et al. Two-stage fine-grained image classification model based on multi-granularity feature fusion[J]. Pattern Recognition, 2024, 146: 110042. DOI: 10.1016/j.patcog.2023.110042.
[20] WANG Z Q, LI Y, ZHANG R, et al. A survey of few-shot SAR image classification methods[J]. Journal of Image and Graphics, 2024, 29(7): 1902-1920. DOI: 10.11834/jig.230359.
[21] YANG C G, CHEN L M, ZHAO E H, et al. Image classification method based on graph representation knowledge distillation[J]. Acta Electronica Sinica, 2024, 52(10): 3435-3447. DOI: 10.12263/DZXB.20230976.
[22] SONG Y, WANG Y. Image classification with multi-stage attention capsule network[J]. Acta Automatica Sinica, 2024, 50(9): 1804-1817. DOI: 10.16383/j.aas.c210012.
[23] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2021: 22-31. DOI: 10.1109/ICCV48922.2021.00009.
[24] SHAO R, BI X J, CHEN Z. Hybrid ViT-CNN network for fine-grained image classification[J]. IEEE Signal Processing Letters, 2024, 31: 1109-1113. DOI: 10.1109/LSP.2024.3386112.
[25] HE J, CHEN J N, LIU S, et al. TransFG: a transformer architecture for fine-grained recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(1): 852-860. DOI: 10.1609/aaai.v36i1.19967.
[26] DU R Y, XIE J Y, MA Z Y, et al. Progressive learning of category-consistent multi-granularity features for fine-grained visual classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9521-9535. DOI: 10.1109/TPAMI.2021.3126668.
[27] CHEN T H, LI Y Y, QIAO Q H. Fine-grained bird image classification based on counterfactual method of vision transformer model[J]. The Journal of Supercomputing, 2024, 80(5): 6221-6239. DOI: 10.1007/s11227-023-05701-6.
[28] JI R Y, LI J Y, ZHANG L B, et al. Dual transformer with multi-grained assembly for fine-grained visual classification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5009-5021. DOI: 10.1109/TCSVT.2023.3248791.
[29] CHEN H Z, ZHANG H M, LIU C, et al. FET-FGVC: feature-enhanced transformer for fine-grained visual classification[J]. Pattern Recognition, 2024, 149: 110265. DOI: 10.1016/j.patcog.2024.110265.
[30] SERRANO S, SMITH N A. Is attention interpretable?[EB/OL]. (2019-06-09)[2024-10-25]. https://arxiv.org/abs/1906.03731. DOI: 10.48550/arXiv.1906.03731.
[31] ABNAR S, ZUIDEMA W. Quantifying attention flow in transformers[EB/OL]. (2020-05-31)[2024-10-25]. https://arxiv.org/abs/2005.00928. DOI: 10.48550/arXiv.2005.00928.
[32] ZHOU M H, BAI Y L, ZHANG W, et al. Look-into-object: self-supervised structure modeling for object recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2020: 11771-11780. DOI: 10.1109/CVPR42600.2020.01179.
[33] WANG J G, LI J, YAU W Y, et al. Boosting dense SIFT descriptors and shape contexts of face images for gender recognition[C]// 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. Los Alamitos, CA: IEEE Computer Society, 2010: 96-102. DOI: 10.1109/CVPRW.2010.5543238.
[34] ZHUANG P Q, WANG Y L, QIAO Y. Learning attentive pairwise interaction for fine-grained classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 13130-13137. DOI: 10.1609/aaai.v34i07.7016.
[35] WAH C, BRANSON S, WELINDER P, et al. Caltech-UCSD birds-200-2011 (CUB-200-2011): CNS-TR-2011-001[DS/OL]. (2011-07-30)[2024-10-25]. https://www.vision.caltech.edu/datasets/cub_200_2011/.
[36] KHOSLA A, JAYADEVAPRAKASH N, YAO B P, et al. Stanford dogs dataset[DS/OL]. (2012-11-21)[2024-10-25]. http://vision.stanford.edu/aditya86/ImageNetDogs/main.html.
[37] VAN HORN G, BRANSON S, FARRELL R, et al. Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2015: 595-604. DOI: 10.1109/CVPR.2015.7298658.
[38] LUO W, YANG X T, MO X J, et al. Cross-X learning for fine-grained visual categorization[C]// IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2019: 8241-8250. DOI: 10.1109/ICCV.2019.00833.
[39] LIANG Y Z, ZHU L C, WANG X H, et al. Penalizing the hard example but not too much: a strong baseline for fine-grained visual classification[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(5): 7048-7059. DOI: 10.1109/TNNLS.2022.3213563.
[40] HUANG S L, WANG X C, TAO D C. Stochastic partial swap: enhanced model generalization and interpretability for fine-grained recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2021: 600-609. DOI: 10.1109/ICCV48922.2021.00066.
[41] ZHANG L B, HUANG S L, LIU W. Learning sequentially diversified representations for fine-grained categorization[J]. Pattern Recognition, 2022, 121: 108219. DOI: 10.1016/j.patcog.2021.108219.
[42] ZHU Q X, LI Z X, KUANG W L, et al. A multichannel location-aware interaction network for visual classification[J]. Applied Intelligence, 2023, 53(20): 23049-23066. DOI: 10.1007/s10489-023-04734-x.
[43] PU Y F, HAN Y Z, WANG Y L, et al. Fine-grained recognition with learnable semantic data augmentation[J]. IEEE Transactions on Image Processing, 2024, 33: 3130-3144. DOI: 10.1109/TIP.2024.3364500.
[44] HU X B, ZHU S N, PENG T L. Hierarchical attention vision transformer for fine-grained visual classification[J]. Journal of Visual Communication and Image Representation, 2023, 91: 103755. DOI: 10.1016/j.jvcir.2023.103755.
[45] LIU X D, WANG L L, HAN X G. Transformer with peak suppression and knowledge guidance for fine-grained image recognition[J]. Neurocomputing, 2022, 492: 137-149. DOI: 10.1016/j.neucom.2022.04.037.
[46] YE S, YU S J, WANG Y, et al. R2-trans: fine-grained visual categorization with redundancy reduction[J]. Image and Vision Computing, 2024, 143: 104923. DOI: 10.1016/j.imavis.2024.104923.
[47] ZHANG Z C, CHEN Z D, WANG Y X, et al. A vision transformer for fine-grained classification by reducing noise and enhancing discriminative information[J]. Pattern Recognition, 2024, 145: 109979. DOI: 10.1016/j.patcog.2023.109979.