广西师范大学学报(自然科学版) ›› 2022, Vol. 40 ›› Issue (5): 418-432. DOI: 10.16088/j.issn.1001-6600.2022013101
李志欣*, 苏强
LI Zhixin*, SU Qiang
Abstract: Automatically generating descriptions that accord with human perception for a given image is one of the important tasks in artificial intelligence. Most existing attention-based methods explore the mapping between words in the sentence and regions in the image, but this hard-to-predict matching sometimes produces inconsistent correspondences between the two modalities, degrading the quality of the generated captions. To address this problem, this paper proposes a text-conditional word attention that improves the correctness of visual attention. This word attention emphasizes the importance of different words while the model sequentially generates the caption, and makes full use of the internal annotation knowledge in the training data to help compute visual attention. In addition, to reveal implicit information in an image that a machine cannot express directly, knowledge extracted from an external knowledge graph is injected into the encoder-decoder architecture, yielding more novel and natural image captions. Experiments on the MSCOCO and Flickr30k image captioning benchmark datasets show that the proposed method achieves good performance and outperforms many existing state-of-the-art methods.
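To make the core mechanism concrete, the following is a minimal PyTorch sketch of how a text-conditional word attention over previously generated words can guide visual attention over image regions at each decoding step. This is not the authors' implementation: every class, variable, and dimension name here (WordGuidedVisualAttention, region_dim, attn_dim, etc.) is an illustrative assumption, and the knowledge-graph injection step described in the abstract is omitted for brevity.

```python
# Hypothetical sketch, not the paper's released code. It illustrates the idea of
# word attention modulating visual attention in an encoder-decoder captioner.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedVisualAttention(nn.Module):
    def __init__(self, region_dim, word_dim, hidden_dim, attn_dim):
        super().__init__()
        # Word attention: scores each already-generated word against the
        # current decoder hidden state.
        self.word_proj = nn.Linear(word_dim, attn_dim)
        self.hidden_proj_w = nn.Linear(hidden_dim, attn_dim)
        self.word_score = nn.Linear(attn_dim, 1)
        # Visual attention: scores each image region, conditioned on both the
        # hidden state and the attended word context.
        self.region_proj = nn.Linear(region_dim, attn_dim)
        self.hidden_proj_v = nn.Linear(hidden_dim, attn_dim)
        self.context_proj = nn.Linear(word_dim, attn_dim)
        self.region_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, word_embs, hidden):
        # regions:   (B, R, region_dim)  image region features (e.g. from a CNN)
        # word_embs: (B, T, word_dim)    embeddings of words generated so far
        # hidden:    (B, hidden_dim)     current decoder (LSTM) hidden state

        # 1) Word attention: which previously generated words matter now?
        w = torch.tanh(self.word_proj(word_embs)
                       + self.hidden_proj_w(hidden).unsqueeze(1))
        alpha_w = F.softmax(self.word_score(w).squeeze(-1), dim=1)        # (B, T)
        word_ctx = torch.bmm(alpha_w.unsqueeze(1), word_embs).squeeze(1)  # (B, word_dim)

        # 2) Visual attention, guided by the word context, so the attended
        #    regions stay consistent with the words being generated.
        v = torch.tanh(self.region_proj(regions)
                       + self.hidden_proj_v(hidden).unsqueeze(1)
                       + self.context_proj(word_ctx).unsqueeze(1))
        alpha_v = F.softmax(self.region_score(v).squeeze(-1), dim=1)      # (B, R)
        vis_ctx = torch.bmm(alpha_v.unsqueeze(1), regions).squeeze(1)     # (B, region_dim)
        return vis_ctx, alpha_v, alpha_w

# Smoke test with random tensors (2 images, 36 regions, 5 generated words).
if __name__ == "__main__":
    attn = WordGuidedVisualAttention(region_dim=2048, word_dim=300,
                                     hidden_dim=512, attn_dim=256)
    vis_ctx, a_v, a_w = attn(torch.randn(2, 36, 2048),
                             torch.randn(2, 5, 300),
                             torch.randn(2, 512))
    print(vis_ctx.shape, a_v.shape, a_w.shape)  # (2, 2048) (2, 36) (2, 5)
```

In this sketch the attended word context enters the visual-attention scoring as one extra additive term; that is one simple way, under the stated assumptions, to let the words already generated constrain which image regions are attended next.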