Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (5): 418-432. DOI: 10.16088/j.issn.1001-6600.2022013101

• Research Article •

Knowledge-aided Image Captioning

LI Zhixin*, SU Qiang

  1. Guangxi Key Lab of Multi-source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2022-01-31  Revised: 2022-04-15  Online: 2022-09-25  Published: 2022-10-18
  • Corresponding author: LI Zhixin (1971—), male, born in Guilin, Guangxi; professor and doctoral supervisor at Guangxi Normal University. E-mail: lizx@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (61966004, 61866004); Natural Science Foundation of Guangxi (2019GXNSFDA245018); Special Fund of the Guangxi "Bagui Scholar" Program

Abstract: Automatically generating a human-like description for a given image is one of the most important tasks in artificial intelligence. Most existing attention-based methods explore the mapping between words in the sentence and regions in the image; however, this unpredictable matching sometimes causes inharmonious alignments between the two modalities and reduces the quality of the generated captions. To address this problem, a text-related word attention is proposed to improve the correctness of visual attention while the model generates the description word by word. This special word attention emphasizes the importance of different words when the model attends to different regions of the input image, and it makes full use of the internal annotation knowledge in the training data to assist the computation of visual attention. Furthermore, to reveal implied information in the image that cannot be expressed straightforwardly by machines and to generate more novel and natural captions, external knowledge extracted from knowledge graphs is injected into the encoder-decoder framework. The method is validated on two image captioning benchmarks, the MSCOCO and Flickr30k datasets, and the experimental results demonstrate that it achieves good performance and outperforms many state-of-the-art approaches.
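
The word attention described above is computed from the annotation text and used to guide visual attention over image regions. Below is a minimal sketch, in PyTorch, of how such a text-related word attention could be fused with standard visual attention inside one decoding step. It is an illustrative assumption rather than the authors' released implementation, and every module, dimension, and variable name is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAidedVisualAttention(nn.Module):
    """Sketch: word attention over annotation words guides visual attention over regions."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, attn_dim):
        super().__init__()
        # word attention: scores each annotation word against the decoder state
        self.word_proj = nn.Linear(embed_dim, attn_dim)
        self.word_query = nn.Linear(hidden_dim, attn_dim)
        self.word_score = nn.Linear(attn_dim, 1)
        # visual attention: scores each image region, conditioned on both the
        # decoder state and the word-attention context
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.vis_query = nn.Linear(hidden_dim + embed_dim, attn_dim)
        self.vis_score = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, word_embeds, hidden):
        # region_feats: (B, R, feat_dim) image region features from the encoder
        # word_embeds:  (B, W, embed_dim) embeddings of internal annotation words
        # hidden:       (B, hidden_dim)   current decoder hidden state
        w = torch.tanh(self.word_proj(word_embeds) + self.word_query(hidden).unsqueeze(1))
        word_alpha = F.softmax(self.word_score(w).squeeze(-1), dim=1)        # (B, W)
        word_ctx = (word_alpha.unsqueeze(-1) * word_embeds).sum(dim=1)       # (B, embed_dim)

        # the word-attention context conditions the visual attention query
        query = torch.cat([hidden, word_ctx], dim=-1)
        v = torch.tanh(self.feat_proj(region_feats) + self.vis_query(query).unsqueeze(1))
        vis_alpha = F.softmax(self.vis_score(v).squeeze(-1), dim=1)          # (B, R)
        vis_ctx = (vis_alpha.unsqueeze(-1) * region_feats).sum(dim=1)        # (B, feat_dim)
        return vis_ctx, vis_alpha, word_alpha

# usage with random tensors: 2 images, 36 regions, 10 annotation words
attn = WordAidedVisualAttention(feat_dim=2048, embed_dim=300, hidden_dim=512, attn_dim=512)
ctx, vis_a, word_a = attn(torch.randn(2, 36, 2048), torch.randn(2, 10, 300), torch.randn(2, 512))
print(ctx.shape, vis_a.shape, word_a.shape)   # (2, 2048) (2, 36) (2, 10)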

Key words: image captioning, internal knowledge, external knowledge, word attention, knowledge graph, reinforcement learning
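
For the external-knowledge component, the abstract only states that knowledge extracted from knowledge graphs is injected into the encoder-decoder framework. The sketch below shows one plausible injection scheme consistent with that description: object labels detected in the image are expanded with related concepts and pooled into an extra semantic vector for the decoder. The RELATED_CONCEPTS dictionary is hand-written stand-in data for real knowledge-graph lookups (for example, queries against a resource such as ConceptNet), and all names are hypothetical.

import torch
import torch.nn as nn

RELATED_CONCEPTS = {                 # hypothetical stand-in for knowledge-graph lookups
    "dog":     ["pet", "bark", "leash"],
    "frisbee": ["throw", "catch", "park"],
    "grass":   ["lawn", "outdoor", "green"],
}

class KnowledgeInjector(nn.Module):
    """Sketch: expand detected labels with related concepts and pool them into one vector."""
    def __init__(self, vocab, embed_dim=300, out_dim=512):
        super().__init__()
        self.stoi = {w: i for i, w in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), embed_dim)
        self.fuse = nn.Linear(embed_dim, out_dim)

    def forward(self, detected_labels):
        # expand detected objects with related concepts, then average-pool their embeddings
        concepts = list(detected_labels)
        for label in detected_labels:
            concepts.extend(RELATED_CONCEPTS.get(label, []))
        ids = torch.tensor([self.stoi[c] for c in concepts if c in self.stoi])
        if ids.numel() == 0:
            return torch.zeros(self.fuse.out_features)
        return self.fuse(self.embed(ids).mean(dim=0))    # (out_dim,) semantic vector

vocab = ["dog", "frisbee", "grass", "pet", "bark", "leash", "throw",
         "catch", "park", "lawn", "outdoor", "green"]
injector = KnowledgeInjector(vocab)
semantic_vec = injector(["dog", "frisbee"])
print(semantic_vec.shape)   # torch.Size([512])
# The resulting vector could be concatenated with the visual context at each decoding
# step, or used to initialize the decoder's hidden state; the abstract does not specify.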

CLC number: 

  • TP391.41
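
The keyword list also mentions reinforcement learning. Image captioning models are commonly fine-tuned with a self-critical, policy-gradient objective in which the reward of a sampled caption is baselined by the reward of the greedily decoded caption; the abstract does not say which scheme this paper adopts, so the sketch below is only a schematic illustration, with a toy word-overlap reward standing in for a CIDEr-style scorer.

import torch

def toy_reward(caption, reference):
    # placeholder for a CIDEr-style scorer: fraction of reference words covered
    ref = set(reference.split())
    return len(ref & set(caption.split())) / max(len(ref), 1)

def self_critical_loss(sampled_caption, sampled_logprob, greedy_caption, reference):
    # sampled_logprob: summed log-probability of the sampled caption's words
    reward = toy_reward(sampled_caption, reference)
    baseline = toy_reward(greedy_caption, reference)
    # REINFORCE with the greedy-decoding reward as baseline; minimize the negative gain
    return -(reward - baseline) * sampled_logprob

logprob = torch.tensor(-4.2, requires_grad=True)
loss = self_critical_loss("a dog catches a frisbee", logprob,
                          "a dog on grass", "a dog catching a frisbee in the park")
loss.backward()
print(float(loss), float(logprob.grad))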