Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (5): 418-432. DOI: 10.16088/j.issn.1001-6600.2022013101

• Research Article •

Knowledge-aided Image Captioning

LI Zhixin*, SU Qiang

  1. Guangxi Key Lab of Multi-source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2022-01-31  Revised: 2022-04-15  Online: 2022-09-25  Published: 2022-10-18
  • Corresponding author: LI Zhixin (1971—), male, born in Guilin, Guangxi; professor and doctoral supervisor at Guangxi Normal University. E-mail: lizx@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (61966004, 61866004); Natural Science Foundation of Guangxi (2019GXNSFDA245018); Special Fund of the Guangxi "Bagui Scholar" Program

Abstract: Automatically generating a human-like description for a given image is one of the most important tasks in artificial intelligence. Most existing attention-based methods explore the mapping between words in the sentence and regions in the image; however, this unpredictable matching sometimes causes inharmonious alignments between the two modalities and reduces the quality of the generated captions. To address this problem, a text-related word attention is proposed to improve the correctness of visual attention while the model generates the description word by word. This special word attention emphasizes the importance of different words when the model attends to different regions of the input image, and it makes full use of the internal annotation knowledge in the training data to assist the computation of visual attention. Furthermore, to reveal implied information in the image that cannot be expressed straightforwardly by machines and to generate more novel and natural captions, external knowledge extracted from knowledge graphs is injected into the encoder-decoder framework. The method is validated on two image captioning benchmarks, the MSCOCO and Flickr30k datasets, and the experimental results demonstrate that it achieves good performance and outperforms many state-of-the-art approaches.
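
The word attention described above is computed from the annotation text and used to guide visual attention over image regions. Below is a minimal sketch, in PyTorch, of how such a text-related word attention could be fused with standard visual attention inside one decoding step. It is an illustrative assumption rather than the authors' released implementation, and every module, dimension, and variable name is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAidedVisualAttention(nn.Module):
    """Sketch: word attention over annotation words guides visual attention over regions."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, attn_dim):
        super().__init__()
        # word attention: scores each annotation word against the decoder state
        self.word_proj = nn.Linear(embed_dim, attn_dim)
        self.word_query = nn.Linear(hidden_dim, attn_dim)
        self.word_score = nn.Linear(attn_dim, 1)
        # visual attention: scores each image region, conditioned on both the
        # decoder state and the word-attention context
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.vis_query = nn.Linear(hidden_dim + embed_dim, attn_dim)
        self.vis_score = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, word_embeds, hidden):
        # region_feats: (B, R, feat_dim) image region features from the encoder
        # word_embeds:  (B, W, embed_dim) embeddings of internal annotation words
        # hidden:       (B, hidden_dim)   current decoder hidden state
        w = torch.tanh(self.word_proj(word_embeds) + self.word_query(hidden).unsqueeze(1))
        word_alpha = F.softmax(self.word_score(w).squeeze(-1), dim=1)        # (B, W)
        word_ctx = (word_alpha.unsqueeze(-1) * word_embeds).sum(dim=1)       # (B, embed_dim)

        # the word-attention context conditions the visual attention query
        query = torch.cat([hidden, word_ctx], dim=-1)
        v = torch.tanh(self.feat_proj(region_feats) + self.vis_query(query).unsqueeze(1))
        vis_alpha = F.softmax(self.vis_score(v).squeeze(-1), dim=1)          # (B, R)
        vis_ctx = (vis_alpha.unsqueeze(-1) * region_feats).sum(dim=1)        # (B, feat_dim)
        return vis_ctx, vis_alpha, word_alpha

# usage with random tensors: 2 images, 36 regions, 10 annotation words
attn = WordAidedVisualAttention(feat_dim=2048, embed_dim=300, hidden_dim=512, attn_dim=512)
ctx, vis_a, word_a = attn(torch.randn(2, 36, 2048), torch.randn(2, 10, 300), torch.randn(2, 512))
print(ctx.shape, vis_a.shape, word_a.shape)   # (2, 2048) (2, 36) (2, 10)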

Key words: image captioning, internal knowledge, external knowledge, word attention, knowledge graph, reinforcement learning
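
For the external-knowledge component, the abstract only states that knowledge extracted from knowledge graphs is injected into the encoder-decoder framework. The sketch below shows one plausible injection scheme consistent with that description: object labels detected in the image are expanded with related concepts and pooled into an extra semantic vector for the decoder. The RELATED_CONCEPTS dictionary is hand-written stand-in data for real knowledge-graph lookups (for example, queries against a resource such as ConceptNet), and all names are hypothetical.

import torch
import torch.nn as nn

RELATED_CONCEPTS = {                 # hypothetical stand-in for knowledge-graph lookups
    "dog":     ["pet", "bark", "leash"],
    "frisbee": ["throw", "catch", "park"],
    "grass":   ["lawn", "outdoor", "green"],
}

class KnowledgeInjector(nn.Module):
    """Sketch: expand detected labels with related concepts and pool them into one vector."""
    def __init__(self, vocab, embed_dim=300, out_dim=512):
        super().__init__()
        self.stoi = {w: i for i, w in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), embed_dim)
        self.fuse = nn.Linear(embed_dim, out_dim)

    def forward(self, detected_labels):
        # expand detected objects with related concepts, then average-pool their embeddings
        concepts = list(detected_labels)
        for label in detected_labels:
            concepts.extend(RELATED_CONCEPTS.get(label, []))
        ids = torch.tensor([self.stoi[c] for c in concepts if c in self.stoi])
        if ids.numel() == 0:
            return torch.zeros(self.fuse.out_features)
        return self.fuse(self.embed(ids).mean(dim=0))    # (out_dim,) semantic vector

vocab = ["dog", "frisbee", "grass", "pet", "bark", "leash", "throw",
         "catch", "park", "lawn", "outdoor", "green"]
injector = KnowledgeInjector(vocab)
semantic_vec = injector(["dog", "frisbee"])
print(semantic_vec.shape)   # torch.Size([512])
# The resulting vector could be concatenated with the visual context at each decoding
# step, or used to initialize the decoder's hidden state; the abstract does not specify.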

CLC number: 

  • TP391.41
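
The keyword list also mentions reinforcement learning. Image captioning models are commonly fine-tuned with a self-critical, policy-gradient objective in which the reward of a sampled caption is baselined by the reward of the greedily decoded caption; the abstract does not say which scheme this paper adopts, so the sketch below is only a schematic illustration, with a toy word-overlap reward standing in for a CIDEr-style scorer.

import torch

def toy_reward(caption, reference):
    # placeholder for a CIDEr-style scorer: fraction of reference words covered
    ref = set(reference.split())
    return len(ref & set(caption.split())) / max(len(ref), 1)

def self_critical_loss(sampled_caption, sampled_logprob, greedy_caption, reference):
    # sampled_logprob: summed log-probability of the sampled caption's words
    reward = toy_reward(sampled_caption, reference)
    baseline = toy_reward(greedy_caption, reference)
    # REINFORCE with the greedy-decoding reward as baseline; minimize the negative gain
    return -(reward - baseline) * sampled_logprob

logprob = torch.tensor(-4.2, requires_grad=True)
loss = self_critical_loss("a dog catches a frisbee", logprob,
                          "a dog on grass", "a dog catching a frisbee in the park")
loss.backward()
print(float(loss), float(logprob.grad))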