Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (5): 418-432. DOI: 10.16088/j.issn.1001-6600.2022013101


Knowledge-aided Image Captioning

LI Zhixin*, SU Qiang   

  1. Guangxi Key Lab of Multi-source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004, China
  • Received: 2022-01-31  Revised: 2022-04-15  Online: 2022-09-25  Published: 2022-10-18

Abstract: Automatically generating a human-like description for a given image is one of the most important tasks in artificial intelligence. Most existing attention-based methods explore the mapping relationships between words in the sentence and regions in the image. However, the quality of the generated captions can be degraded by this unpredictable matching manner, which sometimes causes inharmonious alignments. To solve this problem, a new method is proposed that uses word attention to improve the correctness of visual attention when generating word-by-word sequential descriptions. This word attention emphasizes the importance of each word when focusing on different regions of the input image, and makes full use of internal annotation knowledge to assist the computation of visual attention. Furthermore, in order to reveal implied information that cannot be expressed straightforwardly by machines and to generate more novel and natural captions, external knowledge extracted from knowledge graphs is injected into the encoder-decoder framework. Finally, the new method is validated on two available captioning benchmarks, the Microsoft COCO dataset and the Flickr30k dataset. The experimental results demonstrate that this approach achieves good performance and outperforms many state-of-the-art approaches.
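The two mechanisms sketched in the abstract, word attention guiding visual attention and knowledge-graph injection, can be illustrated in code. The following minimal PyTorch sketch shows one plausible way a word-attention context computed over previously generated words could steer additive visual attention over image regions; the module names, dimensions, and additive-attention form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAidedVisualAttention(nn.Module):
    """Sketch: visual attention whose region scores are conditioned on a
    word-attention context built from previously generated words."""

    def __init__(self, feat_dim, embed_dim, hidden_dim, att_dim):
        super().__init__()
        self.v_proj = nn.Linear(feat_dim, att_dim)    # project region features
        self.h_proj = nn.Linear(hidden_dim, att_dim)  # project decoder state
        self.w_proj = nn.Linear(embed_dim, att_dim)   # project word context
        self.score = nn.Linear(att_dim, 1)
        self.word_score = nn.Linear(embed_dim + hidden_dim, 1)

    def forward(self, regions, word_embs, h):
        # regions: (B, R, feat_dim); word_embs: (B, T, embed_dim); h: (B, hidden_dim)
        T = word_embs.size(1)
        # Word attention: weight previously generated words by their importance
        # to the current decoder state.
        w_in = torch.cat([word_embs, h.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        beta = F.softmax(self.word_score(w_in).squeeze(-1), dim=-1)       # (B, T)
        word_ctx = torch.bmm(beta.unsqueeze(1), word_embs).squeeze(1)     # (B, embed_dim)
        # Visual attention over regions, guided by the word-attention context.
        e = self.score(torch.tanh(
            self.v_proj(regions)
            + (self.h_proj(h) + self.w_proj(word_ctx)).unsqueeze(1)
        )).squeeze(-1)                                                    # (B, R)
        alpha = F.softmax(e, dim=-1)
        vis_ctx = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)       # (B, feat_dim)
        return vis_ctx, alpha, beta
```

For the external-knowledge side, the abstract does not specify the extraction pipeline; as a hypothetical stand-in, the helper below queries the public REST API of ConceptNet (the knowledge graph of reference [34]) for terms related to a detected object, which could then be fed to the decoder as candidate semantic concepts.

```python
import requests

def related_terms(word: str, limit: int = 5) -> list[str]:
    """Fetch English terms related to `word` from the ConceptNet knowledge graph."""
    resp = requests.get(f"https://api.conceptnet.io/c/en/{word}",
                        params={"limit": limit}, timeout=10)
    terms = []
    # Each ConceptNet edge connects a "start" and an "end" node with labels.
    for edge in resp.json().get("edges", []):
        for node in (edge.get("start", {}), edge.get("end", {})):
            label = node.get("label", "").lower()
            if label and label != word and label not in terms:
                terms.append(label)
    return terms[:limit]
```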

Key words: image captioning, internal knowledge, external knowledge, word attention, knowledge graph, reinforcement learning

CLC Number: TP391.41

[1]LI Zhixin, WEI Haiyang, ZHANG Canlong, et al. Research progress on image captioning[J]. Journal of Computer Research and Development, 2021, 58(9): 1951-1974. DOI:10.7544/issn1000-1239.2021.20200281.
[2]SHI Yile, YANG Wenzhong, DU Huixiang, et al. Survey of image captioning based on deep learning[J]. Acta Electronica Sinica, 2021, 49(10): 2048-2060. DOI:10.12263/DZXB.20200669.
[3]XU Hao, ZHANG Kai, TIAN Yingjie, et al. Survey of image captioning with deep neural networks[J]. Computer Engineering and Applications, 2021, 57(9): 9-22. DOI:10.3778/j.issn.1002-8331.2012-0539.
[4]LI Zhixin, WEI Haiyang, HUANG Feicheng, et al. Image captioning combining visual features and scene semantics[J]. Chinese Journal of Computers, 2020, 43(9): 1624-1640. DOI:10.11897/SP.J.1016.2020.01624.
[5]HOSSAIN Z, SOHEL F, SHIRATUDDIN M F, et al. A comprehensive survey of deep learning for image captioning[J]. ACM Computing Surveys, 2019, 51(6): 118. DOI:10.1145/3295748.
[6]VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2015: 3156-3164. DOI:10.1109/CVPR.2015.7298935.
[7]JIA X, GAVVES E, FERNANDO B, et al. Guiding the long-short term memory model for image caption generation[C]// 2015 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2015: 2407-2415. DOI:10.1109/ICCV.2015.277.
[8]YANG Z L, YUAN Y, WU Y X, et al. Encode, review, and decode: reviewer module for caption generation[EB/OL]. (2016-06-07)[2022-01-31]. https://arxiv.org/abs/1605.07912v3. DOI:10.48550/arXiv.1605.07912.
[9]MATHEWS A, XIE L X, HE X M. SentiCap: generating image descriptions with sentiments[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2016, 30(1): 3574-3580. DOI:10.1609/aaai.v30i1.10475.
[10]RAMANISHKA V, DAS A, ZHANG J M, et al. Top-down visual saliency guided by captions[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 3135-3144. DOI:10.1109/CVPR.2017.334.
[11]XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[J]. Proceedings of Machine Learning Research, 2015, 37: 2048-2057.
[12]YOU Q Z, JIN H L, WANG Z W, et al. Image captioning with semantic attention[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2016: 4651-4659. DOI:10.1109/CVPR.2016.503.
[13]LIU C X, MAO J H, SHA F, et al. Attention correctness in neural image captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4176-4182. DOI:10.1609/aaai.v31i1.11197.
[14]LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 3242-3250. DOI:10.1109/CVPR.2017.345.
[15]WEI Renyu, MENG Zuqiang. Image captioning model based on adaptive correction of attention features[J]. Journal of Computer Applications, 2020, 40(S1): 45-50.
[16]ZHANG Jiashuo, HONG Yu, LI Zhifeng, et al. Image caption generation based on bidirectional attention mechanism[J]. Journal of Chinese Information Processing, 2020, 34(9): 53-61. DOI:10.3969/j.issn.1003-0077.2020.09.008.
[17]LI Wenhui, ZENG Shangyou, WANG Jinjin. Image caption generation algorithm based on improved attention mechanism[J]. Journal of Computer Applications, 2021, 41(5): 1262-1267. DOI:10.11772/j.issn.1001-9081.2020071078.
[18]LI L H, TANG S, DENG L X, et al. Image caption with global-local attention[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4133-4139. DOI:10.1609/aaai.v31i1.11236.
[19]ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society, 2018: 6077-6086. DOI:10.1109/CVPR.2018.00636.
[20]REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]// Advances in Neural Information Processing Systems 28 (NIPS 2015). Red Hook, NY: Curran Associates, Inc., 2015: 91-99.
[21]SHENG Hao, YI Yaohua, TANG Ziwei. Image caption generation method fusing image scene and object saliency features[J]. Application Research of Computers, 2021, 38(12): 3776-3780. DOI:10.19734/j.issn.1001-3695.2021.02.0124.
[22]HENDRICKS L A, VENUGOPALAN S, ROHRBACH M, et al. Deep compositional captioning: describing novel object categories without paired training data[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2016: 1-10. DOI:10.1109/CVPR.2016.8.
[23]YAO T, PAN Y W, LI Y H, et al. Incorporating copying mechanism in image captioning for learning novel objects[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 5263-5271. DOI:10.1109/CVPR.2017.559.
[24]LI Y H, YAO T, PAN Y W, et al. Pointing novel objects in image captioning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2019: 12489-12498. DOI:10.1109/CVPR.2019.01278.
[25]ZHOU Y M, SUN Y W, HONAVAR V. Improving image captioning by leveraging knowledge graphs[C]// 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). Los Alamitos, CA: IEEE Computer Society, 2019: 283-293. DOI:10.1109/WACV.2019.00036.
[26]CHEN Kaiyang, XU Fan, WANG Mingwen. Research on fake news detection based on knowledge graph and image captioning[J]. Journal of Jiangxi Normal University (Natural Science Edition), 2021, 45(4): 398-402. DOI:10.16357/j.cnki.issn1000-5862.2021.04.12.
[27]RANZATO M, CHOPRA S, AULI M, et al. Sequence level training with recurrent neural networks[EB/OL]. (2016-05-06)[2022-01-31]. https://arxiv.org/abs/1511.06732v7. DOI:10.48550/arXiv.1511.06732.
[28]REN Z, WANG X Y, ZHANG N, et al. Deep reinforcement learning-based image captioning with embedding reward[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 1151-1159. DOI:10.1109/CVPR.2017.128.
[29]RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 1179-1195. DOI:10.1109/CVPR.2017.131.
[30]QIN Y, DU J J, ZHANG Y H, et al. Look back and predict forward in image captioning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2019: 8359-8367. DOI:10.1109/CVPR.2019.00856.
[31]PARK C C, KIM B C, KIM G H. Attend to you: personalized image captioning with context sequence memory networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 6432-6440. DOI:10.1109/CVPR.2017.681.
[32]KIM B, LEE Y H, JUNG H, et al. Distinctive-attribute extraction for image captioning[C]// Computer Vision-ECCV 2018 Workshops: LNCS Volume 11132. Cham: Springer, 2018: 133-144. DOI:10.1007/978-3-030-11018-5_12.
[33]BAHDANAU D, CHO K H, BENGIO Y. Neural machine translation by jointly learning to align and translate[EB/OL]. (2016-05-19)[2022-01-31]. https://arxiv.org/abs/1409.0473. DOI:10.48550/arXiv.1409.0473.
[34]SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4444-4451. DOI:10.1609/aaai.v31i1.11164.
[35]BENGIO S, VINYALS O, JAITLY N, et al. Scheduled sampling for sequence prediction with recurrent neural networks[C]// Advances in Neural Information Processing Systems 28 (NIPS 2015). Red Hook, NY: Curran Associates, Inc., 2015: 1171-1179.
[36]LIU S Q, ZHU Z H, YE N, et al. Optimization of image description metrics using policy gradient methods[EB/OL]. (2016-12-01)[2022-01-31]. https://arxiv.org/abs/1612.00370v1. DOI:10.48550/arXiv.1612.00370.
[37]LIU S Q, ZHU Z H, YE N, et al. Improved image captioning via policy gradient optimization of SPIDEr[C]// 2017 IEEE International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2017: 873-881. DOI:10.1109/ICCV.2017.100.
[38]JOHNSON J, KARPATHY A, LI F F. DenseCap: fully convolutional localization networks for dense captioning[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2016: 4565-4574. DOI:10.1109/CVPR.2016.494.
[39]PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2002: 311-318. DOI:10.3115/1073083.1073135.
[40]BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]// Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: Association for Computational Linguistics, 2005: 65-72.
[41]VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2015: 4566-4575. DOI:10.1109/CVPR.2015.7299087.
[42]LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// Text Summarization Branches Out. Stroudsburg, PA: Association for Computational Linguistics, 2004: 74-81.
[43]PEDERSOLI M, LUCAS T, SCHMID C, et al. Areas of attention for image captioning[C]// 2017 IEEE International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society, 2017: 1251-1259. DOI:10.1109/ICCV.2017.140.
[44]CORNIA M, BARALDI L, SERRA G, et al. Paying more attention to saliency: image captioning with saliency and context attention[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(2): 48. DOI:10.1145/3177745.
[45]CHEN L, ZHANG H W, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society, 2017: 6298-6306. DOI:10.1109/CVPR.2017.667.
[46]FU K, JIN J Q, CUI R P, et al. Aligning where to see and what to tell: image caption with region-based attention and scene-specific contexts[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2321-2334. DOI:10.1109/TPAMI.2016.2642953.