Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (4): 91-103. DOI: 10.16088/j.issn.1001-6600.2021101803

Image Captioning According to User’s Intention and Style

WANG Yuhang1, ZHANG Canlong1*, LI Zhixin1, WANG Zhiwen2   

  1. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin, Guangxi 541004, China;
  2. College of Computer Science and Communication Engineering, Guangxi University of Science and Technology, Liuzhou, Guangxi 545006, China
  Published: 2022-08-05

Abstract: Most image captioning models are individuality-agnostic and cannot generate a personalized description according to the user’s intention and language style. To address this problem, a personalized image captioning model is established in this paper that uses a fine-grained scene control graph and style control factors to represent the user’s intention and speaking style, respectively. Firstly, a scene control graph is constructed from the objects, object attributes and object relationships in the scene, so that the objects, attributes and relationships mentioned in the caption can be controlled. Secondly, a multi-graph convolutional neural network is used to encode the context information of the scene, and graph-flow attention is employed to control the focus of the description. Then, a style control module is added during sentence generation: keyword search is used to build a user profile from the user’s gender, age, education level and other information. Finally, the style generator extracts the corresponding style pattern according to the user profile, and the language decoder outputs a personalized image caption. Experimental results on the MSCOCO and FlickrStyle datasets show that the proposed method can generate personalized and diverse image captions.
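
To make the pipeline described in the abstract concrete, the sketch below shows one plausible way its stages could fit together. It is an editor's illustration under stated assumptions, not the authors' implementation: the class names (GraphEncoder, GraphFlowAttention, StyledDecoder), the style_vector lookup, and all dimensions are hypothetical, and the graph-flow attention is simplified to ordinary attention over graph nodes.

```python
# Editor's sketch (not the authors' code): scene-control-graph encoding,
# simplified graph-flow attention, and a style-conditioned decoder step.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """One graph-convolution layer over the scene control graph."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (N, dim) features of objects/attributes/relations
        # adj:   (N, N) normalized adjacency of the scene control graph
        return torch.relu(self.proj(adj @ nodes))

class GraphFlowAttention(nn.Module):
    """Attention over graph nodes given the decoder state (stand-in for graph-flow attention)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, nodes, state):
        # nodes: (N, dim); state: (1, dim) current decoder hidden state
        expanded = state.expand(nodes.size(0), -1)
        weights = torch.softmax(
            self.score(torch.cat([nodes, expanded], dim=-1)).squeeze(-1), dim=0)
        return weights.unsqueeze(0) @ nodes  # (1, dim) attended context

def style_vector(profile, style_table, dim):
    """Map coarse user-profile keywords (gender, age group, ...) to a style embedding."""
    key = (profile.get("gender", "any"), profile.get("age_group", "any"))
    return style_table.get(key, torch.zeros(1, dim))

class StyledDecoder(nn.Module):
    """One GRU decoding step conditioned on the graph context and the style vector."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(3 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word_emb, context, style, state):
        state = self.cell(torch.cat([word_emb, context, style], dim=-1), state)
        return self.out(state), state  # word logits, next hidden state

if __name__ == "__main__":
    dim, vocab = 8, 100
    enc, att, dec = GraphEncoder(dim), GraphFlowAttention(dim), StyledDecoder(dim, vocab)
    nodes = enc(torch.randn(5, dim), torch.eye(5))      # 5 scene-graph nodes
    style = style_vector({"gender": "female", "age_group": "adult"}, {}, dim)
    state = torch.zeros(1, dim)
    context = att(nodes, state)
    logits, state = dec.step(torch.randn(1, dim), context, style, state)
    print(logits.shape)  # torch.Size([1, 100])
```

In the paper's actual model, graph-flow attention follows the edges of the scene control graph rather than scoring all nodes independently, and the style generator is learned from styled caption corpora such as FlickrStyle; the sketch only mirrors the data flow summarized in the abstract.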

Key words: image captioning, user profile, fine-grained control, style control, attention mechanism

CLC Number: TP391.1