Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (4): 91-103. DOI: 10.16088/j.issn.1001-6600.2021101803

• Research Article •

Image Captioning According to User’s Intention and Style

WANG Yuhang1, ZHANG Canlong1*, LI Zhixin1, WANG Zhiwen2   

  1. Guangxi Key Lab of Multi-source Information Mining & Security (Guangxi Normal University), Guilin, Guangxi 541004, China;
    2. College of Computer Science and Communication Engineering, Guangxi University of Science and Technology, Liuzhou, Guangxi 545006, China
  • Published: 2022-08-05
  • Corresponding author: ZHANG Canlong (1975—), male, a native of Shuangfeng, Hunan; professor and Ph.D., Guangxi Normal University. E-mail: zcltyp@163.com
  • Supported by:
    National Natural Science Foundation of China (61866004, 61966004, 61962007); Natural Science Foundation of Guangxi (2018GXNSFDA281009, 2019GXNSFDA245018, 2018GXNSFDA294001); Systematic Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security (20-A-03-01); Guangxi "Bagui Scholar" Innovation Research Team

Abstract: Most existing image captioning models are individuality-agnostic: they cannot generate a personalized description that matches the user’s intention and language style. To address this problem, this paper proposes a personalized image captioning method in which a fine-grained scene control graph and style control factors represent the user’s intention and manner of speaking, respectively. First, a control graph over the objects in the scene, their attributes, and the relationships between them is constructed; through this graph the user specifies which objects, attributes, and relationships the description should express. Then, a multi-relational graph convolutional network is added to the encoder to encode the contextual information of the scene, and graph-flow attention is used to control the emphasis of the description. Finally, a style control module is added to sentence generation: a keyword search builds a user profile covering gender, age, education level, and other information; conditioned on this profile, the style generator extracts the corresponding style pattern, and the language decoder outputs a caption that reflects the user’s intention and style. Experimental results on the MSCOCO and FlickrStyle datasets show that the proposed method generates personalized and diverse image captions.
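The following is a minimal PyTorch sketch, illustrative only and not the authors' code, of the two encoder-side ideas named above: an R-GCN-style multi-relational graph convolution over the scene control graph, with a plain dot-product attention over node features standing in for the paper's graph-flow attention. The class names, dimensions, edge typing, and mean aggregation are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiRelationalGCNLayer(nn.Module):
    """One R-GCN-style layer: a separate projection per edge type, so that
    object-attribute edges and subject-predicate-object edges are encoded
    with different weights (an assumed, simplified design)."""
    def __init__(self, dim: int, num_edge_types: int):
        super().__init__()
        self.self_loop = nn.Linear(dim, dim)
        self.rel = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_edge_types))

    def forward(self, x: torch.Tensor, edges: list) -> torch.Tensor:
        # x: (num_nodes, dim) node features; edges: (src, dst, etype) triples
        out = self.self_loop(x)
        msg = torch.zeros_like(x)
        deg = torch.zeros(x.size(0))
        for src, dst, etype in edges:
            msg[dst] = msg[dst] + self.rel[etype](x[src])
            deg[dst] += 1
        # Mean-aggregate typed messages into each node, then apply ReLU.
        return F.relu(out + msg / deg.clamp(min=1).unsqueeze(1))

def node_attention(query: torch.Tensor, nodes: torch.Tensor) -> torch.Tensor:
    # Dot-product attention of a decoder state over graph nodes: the weights
    # decide which object/attribute/relation the next words should focus on.
    alpha = torch.softmax(nodes @ query, dim=0)   # (num_nodes,)
    return alpha @ nodes                          # (dim,) context vector

# Toy scene graph: man(0) -> tall(1) via an attribute edge (type 0);
# man(0) -> riding(2) -> horse(3) via relation edges (type 1).
x = torch.randn(4, 64)
edges = [(0, 1, 0), (0, 2, 1), (2, 3, 1)]
h = MultiRelationalGCNLayer(64, num_edge_types=2)(x, edges)
context = node_attention(torch.randn(64), h)
```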

Key words: image captioning, user profile, fine-grained scene control, style control, attention mechanism
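The style-control side of the abstract can be sketched in the same hedged spirit: a keyword search is assumed to have already populated a categorical user profile (gender, age band, education level), and a style generator maps that profile to an embedding that conditions the language decoder. The field vocabularies and the concatenation-based fusion below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical categorical profile fields recovered by keyword search.
GENDER = {"female": 0, "male": 1}
AGE    = {"child": 0, "youth": 1, "adult": 2, "senior": 3}
EDU    = {"primary": 0, "secondary": 1, "tertiary": 2}

class StyleGenerator(nn.Module):
    """Maps a user profile to a style embedding that conditions decoding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.gender = nn.Embedding(len(GENDER), dim)
        self.age    = nn.Embedding(len(AGE), dim)
        self.edu    = nn.Embedding(len(EDU), dim)
        self.fuse   = nn.Linear(3 * dim, dim)

    def forward(self, profile: dict) -> torch.Tensor:
        parts = torch.cat([
            self.gender(torch.tensor(GENDER[profile["gender"]])),
            self.age(torch.tensor(AGE[profile["age"]])),
            self.edu(torch.tensor(EDU[profile["education"]])),
        ])
        return torch.tanh(self.fuse(parts))   # (dim,) style vector

style = StyleGenerator(64)({"gender": "female", "age": "youth", "education": "tertiary"})
# One plausible way to condition generation: concatenate the style vector with
# the word embedding and the attended graph context as the LSTM input per step.
decoder = nn.LSTMCell(input_size=64 + 64 + 64, hidden_size=512)
```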

CLC number: TP391.1