广西师范大学学报(自然科学版), 2022, Vol. 40, Issue (4): 91-103. DOI: 10.16088/j.issn.1001-6600.2021101803
WANG Yuhang1, ZHANG Canlong1*, LI Zhixin1, WANG Zhiwen2
Abstract: Most existing image captioning models cannot generate personalized descriptions tailored to a user's intent and language style. To address this problem, this paper proposes a personalized image captioning method that reflects user intent and style. First, a structured graph of the objects in a scene, their attributes, and the relations between them is constructed; this graph specifies which objects, attributes, and inter-object relations the user wants the description to cover. Next, a multi-relational graph convolutional network is added to the encoder to encode the contextual information of the scene, and graph-flow attention is used to control the emphasis of the description. Finally, a user-style control module is added at sentence-generation time: keyword search is used to build a user profile covering gender, age, education level, and similar attributes, and this profile conditions a style generator that extracts the corresponding style, so that the final caption reflects both the user's intent and the user's style. Experimental results on the MSCOCO and FlickrStyle datasets show that the proposed method generates personalized and diverse image captions effectively.
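The abstract releases no code, but the encoder it describes, a multi-relational graph convolution over a scene graph of objects, attributes, and relations, can be illustrated with a minimal PyTorch sketch. All class names, dimensions, and the toy graph below are hypothetical assumptions for illustration, not the authors' implementation:

```python
# Minimal illustrative sketch (not the authors' code): one multi-relational
# graph-convolution layer over a scene graph, with a separate weight matrix
# per relation type (e.g. object-object relations vs. object-attribute edges).
import torch
import torch.nn as nn

class MultiRelationalGCNLayer(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        # One linear transform per relation type, plus a self-loop transform.
        self.rel_weights = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(num_relations)
        )
        self.self_loop = nn.Linear(dim, dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, node_feats: torch.Tensor, edges):
        """node_feats: (N, dim) node embeddings; edges: (src, dst, rel) triples."""
        msgs = torch.zeros_like(node_feats)
        for src, dst, rel in edges:
            # Relation-specific message passed from src node to dst node.
            msgs[dst] = msgs[dst] + self.rel_weights[rel](node_feats[src])
        return self.act(self.self_loop(node_feats) + msgs)

# Toy scene graph: node 0 = "dog", node 1 = "grass", node 2 = attribute "brown";
# relation 0 = spatial ("on"), relation 1 = attribute ("has").
feats = torch.randn(3, 512)            # hypothetical region/attribute embeddings
edges = [(0, 1, 0), (0, 2, 1)]         # (src, dst, relation-type) index triples
layer = MultiRelationalGCNLayer(dim=512, num_relations=2)
encoded = layer(feats, edges)          # (3, 512) context-aware node features
```

The graph-flow attention and the profile-conditioned style generator described in the abstract would operate on top of such encoded node features; they are omitted here for brevity.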