Construction of Chinese Multimodal Knowledge Base

doi:10.16088/j.issn.1001-6600.2021091504

Abstract

Multimodal fusion aims to integrate information from multiple modalities into a consistent, common model output, and is a fundamental problem in the multimodal field. Fusing multimodal information yields more comprehensive features and improves model robustness, and multimodal fusion has become one of the core research topics in the field. Based on ImageNet, HowNet and CCD, this paper constructs a new multimodal knowledge base through manual annotation. The mapping of 21 455 noun and verb concepts in ImageNet has been completed and calibrated, effectively mapping the concepts in HowNet and CCD to ImageNet. The dataset can be applied to natural language processing and computer vision tasks, where image information and concept information improve task performance. In image classification, adding HowNet and ImageNet concepts allows more image features to be fused to assist classification; in semantic understanding, adding image information through the mapping supports a better understanding of semantics.

Key words: multimodal information, multimodal fusion, ImageNet, HowNet, CCD
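
The abstract describes a knowledge base that links ImageNet concepts to HowNet and CCD concepts so that either side can be looked up from the other. A minimal sketch of such a mapping store is given below. It is not the authors' data format: the class names, field names, and the `hownet:` / `ccd:` identifier strings are all assumptions made for illustration, and only the ImageNet synset ID follows ImageNet's real naming scheme.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ConceptMapping:
    """One manually annotated link from an ImageNet synset to Chinese concepts.

    All field names and identifier formats here are hypothetical.
    """
    imagenet_wnid: str                  # ImageNet synset ID, e.g. "n02084071"
    hownet_ids: Tuple[str, ...] = ()    # placeholder HowNet concept identifiers
    ccd_ids: Tuple[str, ...] = ()       # placeholder CCD concept identifiers


class MultimodalKB:
    """In-memory index supporting lookup in both directions."""

    def __init__(self) -> None:
        self._by_wnid: Dict[str, ConceptMapping] = {}
        self._by_concept: Dict[str, List[str]] = {}

    def add(self, mapping: ConceptMapping) -> None:
        # Index the mapping by image class and by every attached concept.
        self._by_wnid[mapping.imagenet_wnid] = mapping
        for concept_id in mapping.hownet_ids + mapping.ccd_ids:
            self._by_concept.setdefault(concept_id, []).append(mapping.imagenet_wnid)

    def concepts_for(self, wnid: str) -> Tuple[str, ...]:
        """Concepts attached to an image class (e.g. extra features for classification)."""
        mapping = self._by_wnid.get(wnid)
        return mapping.hownet_ids + mapping.ccd_ids if mapping else ()

    def image_classes_for(self, concept_id: str) -> List[str]:
        """Image classes attached to a concept (e.g. grounding text in images)."""
        return list(self._by_concept.get(concept_id, []))


kb = MultimodalKB()
# Illustrative entry only; the concept identifiers are made up for this sketch.
kb.add(ConceptMapping("n02084071", hownet_ids=("hownet:dog",), ccd_ids=("ccd:dog",)))
```

In this sketch, `kb.concepts_for(...)` corresponds to the image classification use case (concepts enrich an image class), and `kb.image_classes_for(...)` corresponds to the semantic understanding use case (a concept is tied back to images).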

CLC number: TP391.1

CHAO Rui, ZHANG Kunli, WANG Jiajia, HU Bin, ZHANG Weicong, HAN Yingjie, ZAN Hongying. Construction of Chinese Multimodal Knowledge Base[J]. Journal of Guangxi Normal University (Natural Science Edition), 2022, 40(3): 31-39.
