Journal of Guangxi Normal University (Natural Science Edition), 2021, Vol. 39, Issue (3): 20-26. doi: 10.16088/j.issn.1001-6600.2020051802
LÜ Huilian (吕惠炼), HU Weiping* (胡维平)
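Abstract: Speech emotion recognition is an important component of natural human-computer interaction. Traditional speech emotion recognition systems focus mainly on feature extraction and model construction. This paper proposes a speech emotion recognition method that applies a deep neural network directly to the raw signal. Raw speech data carries the emotional information, two-dimensional spatial information, and temporal context information of the speech signal. The model is trained end to end: the network learns feature representations of the raw speech signal automatically, with no hand-crafted feature extraction step. The network combines the strengths of two kinds of neural network, CNN and BLSTM. A CNN learns spatial features from the raw speech data, and a BLSTM added after it learns contextual features. To evaluate the model, recognition tests were carried out on the IEMOCAP database, yielding a WA of 71.39% and a UA of 61.06%. A comparison with baseline models further verifies the effectiveness of the proposed method.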
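The abstract reports WA and UA without defining them. In the speech emotion recognition literature these usually denote weighted accuracy (overall accuracy across all utterances) and unweighted accuracy (the mean of per-class recalls); the sketch below assumes that convention, which may differ in detail from the paper's own evaluation code. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correctly classified utterances / all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, deliberately imbalanced labels (made up for illustration only).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On imbalanced data a classifier that favors the majority class scores higher on WA than on UA, which is one plausible reading of the gap between the two figures reported above.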