Journal of Guangxi Normal University (Natural Science Edition) ›› 2021, Vol. 39 ›› Issue (3): 20-26. DOI: 10.16088/j.issn.1001-6600.2020051802


Research on Speech Emotion Recognition Based on End-to-End Deep Neural Network

LÜ Huilian, HU Weiping*

  1. College of Electronic Engineering, Guangxi Normal University, Guilin, Guangxi 541004, China
  • Received: 2020-05-18  Revised: 2020-10-17  Published: 2021-05-13
  • Corresponding author: HU Weiping (b. 1963), male, from Guilin, Guangxi; professor at Guangxi Normal University, PhD. E-mail: huwp@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (61861005)


Abstract: Speech emotion recognition is an important component of natural human-computer interaction. Traditional speech emotion recognition systems focus mainly on hand-crafted feature extraction and model construction. This paper proposes a speech emotion recognition method that applies a deep neural network directly to the raw signal. Raw speech carries the emotional information, two-dimensional spatial information and temporal context information of the speech signal. The proposed model is trained in an end-to-end manner: the network automatically learns a feature representation from the raw speech signal, with no manual feature-extraction step. The model combines the strengths of both CNN and BLSTM networks: a CNN learns spatial features from the raw speech data, and a BLSTM is then added to learn contextual features. To evaluate the effectiveness of the model, recognition tests are carried out on the IEMOCAP database, yielding a WA of 71.39% and a UA of 61.06%. In addition, comparison with baseline models verifies the effectiveness of the proposed method.

Key words: speech emotion recognition, CNN, BLSTM, end-to-end, raw speech
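The pipeline the abstract describes, a 1-D convolutional front end applied directly to the raw waveform, followed by a BLSTM over the resulting frame sequence and a softmax classifier, can be sketched in NumPy as below. All layer sizes, filter lengths, the 25 ms/10 ms framing, and the four-class label set are illustrative assumptions, not the paper's actual configuration, and the weights are random and untrained; the sketch only shows how the shapes flow end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b, stride):
    """Strided 1-D convolution with ReLU. x: (T,), w: (C, K), b: (C,) -> (C, T_out)."""
    C, K = w.shape
    T_out = (len(x) - K) // stride + 1
    out = np.empty((C, T_out))
    for t in range(T_out):
        out[:, t] = w @ x[t * stride : t * stride + K] + b
    return np.maximum(out, 0.0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.size
    z = W @ x + U @ h + b
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def blstm(X, params_fw, params_bw):
    """Run the sequence forward and backward; concatenate final hidden states."""
    H = params_fw[2].size // 4
    def run(seq, params):
        h, c = np.zeros(H), np.zeros(H)
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
        return h
    return np.concatenate([run(X, params_fw), run(X[::-1], params_bw)])

# Toy raw waveform: 1 s at 16 kHz (random stand-in for an IEMOCAP utterance).
wave = rng.standard_normal(16000)

# CNN front end: 8 filters of 400 samples (25 ms), stride 160 (10 ms hop).
w = rng.standard_normal((8, 400)) * 0.01
b = np.zeros(8)
feat = conv1d(wave, w, b, stride=160)  # (8, 98) learned time-frequency-like map

# BLSTM over the frame sequence -> one utterance-level embedding.
D, H = 8, 16
init = lambda: (rng.standard_normal((4*H, D)) * 0.1,
                rng.standard_normal((4*H, H)) * 0.1,
                np.zeros(4*H))
utt = blstm(feat.T, init(), init())  # (32,)

# Linear + softmax over 4 emotion classes (an IEMOCAP-style label set).
Wc = rng.standard_normal((4, 32)) * 0.1
logits = Wc @ utt
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # class posteriors, one per emotion
```

In a real system the convolution and recurrence would come from a deep-learning framework and be trained jointly by backpropagation, which is what makes the model end to end: the gradient of the classification loss shapes the front-end filters as well as the recurrent layers.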

CLC number: TN912.34
Copyright © Editorial Office of Journal of Guangxi Normal University (Natural Science Edition)
Address: 15 Yucai Road, Sanlidian, Guilin, Guangxi 541004, China
Tel: 0773-5857325  E-mail: gxsdzkb@mailbox.gxnu.edu.cn