Research on Speech Emotion Recognition Based on End-to-End Deep Neural Network

doi:10.16088/j.issn.1001-6600.2020051802

Abstract

Speech emotion recognition is an important part of natural human-computer interaction. Traditional speech emotion recognition systems focus mainly on feature extraction and model construction. This paper proposes a speech emotion recognition method that applies a deep neural network directly to the raw speech signal. The raw speech data carry the emotional information, two-dimensional spatial information, and temporal context information of the speech signal. The proposed model is trained end to end: the network automatically learns a feature representation of the raw speech signal, with no manual feature extraction. The network combines the strengths of convolutional neural networks (CNN) and bidirectional long short-term memory (BLSTM) networks: the CNN learns spatial features from the raw speech data, and a BLSTM layer then learns contextual features. To evaluate the model, recognition experiments are carried out on the IEMOCAP database, yielding a weighted accuracy (WA) of 71.39% and an unweighted accuracy (UA) of 61.06%. A comparison with the baseline model further verifies the effectiveness of the proposed method.
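The two figures reported in the abstract follow the usual convention in speech emotion recognition: WA is the overall accuracy across all test utterances, and UA is the mean of the per-class recalls, so every emotion class counts equally regardless of its size. The sketch below illustrates that convention in plain Python. The function names and the toy labels are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA: correctly classified utterances divided by all utterances."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: per-class recall averaged over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly recognised utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy data (not from IEMOCAP).
y_true = ["neu"] * 6 + ["ang", "ang"] + ["hap"] + ["sad"]
y_pred = ["neu"] * 6 + ["ang", "neu"] + ["neu"] + ["sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On this toy data the majority class is recognised perfectly while the smaller classes are not, so WA comes out higher than UA. A gap in the same direction appears between the reported 71.39% and 61.06%, which is what one would expect if the classes in the evaluation set are unevenly represented.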

Key words: speech emotion recognition, CNN, BLSTM, end-to-end, raw speech

CLC Number: TN912.34

LÜ Huilian, HU Weiping. Research on Speech Emotion Recognition Based on End-to-End Deep Neural Network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2021, 39(3): 20-26.
