Journal of Guangxi Normal University (Natural Science Edition) ›› 2021, Vol. 39 ›› Issue (3): 20-26. DOI: 10.16088/j.issn.1001-6600.2020051802


Research on Speech Emotion Recognition Based on End-to-End Deep Neural Network

LÜ Huilian, HU Weiping*

  1. College of Electronic Engineering, Guangxi Normal University, Guilin, Guangxi 541004, China
  • Received: 2020-05-18  Revised: 2020-10-17  Published: 2021-05-13
  • Corresponding author: HU Weiping (b. 1963), male, from Guilin, Guangxi; professor at Guangxi Normal University, PhD. E-mail: huwp@gxnu.edu.cn
  • Funding: National Natural Science Foundation of China (61861005)


Abstract: Speech emotion recognition is an important component of natural human-computer interaction. Traditional speech emotion recognition systems focus mainly on hand-crafted feature extraction and model construction. This paper proposes a speech emotion recognition method that applies a deep neural network directly to the raw signal. Raw speech carries the emotional information, two-dimensional spatial information and temporal context information of the speech signal. The proposed model is trained in an end-to-end manner: the network automatically learns a feature representation from the raw speech signal, with no manual feature-extraction step. The model combines the strengths of both CNN and BLSTM networks: a CNN learns spatial features from the raw speech data, and a BLSTM is then added to learn contextual features. To evaluate the effectiveness of the model, recognition tests are carried out on the IEMOCAP database, yielding a WA of 71.39% and a UA of 61.06%. In addition, comparison with baseline models verifies the effectiveness of the proposed method.

Key words: speech emotion recognition, CNN, BLSTM, end-to-end, raw speech
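The pipeline the abstract describes, a 1-D convolutional front end applied directly to the raw waveform, followed by a BLSTM over the resulting frame sequence and a softmax classifier, can be sketched in NumPy as below. All layer sizes, filter lengths, the 25 ms/10 ms framing, and the four-class label set are illustrative assumptions, not the paper's actual configuration, and the weights are random and untrained; the sketch only shows how the shapes flow end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b, stride):
    """Strided 1-D convolution with ReLU. x: (T,), w: (C, K), b: (C,) -> (C, T_out)."""
    C, K = w.shape
    T_out = (len(x) - K) // stride + 1
    out = np.empty((C, T_out))
    for t in range(T_out):
        out[:, t] = w @ x[t * stride : t * stride + K] + b
    return np.maximum(out, 0.0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.size
    z = W @ x + U @ h + b
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def blstm(X, params_fw, params_bw):
    """Run the sequence forward and backward; concatenate final hidden states."""
    H = params_fw[2].size // 4
    def run(seq, params):
        h, c = np.zeros(H), np.zeros(H)
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
        return h
    return np.concatenate([run(X, params_fw), run(X[::-1], params_bw)])

# Toy raw waveform: 1 s at 16 kHz (random stand-in for an IEMOCAP utterance).
wave = rng.standard_normal(16000)

# CNN front end: 8 filters of 400 samples (25 ms), stride 160 (10 ms hop).
w = rng.standard_normal((8, 400)) * 0.01
b = np.zeros(8)
feat = conv1d(wave, w, b, stride=160)  # (8, 98) learned time-frequency-like map

# BLSTM over the frame sequence -> one utterance-level embedding.
D, H = 8, 16
init = lambda: (rng.standard_normal((4*H, D)) * 0.1,
                rng.standard_normal((4*H, H)) * 0.1,
                np.zeros(4*H))
utt = blstm(feat.T, init(), init())  # (32,)

# Linear + softmax over 4 emotion classes (an IEMOCAP-style label set).
Wc = rng.standard_normal((4, 32)) * 0.1
logits = Wc @ utt
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # class posteriors, one per emotion
```

In a real system the convolution and recurrence would come from a deep-learning framework and be trained jointly by backpropagation, which is what makes the model end to end: the gradient of the classification loss shapes the front-end filters as well as the recurrent layers.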

CLC number: TN912.34
Copyright © Editorial Office of Journal of Guangxi Normal University (Natural Science Edition)
Address: 15 Yucai Road, Sanlidian, Guilin, Guangxi 541004, China
Tel: 0773-5857325  E-mail: gxsdzkb@mailbox.gxnu.edu.cn