广西师范大学学报(自然科学版) ›› 2023, Vol. 41 ›› Issue (2): 1-18.doi: 10.16088/j.issn.1001-6600.2022083002

• 综述 •

声音事件检测综述

杨烁祯1, 张珑1*, 王建华2, 张恒远1   

  1. 天津师范大学 计算机与信息工程学院,天津 300387;
    2.广州华立科技职业学院 计算机信息工程学院,广东 广州 511325
  • 收稿日期:2022-08-30 修回日期:2022-10-25 出版日期:2023-03-25 发布日期:2023-04-25
  • 通讯作者: 张珑(1978—),男,江苏邳州人,天津师范大学教授,博士。E-mail:zhanglong@tjnu.edu.cn
  • 基金资助:
    国家自然科学基金(61771173);天津市自然科学基金(20JCZDJC00400)

Review of Sound Event Detection

YANG Shuozhen1, ZHANG Long1*, WANG Jianhua2, ZHANG Hengyuan1   

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China;
    2. College of Computer Information Engineering, Guangzhou Huali Vocational College of Science and Technology, Guangzhou, Guangdong 511325, China
  • Received:2022-08-30 Revised:2022-10-25 Online:2023-03-25 Published:2023-04-25

摘要: 声音事件检测技术能够识别出一个音频段中存在的事件类别并标注出各事件的起止时间,在智能城市、医疗监控、野生动物保护等应用场景有巨大潜力,是机器听觉领域的一个重要研究课题。本文从监督学习和半监督学习2个方面对声音事件检测方法进行综述,汇总和分析现有研究中使用的特征、检测模型及其性能。对于监督学习,重点介绍机器学习方法和深度学习方法。对于半监督学习,总结基于均值教师、协同训练、多尺度卷积和注意力机制等4种有效方法。最后,介绍常用数据集和评价指标,并讨论未来可能的研究方向,包括声音分离预处理、合成数据和真实数据域适应、自注意力模型优化、特征选择和融合、流式系统建模等问题。
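As an illustration of the detection output described in the abstract (event class plus onset and offset time), the following minimal sketch shows one common way to threshold frame-level class posteriors and merge contiguous active frames into events. The class list, frame hop and threshold are hypothetical placeholders, not values taken from any surveyed system.

```python
# Illustrative sketch (not from the paper): decode frame-level SED posteriors
# into (onset, offset, class) events by thresholding and merging active frames.
import numpy as np

CLASSES = ["speech", "dog", "alarm"]   # hypothetical event classes
HOP_SECONDS = 0.02                     # assumed frame hop (20 ms)
THRESHOLD = 0.5                        # assumed activation threshold

def decode_events(posteriors: np.ndarray):
    """posteriors: array of shape (num_frames, num_classes) with values in [0, 1]."""
    events = []
    active = posteriors >= THRESHOLD                   # binarize each frame
    for c, name in enumerate(CLASSES):
        onset = None
        for t, is_active in enumerate(active[:, c]):
            if is_active and onset is None:
                onset = t                              # event starts
            elif not is_active and onset is not None:
                events.append((onset * HOP_SECONDS, t * HOP_SECONDS, name))
                onset = None                           # event ends
        if onset is not None:                          # event runs to the clip end
            events.append((onset * HOP_SECONDS, len(active) * HOP_SECONDS, name))
    return events

# Toy usage: 200 frames of random scores for 3 classes
print(decode_events(np.random.rand(200, len(CLASSES))))
```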

关键词: 声音事件检测, 机器学习, 深度学习, 神经网络, 监督学习, 半监督学习

Abstract: Sound event detection technology can identify the types of events present in an audio segment and mark the onset and offset time of each event. It has great potential in application scenarios such as smart cities, medical monitoring, and wildlife protection, and is an important research topic in the field of machine hearing. This paper reviews sound event detection methods from the two aspects of supervised learning and semi-supervised learning, and summarizes and analyzes the features, detection models and their performance used in existing research. For supervised learning, machine learning methods and deep learning methods are highlighted. For semi-supervised learning, four effective methods based on the mean teacher, co-training, multi-scale convolution and attention mechanisms are summarized. Finally, common datasets and evaluation metrics are introduced, and possible future research directions are discussed, including sound separation preprocessing, domain adaptation between synthetic and real data, self-attention model optimization, feature selection and fusion, and streaming system modeling.
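To make the mean-teacher approach mentioned above concrete, the sketch below outlines the exponential-moving-average teacher update and consistency loss in the spirit of reference [40]. It assumes PyTorch; the toy linear model, loss weight and tensor shapes are placeholder assumptions standing in for the CRNN backbones and data pipelines of the surveyed systems, not any author's implementation.

```python
# Minimal mean-teacher sketch (cf. reference [40]); model and data are toy placeholders.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def train_step(student, teacher, optimizer, labeled, unlabeled, cons_weight=1.0):
    x_l, y_l = labeled            # labeled clips and multi-label targets
    x_u = unlabeled               # unlabeled clips (real systems perturb the two views differently)
    # Supervised loss on labeled data
    sup_loss = F.binary_cross_entropy_with_logits(student(x_l), y_l)
    # Consistency loss: student predictions should match the frozen teacher on unlabeled data
    with torch.no_grad():
        teacher_prob = torch.sigmoid(teacher(x_u))
    cons_loss = F.mse_loss(torch.sigmoid(student(x_u)), teacher_prob)
    loss = sup_loss + cons_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher tracks the student after each step
    return loss.item()

# Usage sketch with a linear layer standing in for a frame-level CRNN
student = torch.nn.Linear(64, 10)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
train_step(student, teacher, optimizer,
           (torch.randn(8, 64), torch.randint(0, 2, (8, 10)).float()),
           torch.randn(8, 64))
```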

Key words: sound event detection, machine learning, deep learning, neural networks, supervised learning, semi-supervised learning

中图分类号: TN912.3
[1] TAN E L, KARNAPI F A, NG L J, et al. Extracting urban sound information for residential areas in smart cities using an end-to-end IoT system[J]. IEEE Internet of Things Journal, 2021, 8(18): 14308-14321. DOI: 10.1109/JIOT.2021.3068755.
[2] PANDYA S, GHAYVAT H. Ambient acoustic event assistive framework for identification, detection, and recognition of unknown acoustic events of a residence[J]. Advanced Engineering Informatics, 2021, 47: 101238. DOI: 10.1016/j.aei.2020.101238.
[3] 李玲俐. 家庭保健监测系统中环境声音事件的识别[J]. 重庆师范大学学报(自然科学版), 2016, 33(4): 118-122.
[4] 张丽君. 公共场所异常声音识别算法设计与研究[D]. 重庆: 重庆大学, 2017.
[5] ARSLAN Y, CANBOLAT H. Sound based alarming based video surveillance system design[J]. Multimedia Tools and Applications, 2022, 81(6): 7969-7991. DOI: 10.1007/s11042-022-12028-6.
[6] MOUAWAD P, DUBNOV T, DUBNOV S. Robust detection of COVID-19 in cough sounds[J]. SN Computer Science, 2021, 2(1): 34. DOI: 10.1007/s42979-020-00422-6.
[7] 苏映新. 自适应粒子群优化匹配追踪声音事件识别算法[J]. 激光与光电子学进展, 2020, 57(10): 101502. DOI: 10.3788/LOP57.101502.
[8] TANG T T, LIANG Y H, LONG Y H. Two improved architectures based on prototype network for few-shot bioacoustic event detection[R/OL]. (2021-06-10)[2022-08-30]. https://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Tang_54_task5.pdf.
[9] HEITTOLA T, MESAROS A, VIRTANEN T, et al. Supervised model training for overlapping sound events based on unsupervised source separation[C]// 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 8677-8681. DOI: 10.1109/ICASSP.2013.6639360.
[10] DE BENITO-GORRON D, SEGOVIA S, RAMOS D, et al. Multiple feature resolutions for different polyphonic sound detection score scenarios in DCASE 2021 Task 4[C]// Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events 2021 (DCASE 2021). Barcelona: DCASE, 2021: 65-69. DOI: 10.5281/zenodo.5770113.
[11] 石自强, 韩纪庆, 郑铁然. 鲁棒声学事件检测综述[J]. 智能计算机与应用, 2012, 2(6): 31-35.
[12] DANG A, VU T H, WANG J C. A survey of deep learning for polyphonic sound event detection[C]// 2017 International Conference on Orange Technologies (ICOT). Piscataway, NJ: IEEE, 2017: 75-78. DOI: 10.1109/ICOT.2017.8336092.
[13] XIA X J, TOGNERI R, SOHEL F, et al. A survey: neural network-based deep learning for acoustic event detection[J]. Circuits, Systems, and Signal Processing, 2019, 38(8): 3433-3453. DOI: 10.1007/s00034-019-01094-1.
[14] EDDY S R. What is a hidden Markov model?[J]. Nature Biotechnology, 2004, 22(10): 1315-1316. DOI: 10.1038/nbt1004-1315.
[15] REYNOLDS D. Gaussian mixture models[M]// LI S Z, JAIN A K. Encyclopedia of Biometrics. Boston, MA: Springer, 2009: 659-663. DOI: 10.1007/978-0-387-73003-5_196.
[16] XIANG Y, SHI L M, HØJVANG J L, et al. A speech enhancement algorithm based on a non-negative hidden Markov model and Kullback-Leibler divergence[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2022, 2022: 22. DOI: 10.1186/s13636-022-00256-5.
[17] MESAROS A, HEITTOLA T, ERONEN A, et al. Acoustic event detection in real life recordings[C]// 2010 18th European Signal Processing Conference. Piscataway, NJ: IEEE, 2010: 1267-1271.
[18] MAHMOOD A, KÖSE U. Speech recognition based on convolutional neural networks and MFCC algorithm[J]. Advances in Artificial Intelligence Research, 2021, 1(1): 6-12.
[19] FORNEY G D. The Viterbi algorithm[J]. Proceedings of the IEEE, 1973, 61(3): 268-278. DOI: 10.1109/PROC.1973.9030.
[20] HEITTOLA T, MESAROS A, ERONEN A, et al. Context-dependent sound event detection[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2013, 2013: 1. DOI: 10.1186/1687-4722-2013-1.
[21] ERONEN A J, PELTONEN V T, TUOMI J T, et al. Audio-based context recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(1): 321-329. DOI: 10.1109/TSA.2005.854103.
[22] RYYNANEN M P, KLAPURI A. Polyphonic music transcription using note event modeling[C]// IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005. Piscataway, NJ: IEEE, 2005: 319-322. DOI: 10.1109/ASPAA.2005.1540233.
[23] 徐慧敏. 非负矩阵分解算法及应用研究[D]. 无锡:江南大学, 2020. DOI: 10.27169/d.cnki.gwqgu.2020.000755.
[24] HEITTOLA T, MESAROS A, VIRTANEN T, et al. Sound event detection in multisource environments using source separation[C]// First International Workshop on Machine Listening in Multisource Environments (CHiME 2011). Florence: CHiME, 2011: 36-40.
[25] CAKIR E, HEITTOLA T, HUTTUNEN H, et al. Polyphonic sound event detection using multi label deep neural networks[C]// 2015 International Joint Conference on Neural Networks (IJCNN). Piscataway, NJ: IEEE, 2015: 1-7. DOI: 10.1109/IJCNN.2015.7280624.
[26] 李先苦. 基于深度学习的声学场景分类与声音事件检测[D]. 广州:华南理工大学, 2019. DOI: 10.27151/d.cnki.ghnlu.2019.001370.
[27] 杨利平, 郝峻永, 辜小花, 等. 音频标记一致性约束 CRNN 声音事件检测[J]. 电子与信息学报, 2022, 44(3): 1102-1110. DOI: 10.11999/JEIT210131.
[28] HEITTOLA T, MESAROS A, ERONEN A, et al. Audio context recognition using audio event histograms[C]// 2010 18th European Signal Processing Conference. Piscataway, NJ: IEEE, 2010: 1272-1276.
[29] PARASCANDOLO G, HUTTUNEN H, VIRTANEN T. Recurrent neural networks for polyphonic sound event detection in real life recordings[C]// 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). Piscataway, NJ: IEEE, 2016: 6440-6444. DOI: 10.1109/ICASSP.2016.7472917.
[30] XIA X J, TOGNERI R, SOHEL F, et al. Confidence based acoustic event detection[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2018: 306-310. DOI: 10.1109/ICASSP.2018.8461845.
[31] VESPERINI F, GABRIELLI L, PRINCIPI E, et al. Polyphonic sound event detection by using capsule neural networks[J]. IEEE Journal of Selected Topics in Signal Processing, 2019, 13(2): 310-322. DOI: 10.1109/JSTSP.2019.2902305.
[32] SABOUR S, FROSST N, HINTON G E. Dynamic routing between capsules[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 3859-3869.
[33] 杨巨成, 韩书杰, 毛磊, 等. 胶囊网络模型综述[J]. 山东大学学报(工学版), 2019, 49(6): 1-10.
[34] 刘亚明. 基于深层神经网络的多声音事件检测方法研究[D]. 合肥:中国科学技术大学, 2019.
[35] CAKIR E, PARASCANDOLO G, HEITTOLA T, et al. Convolutional recurrent neural networks for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(6): 1291-1303. DOI: 10.1109/TASLP.2017.2690575.
[36] WANG Y B, ZHAO G H, XIONG K, et al. Multi-scale and single-scale fully convolutional networks for sound event detection[J]. Neurocomputing, 2021, 421: 51-65. DOI: 10.1016/j.neucom.2020.09.038.
[37] WANG Y B, ZHAO G H, XIONG K, et al. MSFF-Net: Multi-scale feature fusing networks with dilated mixed convolution and cascaded parallel framework for sound event detection[J]. Digital Signal Processing, 2022, 122: 103319. DOI: 10.1016/j.dsp.2021.103319.
[38] XIA X J, TOGNERI R, SOHEL F, et al. Auxiliary classifier generative adversarial network with soft labels in imbalanced acoustic event detection[J]. IEEE Transactions on Multimedia, 2018, 21(6): 1359-1371. DOI: 10.1109/TMM.2018.2879750.
[39] CRESWELL A, WHITE T, DUMOULIN V, et al. Generative adversarial networks: an overview[J]. IEEE Signal Processing Magazine, 2018, 35(1): 53-65. DOI: 10.1109/MSP.2017.2765202.
[40] TARVAINEN A, VALPOLA H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 1195-1204.
[41] NING X, WANG X R, XU S H, et al. A review of research on co-training[J/OL]. Concurrency and Computation: Practice and Experience, 2021: e6276[2022-08-30]. https://onlinelibrary.wiley.com/doi/10.1002/cpe.6276.
[42] MABUDE K, MALELA-MAJIKA J C, CASTAGLIOLA P, et al. Generally weighted moving average monitoring schemes: overview and perspectives[J]. Quality and Reliability Engineering International, 2021, 37(2): 409-432. DOI: 10.1002/qre.2765.
[43] KIM N K, KIM H K. Polyphonic sound event detection based on residual convolutional recurrent neural network with semi-supervised loss function[J]. IEEE Access, 2021, 9: 7564-7575. DOI: 10.1109/ACCESS.2020.3048675.
[44] LIU Y Z, CHEN H T, ZHAO Q W, et al. Master-Teacher-Student: a weakly labelled semi-supervised framework for audio tagging and sound event detection[J]. IEICE Transactions on Information and Systems, 2022, 105(4): 828-831. DOI: 10.1587/transinf.2021EDL8082.
[45] ZHENG X, FU C, XIE H Y, et al. Uncertainty-aware deep co-training for semi-supervised medical image segmentation[J]. Computers in Biology and Medicine, 2022, 149: 106051. DOI: 10.1016/j.compbiomed.2022.106051.
[46] ZHENG X, SONG Y, DAI L R, et al. An effective mutual mean teaching based domain adaptation method for sound event detection[C]// Proceedings of Interspeech 2021. Baixas: International Speech Communication Association, 2021: 556-560. DOI: 10.21437/Interspeech.2021-281.
[47] DONG S S, LIU C. Sentiment classification for financial texts based on deep learning[J]. Computational Intelligence and Neuroscience, 2021, 2021: 9524705. DOI: 10.1155/2021/9524705.
[48] FARAHANI A, VOGHOEI S, RASHEED K, et al. A brief review of domain adaptation[C]// Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020. Cham: Springer Nature Switzerland AG, 2021: 877-894. DOI: 10.1007/978-3-030-71704-9_65.
[49] IMOTO K, MISHIMA S, ARAI Y, et al. Impact of data imbalance caused by inactive frames and difference in sound duration on sound event detection performance[J]. Applied Acoustics, 2022, 196: 108882. DOI: 10.1016/j.apacoust.2022.108882.
[50] 郑伟哲, 仇鹏, 韦娟. 弱标签环境下基于多尺度注意力融合的声音识别检测[J]. 计算机科学, 2020, 47(5): 120-123.
[51] KIM S J, CHUNG Y J. Multi-scale features for transformer model to improve the performance of sound event detection[J]. Applied Sciences, 2022, 12(5): 2626. DOI: 10.3390/app12052626.
[52] ZHOU Q, WANG J, LIU J, et al. RSANet: towards real-time object detection with residual semantic-guided attention feature pyramid network[J]. Mobile Networks and Applications, 2021, 26(1): 77-87. DOI: 10.1007/s11036-020-01723-z.
[53] KOH C Y, CHEN Y S, LIU Y W, et al. Sound event detection by consistency training and pseudo-labeling with feature-pyramid convolutional recurrent neural networks[C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 376-380. DOI: 10.1109/ICASSP39728.2021.9414350.
[54] VERMA V, KAWAGUCHI K, LAMB A, et al. Interpolation consistency training for semi-supervised learning[J]. Neural Networks, 2022, 145: 90-106. DOI: 10.1016/j.neunet.2021.10.008.
[55] JIN Y, WANG M, LUO L Y, et al. Polyphonic sound event detection using temporal-frequency attention and feature space attention[J]. Sensors, 2022, 22(18): 6818. DOI: 10.3390/s22186818.
[56] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[57] MIYAZAKI K, KOMATSU T, HAYASHI T, et al. Weakly-supervised sound event detection with self-attention[C]// 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 66-70. DOI: 10.1109/ICASSP40776.2020.9053609.
[58] GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: an ontology and human-labeled dataset for audio events[C]// 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2017: 776-780. DOI: 10.1109/ICASSP.2017.7952261.
[59] MESAROS A, HEITTOLA T, VIRTANEN T. TUT database for acoustic scene classification and sound event detection[C]// 2016 24th European Signal Processing Conference (EUSIPCO). Piscataway, NJ: IEEE, 2016: 1128-1132. DOI: 10.1109/EUSIPCO.2016.7760424.
[60] SALAMON J, MACCONNELL D, CARTWRIGHT M, et al. Scaper: a library for soundscape synthesis and augmentation[C]// 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Piscataway, NJ: IEEE, 2017: 344-348. DOI: 10.1109/WASPAA.2017.8170052.
[61] DEKKERS G, LAUWEREINS S, THOEN B, et al. The SINS database for detection of daily activities in a home environment using an acoustic sensor network[C]// Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017). Tampere: Tampere University of Technology, 2017: 32-36.
[62] MIYAZAKI K, KOMATSU T, HAYASHI T, et al. Convolution-augmented transformer for semi-supervised sound event detection[R/OL]. (2020-06-10)[2022-08-30]. https://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Miyazaki_108.pdf.
[63] KÜÇÜKBAY S E, YAZICI A, KALKAN S. Hand-crafted versus learned representations for audio event detection[J]. Multimedia Tools and Applications, 2022, 81(21): 30911-30930. DOI: 10.1007/s11042-022-12873-5.