广西师范大学学报(自然科学版) ›› 2023, Vol. 41 ›› Issue (2): 1-18.doi: 10.16088/j.issn.1001-6600.2022083002

• 综述 •

声音事件检测综述

杨烁祯1, 张珑1*, 王建华2, 张恒远1   

  1. 天津师范大学 计算机与信息工程学院,天津 300387;
    2.广州华立科技职业学院 计算机信息工程学院,广东 广州 511325
  • 收稿日期:2022-08-30 修回日期:2022-10-25 出版日期:2023-03-25 发布日期:2023-04-25
  • 通讯作者: 张珑(1978—),男,江苏邳州人,天津师范大学教授,博士。E-mail:zhanglong@tjnu.edu.cn
  • 基金资助:
    国家自然科学基金(61771173);天津市自然科学基金(20JCZDJC00400)

Review of Sound Event Detection

YANG Shuozhen1, ZHANG Long1*, WANG Jianhua2, ZHANG Hengyuan1   

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China;
    2. College of Computer Information Engineering, Guangzhou Huali Vocational College of Science and Technology, Guangzhou, Guangdong 511325, China
  • Received:2022-08-30 Revised:2022-10-25 Online:2023-03-25 Published:2023-04-25

摘要: 声音事件检测技术能够识别出一个音频段中存在的事件类别并标注出各事件的起止时间,在智能城市、医疗监控、野生动物保护等应用场景有巨大潜力,是机器听觉领域的一个重要研究课题。本文从监督学习和半监督学习2个方面对声音事件检测方法进行综述,汇总和分析现有研究中使用的特征、检测模型及其性能。对于监督学习,重点介绍机器学习方法和深度学习方法。对于半监督学习,总结基于均值教师、协同训练、多尺度卷积和注意力机制等4种有效方法。最后,介绍常用数据集和评价指标,并讨论未来可能的研究方向,包括声音分离预处理、合成数据和真实数据域适应、自注意力模型优化、特征选择和融合、流式系统建模等问题。
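As an illustration of the detection output described in the abstract (event class plus onset and offset time), the following minimal sketch shows one common way to threshold frame-level class posteriors and merge contiguous active frames into events. The class list, frame hop and threshold are hypothetical placeholders, not values taken from any surveyed system.

```python
# Illustrative sketch (not from the paper): decode frame-level SED posteriors
# into (onset, offset, class) events by thresholding and merging active frames.
import numpy as np

CLASSES = ["speech", "dog", "alarm"]   # hypothetical event classes
HOP_SECONDS = 0.02                     # assumed frame hop (20 ms)
THRESHOLD = 0.5                        # assumed activation threshold

def decode_events(posteriors: np.ndarray):
    """posteriors: array of shape (num_frames, num_classes) with values in [0, 1]."""
    events = []
    active = posteriors >= THRESHOLD                   # binarize each frame
    for c, name in enumerate(CLASSES):
        onset = None
        for t, is_active in enumerate(active[:, c]):
            if is_active and onset is None:
                onset = t                              # event starts
            elif not is_active and onset is not None:
                events.append((onset * HOP_SECONDS, t * HOP_SECONDS, name))
                onset = None                           # event ends
        if onset is not None:                          # event runs to the clip end
            events.append((onset * HOP_SECONDS, len(active) * HOP_SECONDS, name))
    return events

# Toy usage: 200 frames of random scores for 3 classes
print(decode_events(np.random.rand(200, len(CLASSES))))
```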

关键词: 声音事件检测, 机器学习, 深度学习, 神经网络, 监督学习, 半监督学习

Abstract: Sound event detection technology can identify the types of events present in an audio segment and mark the onset and offset time of each event. It has great potential in application scenarios such as smart cities, medical monitoring, and wildlife protection, and is an important research topic in the field of machine hearing. This paper reviews sound event detection methods from the two aspects of supervised learning and semi-supervised learning, and summarizes and analyzes the features, detection models and their performance used in existing research. For supervised learning, machine learning methods and deep learning methods are highlighted. For semi-supervised learning, four effective methods based on the mean teacher, co-training, multi-scale convolution and attention mechanisms are summarized. Finally, common datasets and evaluation metrics are introduced, and possible future research directions are discussed, including sound separation preprocessing, domain adaptation between synthetic and real data, self-attention model optimization, feature selection and fusion, and streaming system modeling.
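To make the mean-teacher approach mentioned above concrete, the sketch below outlines the exponential-moving-average teacher update and consistency loss in the spirit of reference [40]. It assumes PyTorch; the toy linear model, loss weight and tensor shapes are placeholder assumptions standing in for the CRNN backbones and data pipelines of the surveyed systems, not any author's implementation.

```python
# Minimal mean-teacher sketch (cf. reference [40]); model and data are toy placeholders.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def train_step(student, teacher, optimizer, labeled, unlabeled, cons_weight=1.0):
    x_l, y_l = labeled            # labeled clips and multi-label targets
    x_u = unlabeled               # unlabeled clips (real systems perturb the two views differently)
    # Supervised loss on labeled data
    sup_loss = F.binary_cross_entropy_with_logits(student(x_l), y_l)
    # Consistency loss: student predictions should match the frozen teacher on unlabeled data
    with torch.no_grad():
        teacher_prob = torch.sigmoid(teacher(x_u))
    cons_loss = F.mse_loss(torch.sigmoid(student(x_u)), teacher_prob)
    loss = sup_loss + cons_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher tracks the student after each step
    return loss.item()

# Usage sketch with a linear layer standing in for a frame-level CRNN
student = torch.nn.Linear(64, 10)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
train_step(student, teacher, optimizer,
           (torch.randn(8, 64), torch.randint(0, 2, (8, 10)).float()),
           torch.randn(8, 64))
```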

Key words: sound event detection, machine learning, deep learning, neural networks, supervised learning, semi-supervised learning

中图分类号: TN912.3
[1] TAN E L, KARNAPI F A, NG L J, et al. Extracting urban sound information for residential areas in smart cities using an end-to-end IoT system[J]. IEEE Internet of Things Journal, 2021, 8(18): 14308-14321. DOI: 10.1109/JIOT.2021.3068755.
[2] PANDYA S, GHAYVAT H. Ambient acoustic event assistive framework for identification, detection, and recognition of unknown acoustic events of a residence[J]. Advanced Engineering Informatics, 2021, 47: 101238. DOI: 10.1016/j.aei.2020.101238.
[3] 李玲俐. 家庭保健监测系统中环境声音事件的识别[J]. 重庆师范大学学报(自然科学版), 2016, 33(4): 118-122.
[4] 张丽君. 公共场所异常声音识别算法设计与研究[D]. 重庆: 重庆大学, 2017.
[5] ARSLAN Y, CANBOLAT H. Sound based alarming based video surveillance system design[J]. Multimedia Tools and Applications, 2022, 81(6): 7969-7991. DOI: 10.1007/s11042-022-12028-6.
[6] MOUAWAD P, DUBNOV T, DUBNOV S. Robust detection of COVID-19 in cough sounds[J]. SN Computer Science, 2021, 2(1): 34. DOI: 10.1007/s42979-020-00422-6.
[7] 苏映新. 自适应粒子群优化匹配追踪声音事件识别算法[J]. 激光与光电子学进展, 2020, 57(10): 101502. DOI: 10.3788/LOP57.101502.
[8] TANG T T, LIANG Y H, LONG Y H. Two improved architectures based on prototype network for few-shot bioacoustic event detection[R/OL]. (2021-06-10)[2022-08-30]. https://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Tang_54_task5.pdf.
[9] HEITTOLA T, MESAROS A, VIRTANEN T, et al. Supervised model training for overlapping sound events based on unsupervised source separation[C]// 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 8677-8681. DOI: 10.1109/ICASSP.2013.6639360.
[10] DE BENITO-GORRON D, SEGOVIA S, RAMOS D, et al. Multiple feature resolutions for different polyphonic sound detection score scenarios in DCASE 2021 Task 4[C]// Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events 2021 (DCASE 2021). Barcelona: DCASE, 2021: 65-69. DOI: 10.5281/zenodo.5770113.
[11] 石自强, 韩纪庆, 郑铁然. 鲁棒声学事件检测综述[J]. 智能计算机与应用, 2012, 2(6): 31-35.
[12] DANG A, VU T H, WANG J C. A survey of deep learning for polyphonic sound event detection[C]// 2017 International Conference on Orange Technologies (ICOT). Piscataway, NJ: IEEE, 2017: 75-78. DOI: 10.1109/ICOT.2017.8336092.
[13] XIA X J, TOGNERI R, SOHEL F, et al. A survey: neural network-based deep learning for acoustic event detection[J]. Circuits, Systems, and Signal Processing, 2019, 38(8): 3433-3453. DOI: 10.1007/s00034-019-01094-1.
[14] EDDY S R. What is a hidden Markov model?[J]. Nature Biotechnology, 2004, 22(10): 1315-1316. DOI: 10.1038/nbt1004-1315.
[15] REYNOLDS D. Gaussian mixture models[M]// LI S Z, JAIN A K. Encyclopedia of Biometrics. Boston, MA: Springer, 2009: 659-663. DOI: 10.1007/978-0-387-73003-5_196.
[16] XIANG Y, SHI L M, HØJVANG J L, et al. A speech enhancement algorithm based on a non-negative hidden Markov model and Kullback-Leibler divergence[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2022, 2022: 22. DOI: 10.1186/s13636-022-00256-5.
[17] MESAROS A, HEITTOLA T, ERONEN A, et al. Acoustic event detection in real life recordings[C]// 2010 18th European Signal Processing Conference. Piscataway, NJ: IEEE, 2010: 1267-1271.
[18] MAHMOOD A, KÖSE U. Speech recognition based on convolutional neural networks and MFCC algorithm[J]. Advances in Artificial Intelligence Research, 2021, 1(1): 6-12.
[19] FORNEY G D. The Viterbi algorithm[J]. Proceedings of the IEEE, 1973, 61(3): 268-278. DOI: 10.1109/PROC.1973.9030.
[20] HEITTOLA T, MESAROS A, ERONEN A, et al. Context-dependent sound event detection[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2013, 2013: 1. DOI: 10.1186/1687-4722-2013-1.
[21] ERONEN A J, PELTONEN V T, TUOMI J T, et al. Audio-based context recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(1): 321-329. DOI: 10.1109/TSA.2005.854103.
[22] RYYNANEN M P, KLAPURI A. Polyphonic music transcription using note event modeling[C]// IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005. Piscataway, NJ: IEEE, 2005: 319-322. DOI: 10.1109/ASPAA.2005.1540233.
[23] 徐慧敏. 非负矩阵分解算法及应用研究[D]. 无锡:江南大学, 2020. DOI: 10.27169/d.cnki.gwqgu.2020.000755.
[24] HEITTOLA T, MESAROS A, VIRTANEN T, et al. Sound event detection in multisource environments using source separation[C]// First International Workshop on Machine Listening in Multisource Environments (CHiME 2011). Florence: CHiME, 2011: 36-40.
[25] CAKIR E, HEITTOLA T, HUTTUNEN H, et al. Polyphonic sound event detection using multi label deep neural networks[C]// 2015 International Joint Conference on Neural Networks (IJCNN). Piscataway, NJ: IEEE, 2015: 1-7. DOI: 10.1109/IJCNN.2015.7280624.
[26] 李先苦. 基于深度学习的声学场景分类与声音事件检测[D]. 广州:华南理工大学, 2019. DOI: 10.27151/d.cnki.ghnlu.2019.001370.
[27] 杨利平, 郝峻永, 辜小花, 等. 音频标记一致性约束 CRNN 声音事件检测[J]. 电子与信息学报, 2022, 44(3): 1102-1110. DOI: 10.11999/JEIT210131.
[28] HEITTOLA T, MESAROS A, ERONEN A, et al. Audio context recognition using audio event histograms[C]// 2010 18th European Signal Processing Conference. Piscataway, NJ: IEEE, 2010: 1272-1276.
[29] PARASCANDOLO G, HUTTUNEN H, VIRTANEN T. Recurrent neural networks for polyphonic sound event detection in real life recordings[C]// 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). Piscataway, NJ: IEEE, 2016: 6440-6444. DOI: 10.1109/ICASSP.2016.7472917.
[30] XIA X J, TOGNERI R, SOHEL F, et al. Confidence based acoustic event detection[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2018: 306-310. DOI: 10.1109/ICASSP.2018.8461845.
[31] VESPERINI F, GABRIELLI L, PRINCIPI E, et al. Polyphonic sound event detection by using capsule neural networks[J]. IEEE Journal of Selected Topics in Signal Processing, 2019, 13(2): 310-322. DOI: 10.1109/JSTSP.2019.2902305.
[32] SABOUR S, FROSST N, HINTON G E. Dynamic routing between capsules[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 3859-3869.
[33] 杨巨成, 韩书杰, 毛磊, 等. 胶囊网络模型综述[J]. 山东大学学报(工学版), 2019, 49(6): 1-10.
[34] 刘亚明. 基于深层神经网络的多声音事件检测方法研究[D]. 合肥:中国科学技术大学, 2019.
[35] CAKIR E, PARASCANDOLO G, HEITTOLA T, et al. Convolutional recurrent neural networks for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(6): 1291-1303. DOI: 10.1109/TASLP.2017.2690575.
[36] WANG Y B, ZHAO G H, XIONG K, et al. Multi-scale and single-scale fully convolutional networks for sound event detection[J]. Neurocomputing, 2021, 421: 51-65. DOI: 10.1016/j.neucom.2020.09.038.
[37] WANG Y B, ZHAO G H, XIONG K, et al. MSFF-Net: Multi-scale feature fusing networks with dilated mixed convolution and cascaded parallel framework for sound event detection[J]. Digital Signal Processing, 2022, 122: 103319. DOI: 10.1016/j.dsp.2021.103319.
[38] XIA X J, TOGNERI R, SOHEL F, et al. Auxiliary classifier generative adversarial network with soft labels in imbalanced acoustic event detection[J]. IEEE Transactions on Multimedia, 2018, 21(6): 1359-1371. DOI: 10.1109/TMM.2018.2879750.
[39] CRESWELL A, WHITE T, DUMOULIN V, et al. Generative adversarial networks: an overview[J]. IEEE Signal Processing Magazine, 2018, 35(1): 53-65. DOI: 10.1109/MSP.2017.2765202.
[40] TARVAINEN A, VALPOLA H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 1195-1204.
[41] NING X, WANG X R, XU S H, et al. A review of research on co-training[J/OL]. Concurrency and Computation: Practice and Experience, 2021: e6276[2022-08-30]. https://onlinelibrary.wiley.com/doi/10.1002/cpe.6276.
[42] MABUDE K, MALELA-MAJIKA J C, CASTAGLIOLA P, et al. Generally weighted moving average monitoring schemes: overview and perspectives[J]. Quality and Reliability Engineering International, 2021, 37(2): 409-432. DOI: 10.1002/qre.2765.
[43] KIM N K, KIM H K. Polyphonic sound event detection based on residual convolutional recurrent neural network with semi-supervised loss function[J]. IEEE Access, 2021, 9: 7564-7575. DOI: 10.1109/ACCESS.2020.3048675.
[44] LIU Y Z, CHEN H T, ZHAO Q W, et al. Master-Teacher-Student: a weakly labelled semi-supervised framework for audio tagging and sound event detection[J]. IEICE Transactions on Information and Systems, 2022, 105(4): 828-831. DOI: 10.1587/transinf.2021EDL8082.
[45] ZHENG X, FU C, XIE H Y, et al. Uncertainty-aware deep co-training for semi-supervised medical image segmentation[J]. Computers in Biology and Medicine, 2022, 149: 106051. DOI: 10.1016/j.compbiomed.2022.106051.
[46] ZHENG X, SONG Y, DAI L R, et al. An effective mutual mean teaching based domain adaptation method for sound event detection[C]// Proceedings of Interspeech 2021. Baixas: International Speech Communication Association, 2021: 556-560. DOI: 10.21437/Interspeech.2021-281.
[47] DONG S S, LIU C. Sentiment classification for financial texts based on deep learning[J]. Computational Intelligence and Neuroscience, 2021, 2021: 9524705. DOI: 10.1155/2021/9524705.
[48] FARAHANI A, VOGHOEI S, RASHEED K, et al. A brief review of domain adaptation[C]// Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020. Cham: Springer Nature Switzerland AG, 2021: 877-894. DOI: 10.1007/978-3-030-71704-9_65.
[49] IMOTO K, MISHIMA S, ARAI Y, et al. Impact of data imbalance caused by inactive frames and difference in sound duration on sound event detection performance[J]. Applied Acoustics, 2022, 196: 108882. DOI: 10.1016/j.apacoust.2022.108882.
[50] 郑伟哲, 仇鹏, 韦娟. 弱标签环境下基于多尺度注意力融合的声音识别检测[J]. 计算机科学, 2020, 47(5): 120-123.
[51] KIM S J, CHUNG Y J. Multi-scale features for transformer model to improve the performance of sound event detection[J]. Applied Sciences, 2022, 12(5): 2626. DOI: 10.3390/app12052626.
[52] ZHOU Q, WANG J, LIU J, et al. RSANet: towards real-time object detection with residual semantic-guided attention feature pyramid network[J]. Mobile Networks and Applications, 2021, 26(1): 77-87. DOI: 10.1007/s11036-020-01723-z.
[53] KOH C Y, CHEN Y S, LIU Y W, et al. Sound event detection by consistency training and pseudo-labeling with feature-pyramid convolutional recurrent neural networks[C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 376-380. DOI: 10.1109/ICASSP39728.2021.9414350.
[54] VERMA V, KAWAGUCHI K, LAMB A, et al. Interpolation consistency training for semi-supervised learning[J]. Neural Networks, 2022, 145: 90-106. DOI: 10.1016/j.neunet.2021.10.008.
[55] JIN Y, WANG M, LUO L Y, et al. Polyphonic sound event detection using temporal-frequency attention and feature space attention[J]. Sensors, 2022, 22(18): 6818. DOI: 10.3390/s22186818.
[56] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems 30 (NIPS 2017). Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[57] MIYAZAKI K, KOMATSU T, HAYASHI T, et al. Weakly-supervised sound event detection with self-attention[C]// 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 66-70. DOI: 10.1109/ICASSP40776.2020.9053609.
[58] GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: an ontology and human-labeled dataset for audio events[C]// 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2017: 776-780. DOI: 10.1109/ICASSP.2017.7952261.
[59] MESAROS A, HEITTOLA T, VIRTANEN T. TUT database for acoustic scene classification and sound event detection[C]// 2016 24th European Signal Processing Conference (EUSIPCO). Piscataway, NJ: IEEE, 2016: 1128-1132. DOI: 10.1109/EUSIPCO.2016.7760424.
[60] SALAMON J, MACCONNELL D, CARTWRIGHT M, et al. Scaper: a library for soundscape synthesis and augmentation[C]// 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Piscataway, NJ: IEEE, 2017: 344-348. DOI: 10.1109/WASPAA.2017.8170052.
[61] DEKKERS G, LAUWEREINS S, THOEN B, et al. The SINS database for detection of daily activities in a home environment using an acoustic sensor network[C]// Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017). Tampere: Tampere University of Technology, 2017: 32-36.
[62] MIYAZAKI K, KOMATSU T, HAYASHI T, et al. Convolution-augmented transformer for semi-supervised sound event detection[R/OL]. (2020-06-10)[2022-08-30]. https://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Miyazaki_108.pdf.
[63] KÜÇÜKBAY S E, YAZICI A, KALKAN S. Hand-crafted versus learned representations for audio event detection[J]. Multimedia Tools and Applications, 2022, 81(21): 30911-30930. DOI: 10.1007/s11042-022-12873-5.