Journal of Guangxi Normal University (Natural Science Edition), 2023, Vol. 41, Issue (2): 1-18. doi: 10.16088/j.issn.1001-6600.2022083002

Review of Sound Event Detection

YANG Shuozhen1, ZHANG Long1*, WANG Jianhua2, ZHANG Hengyuan1   

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China;
  2. College of Computer Information Engineering, Guangzhou Huali Vocational College of Science and Technology, Guangzhou, Guangdong 511325, China
  • Received: 2022-08-30; Revised: 2022-10-25; Online: 2023-03-25; Published: 2023-04-25

Abstract: Sound event detection identifies the types of events in an audio segment and marks the start and end time of each event. It has great potential in application scenarios such as smart cities, medical monitoring, and wildlife protection, and it is an important research topic in the field of machine hearing. This paper first reviews sound event detection methods from two perspectives, supervised learning and semi-supervised learning, and then summarizes and analyzes the features, detection models, and performance reported in existing research. For supervised learning, machine learning methods and deep learning methods are highlighted. For semi-supervised learning, four effective methods are summarized, based on mean teacher, co-training, multi-scale convolution, and attention mechanisms. Finally, common datasets and evaluation metrics are introduced, and possible future research directions are discussed, including voice separation preprocessing, domain adaptation between synthetic and real data, self-attention model optimization, feature selection and fusion, and streaming system modeling.

Key words: sound event detection, machine learning, deep learning, neural networks, supervised learning, semi-supervised learning

CLC Number: TN912.3
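
The task stated in the abstract, identifying which events occur and marking each event's start and end time, can be illustrated with a minimal sketch. It assumes a detector has already produced one activity probability per class per frame; the function name frames_to_events, the hop_seconds and threshold parameters, and the example class labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def frames_to_events(
    probs: Dict[str, List[float]],
    hop_seconds: float = 0.5,   # assumed frame hop, illustrative only
    threshold: float = 0.5,     # assumed decision threshold, illustrative only
) -> List[Tuple[str, float, float]]:
    """Turn per-class frame probabilities into (label, onset, offset) events."""
    events = []
    for label, frame_probs in probs.items():
        start = None  # index of the first frame of the currently open event
        for i, p in enumerate(frame_probs):
            active = p >= threshold
            if active and start is None:
                start = i  # event begins at this frame
            elif not active and start is not None:
                # event ended at the previous frame; close it
                events.append((label, start * hop_seconds, i * hop_seconds))
                start = None
        if start is not None:
            # event is still active when the clip ends
            events.append((label, start * hop_seconds, len(frame_probs) * hop_seconds))
    # sort by onset time, then by label, so overlapping events are listed in order
    return sorted(events, key=lambda e: (e[1], e[0]))


example = {
    "dog_bark": [0.1, 0.8, 0.9, 0.2, 0.1],
    "speech": [0.7, 0.6, 0.1, 0.1, 0.9],
}
for label, onset, offset in frames_to_events(example):
    print(f"{label}: {onset:.1f} s to {offset:.1f} s")
```

Because each class is thresholded independently, events of different classes may overlap in time, which is the polyphonic setting that many of the surveyed systems target. Practical systems typically add smoothing (for example, median filtering) before thresholding; that step is omitted here for brevity.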