Journal of Guangxi Normal University (Natural Science Edition) ›› 2024, Vol. 42 ›› Issue (4): 11-21. DOI: 10.16088/j.issn.1001-6600.2023111303
GAO Shengxiang1,2,3*, YANG Yuanzhang1,2, WANG Linqin1,2, MO Shangbin1,2, YU Zhengtao1,2,3, DONG Ling1,2,3
Abstract: Personalized speech synthesis aims to synthesize speech in the timbre of a specific speaker. When synthesizing speech for out-of-domain speakers, traditional methods show a clear timbre gap from real speech, and disentangling speaker characteristics remains difficult. This paper proposes a multi-level disentanglement method for personalized speech synthesis targeting adaptation to out-of-domain speakers unseen during training; by fusing features at different granularities, it effectively improves synthesis performance for out-of-domain speakers under zero-resource conditions. The method uses fast Fourier convolution to extract global speaker features, improving the model's generalization to out-of-domain speakers and achieving sentence-level speaker disentanglement; it leverages a speech recognition model to disentangle phoneme-level speaker features and captures phoneme-level timbre features through an attention mechanism, achieving phoneme-level speaker disentanglement. Experimental results show that on the public AISHELL3 dataset, the method reaches 0.697 on the objective metric of speaker-embedding cosine similarity for out-of-domain speakers, a 6.25% improvement over the baseline model, effectively strengthening the modeling of out-of-domain speakers' timbre characteristics.
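To make the global-feature step concrete, below is a minimal PyTorch sketch of the spectral branch of a fast Fourier convolution. The module name, shapes, and the pooling step are illustrative assumptions, not the paper's implementation; the point is that a pointwise convolution applied after an FFT mixes information across the entire utterance, which is what gives the extractor a sentence-level receptive field.

```python
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    """Spectral (global) branch of a fast Fourier convolution:
    a pointwise convolution in the frequency domain lets every
    output frame see the whole utterance at once."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.conv = nn.Conv1d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. frame-level acoustic features
        n = x.size(-1)
        spec = torch.fft.rfft(x, dim=-1)                 # complex, (B, C, T//2+1)
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (B, 2C, T//2+1)
        spec = self.act(self.conv(spec))                 # global channel mixing
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft(torch.complex(real, imag), n=n, dim=-1)

# A sentence-level speaker vector could then be obtained by pooling over time,
# e.g. embedding = SpectralBranch(80)(mel).mean(dim=-1)   # (B, 80)
```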
CLC number: TN912.33
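For the objective evaluation, the abstract reports speaker-embedding cosine similarity between synthesized and reference speech. A minimal sketch of that metric follows, assuming some pretrained speaker-verification encoder; the `encoder` callable below is a placeholder, not a specific toolkit API.

```python
import torch
import torch.nn.functional as F

def secs(encoder, synthesized: torch.Tensor, reference: torch.Tensor) -> float:
    """Cosine similarity between the speaker embeddings of a synthesized
    utterance and a real reference utterance; values closer to 1 mean
    the synthesized timbre is closer to the target speaker."""
    with torch.no_grad():
        e_syn = encoder(synthesized)  # (dim,) embedding of synthesized speech
        e_ref = encoder(reference)    # (dim,) embedding of reference speech
    return F.cosine_similarity(e_syn, e_ref, dim=0).item()
```

Averaging this score over the out-of-domain test pairs is what yields a single figure such as the 0.697 reported above.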
|