Journal of Guangxi Normal University (Natural Science Edition) ›› 2024, Vol. 42 ›› Issue (4): 11-21. DOI: 10.16088/j.issn.1001-6600.2023111303
GAO Shengxiang1,2,3*, YANG Yuanzhang1,2, WANG Linqin1,2, MO Shangbin1,2, YU Zhengtao1,2,3, DONG Ling1,2,3
Abstract: Personalized speech synthesis aims to synthesize speech in the timbre of a specific speaker. When synthesizing speech for out-of-domain speakers, traditional methods show a clear timbre gap from real speech, and disentangling speaker characteristics remains difficult. This paper proposes a multi-level disentanglement method for personalized speech synthesis targeting adaptation to out-of-domain speakers unseen during training; by fusing features at different granularities, it effectively improves synthesis performance for out-of-domain speakers under zero-resource conditions. The method uses fast Fourier convolution to extract global speaker features, improving the model's generalization to out-of-domain speakers and achieving sentence-level speaker disentanglement; it leverages a speech recognition model to disentangle phoneme-level speaker features and captures phoneme-level timbre features through an attention mechanism, achieving phoneme-level speaker disentanglement. Experimental results show that on the public AISHELL3 dataset, the method reaches 0.697 on the objective metric of speaker-embedding cosine similarity for out-of-domain speakers, a 6.25% improvement over the baseline model, effectively strengthening the modeling of out-of-domain speakers' timbre characteristics.
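To make the global-feature step concrete, below is a minimal PyTorch sketch of the spectral branch of a fast Fourier convolution. The module name, shapes, and the pooling step are illustrative assumptions, not the paper's implementation; the point is that a pointwise convolution applied after an FFT mixes information across the entire utterance, which is what gives the extractor a sentence-level receptive field.

```python
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    """Spectral (global) branch of a fast Fourier convolution:
    a pointwise convolution in the frequency domain lets every
    output frame see the whole utterance at once."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.conv = nn.Conv1d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. frame-level acoustic features
        n = x.size(-1)
        spec = torch.fft.rfft(x, dim=-1)                 # complex, (B, C, T//2+1)
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (B, 2C, T//2+1)
        spec = self.act(self.conv(spec))                 # global channel mixing
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft(torch.complex(real, imag), n=n, dim=-1)

# A sentence-level speaker vector could then be obtained by pooling over time,
# e.g. embedding = SpectralBranch(80)(mel).mean(dim=-1)   # (B, 80)
```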
CLC number: TN912.33
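For the objective evaluation, the abstract reports speaker-embedding cosine similarity between synthesized and reference speech. A minimal sketch of that metric follows, assuming some pretrained speaker-verification encoder; the `encoder` callable below is a placeholder, not a specific toolkit API.

```python
import torch
import torch.nn.functional as F

def secs(encoder, synthesized: torch.Tensor, reference: torch.Tensor) -> float:
    """Cosine similarity between the speaker embeddings of a synthesized
    utterance and a real reference utterance; values closer to 1 mean
    the synthesized timbre is closer to the target speaker."""
    with torch.no_grad():
        e_syn = encoder(synthesized)  # (dim,) embedding of synthesized speech
        e_ref = encoder(reference)    # (dim,) embedding of reference speech
    return F.cosine_similarity(e_syn, e_ref, dim=0).item()
```

Averaging this score over the out-of-domain test pairs is what yields a single figure such as the 0.697 reported above.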
|