Journal of Guangxi Normal University (Natural Science Edition), 2024, Vol. 42, Issue 4: 11-21. DOI: 10.16088/j.issn.1001-6600.2023111303


Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speaker Adaptation Scenarios

GAO Shengxiang1,2,3*, YANG Yuanzhang1,2, WANG Linqin1,2, MO Shangbin1,2, YU Zhengtao1,2,3, DONG Ling1,2,3   

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
    2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China;
    3. Yunnan Key Laboratory of Media Convergence (Yunnan Daily Press Group), Kunming Yunnan 650228, China
Received: 2023-11-13; Revised: 2024-01-06; Online: 2024-07-25; Published: 2024-09-05

Abstract: Personalized speech synthesis aims to generate speech that carries a specific speaker's vocal characteristics. Traditional approaches struggle to disentangle speaker-specific timbre features, so speech synthesized for unseen speakers often exhibits noticeable timbre disparities. This paper proposes a multi-level disentangled personalized speech synthesis approach designed for out-of-domain speakers. By fusing features at different granularities, the method improves the synthesis of speech from unseen speakers under zero-resource conditions. Specifically, fast Fourier convolution is used to extract global speaker features, strengthening the model's generalization to unseen speakers and achieving sentence-level speaker disentanglement; in addition, a speech recognition model is leveraged to disentangle speaker features at the phoneme level, with an attention mechanism capturing phoneme-level timbre features. Experimental results on the publicly available AISHELL-3 dataset show that the proposed approach achieves a cosine similarity of 0.697 between speaker feature vectors in cross-speaker adaptation, a 6.25% improvement over the baseline model, demonstrating the method's ability to model the timbre of unseen speakers in cross-speaker adaptation scenarios.
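To make the two levels of disentanglement described above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: an FFC-style sentence-level speaker encoder whose frequency-domain transform gives each frame a global receptive field, a phoneme-level attention module in which phoneme representations attend over frame-level speaker features, and the cosine-similarity metric used for evaluation. All module names, dimensions, and variable names (e.g. FFCSpeakerEncoder, PhonemeTimbreAttention) are hypothetical illustrations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFCSpeakerEncoder(nn.Module):
    """Sentence-level branch (illustrative): a fast-Fourier-convolution-style
    block whose pointwise transform in the frequency domain mixes information
    across the whole utterance, yielding a global speaker embedding."""

    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj_in = nn.Conv1d(n_mels, dim, kernel_size=1)
        # Pointwise channel mixing applied to the real/imag parts of the
        # spectrum; because it acts in the frequency domain, every output
        # frame depends on the entire input sequence.
        self.freq_mix = nn.Conv1d(2 * dim, 2 * dim, kernel_size=1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, mel):                       # mel: (B, n_mels, T)
        x = self.proj_in(mel)                     # (B, dim, T)
        spec = torch.fft.rfft(x, dim=-1)          # complex, (B, dim, T//2+1)
        z = torch.cat([spec.real, spec.imag], 1)  # (B, 2*dim, T//2+1)
        z = F.relu(self.freq_mix(z))
        real, imag = z.chunk(2, dim=1)
        x = torch.fft.irfft(torch.complex(real, imag), n=x.shape[-1], dim=-1)
        return self.proj_out(x.mean(dim=-1))      # sentence-level embedding (B, dim)


class PhonemeTimbreAttention(nn.Module):
    """Phoneme-level branch (illustrative): each phoneme representation
    attends over frame-level speaker features to pick up local timbre."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phonemes, frames):          # (B, P, dim), (B, T, dim)
        out, _ = self.attn(query=phonemes, key=frames, value=frames)
        return out                                # phoneme-level timbre, (B, P, dim)


def speaker_similarity(e_syn, e_ref):
    """Cosine similarity between synthesized and reference speaker
    embeddings: the evaluation metric reported in the abstract."""
    return F.cosine_similarity(e_syn, e_ref, dim=-1).mean()


if __name__ == "__main__":
    enc = FFCSpeakerEncoder()
    mel = torch.randn(2, 80, 200)                 # two dummy mel spectrograms
    emb = enc(mel)                                # (2, 256)
    print(speaker_similarity(emb[0:1], emb[1:2]).item())
```

In this sketch the sentence-level embedding and the phoneme-level timbre features would be fused (e.g. added or concatenated) before conditioning the acoustic model; the abstract does not specify the fusion operator, so it is left out here.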

Key words: speech synthesis, zero-shot, speaker representation, out-of-domain speaker, feature disentanglement

CLC Number: TN912.33