Journal of Guangxi Normal University (Natural Science Edition) ›› 2024, Vol. 42 ›› Issue (4): 11-21. DOI: 10.16088/j.issn.1001-6600.2023111303

• CCIR2023 •

Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speaker Adaptation Scenarios

GAO Shengxiang1,2,3*, YANG Yuanzhang1,2, WANG Linqin1,2, MO Shangbin1,2, YU Zhengtao1,2,3, DONG Ling1,2,3

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China;
    2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming 650500, Yunnan, China;
    3. Yunnan Key Laboratory of Media Convergence (Yunnan Daily Press Group), Kunming 650228, Yunnan, China
  • Received: 2023-11-13  Revised: 2024-01-06  Online: 2024-07-25  Published: 2024-09-05
  • Corresponding author: GAO Shengxiang (b. 1977), female, from Eryuan, Yunnan; Ph.D., associate professor at Kunming University of Science and Technology. E-mail: gaoshengxiang.yn@foxmail.com
  • Supported by: the National Natural Science Foundation of China (62376111, U23A20388, 61972186, U21B2027); the Yunnan High-tech Industry Development Project (201606); the Yunnan Fundamental Research Program (202001AS070014); the Yunnan Science and Technology Talent and Platform Program (202105AC160018); the Open Fund of the Yunnan Key Laboratory of Media Convergence (220225702); and the Yunnan Key Research and Development Program (202303AP140008, 202103AA080015)


Abstract: Personalized speech synthesis aims to generate speech that carries a specific speaker's timbre. When synthesizing speech for out-of-domain speakers unseen during training, traditional approaches exhibit noticeable timbre disparities from the real voice, and disentangling speaker features remains difficult. This paper proposes a multi-level disentangled personalized speech synthesis method for such out-of-domain speaker adaptation scenarios. By fusing features at different granularities, the method effectively improves synthesis performance for out-of-domain speakers under zero-resource conditions. Specifically, fast Fourier convolution is used to extract global speaker features, improving generalization to out-of-domain speakers and achieving sentence-level speaker disentanglement; a speech recognition model is then leveraged to disentangle speaker features at the phoneme level, with an attention mechanism capturing phoneme-level timbre features, achieving phoneme-level speaker disentanglement. Experimental results on the public AISHELL-3 dataset show that the proposed method achieves a speaker-embedding cosine similarity of 0.697 for out-of-domain speakers, a 6.25% improvement over the baseline model, demonstrating a stronger ability to model the timbre of out-of-domain speakers.
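To make the two disentanglement levels concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: a fast-Fourier-convolution (FFC) block whose frequency-domain convolution gives a global receptive field, pooled into a sentence-level speaker embedding, and an attention module in which phoneme encodings (assumed here to come from an ASR bottleneck, as the abstract describes) query frame-level reference features for phoneme-level timbre. All dimensions (80 mel bins, 256-dim features, 4 heads) are illustrative assumptions.

import torch
import torch.nn as nn


class SpectralBlock(nn.Module):
    """FFC-style spectral branch: FFT over time -> 1x1 conv -> inverse FFT."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.conv = nn.Conv1d(2 * channels, 2 * channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(2 * channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        t = x.size(-1)
        f = torch.fft.rfft(x, dim=-1)                     # complex, (B, C, T//2+1)
        f = torch.cat([f.real, f.imag], dim=1)            # (B, 2C, T//2+1)
        f = self.act(self.norm(self.conv(f)))             # mix channels in frequency domain
        real, imag = f.chunk(2, dim=1)
        return torch.fft.irfft(torch.complex(real, imag), n=t, dim=-1)


class GlobalSpeakerEncoder(nn.Module):
    """Sentence-level branch: FFC features pooled over time into one embedding."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.ffc = SpectralBlock(d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # mel: (B, n_mels, T)
        h = self.pre(mel)
        h = h + self.ffc(h)              # residual FFC: global receptive field
        return h.mean(dim=-1)            # (B, d_model), utterance-level embedding


class PhonemeTimbreAttention(nn.Module):
    """Phoneme-level branch: phoneme encodings (queries) attend over
    frame-level reference features (keys/values) to pick up local timbre."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phon: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # phon: (B, L_phonemes, D); ref: (B, T_frames, D)
        timbre, _ = self.attn(phon, ref, ref)
        return phon + timbre             # add phoneme-level timbre residually


# Toy usage: random tensors stand in for real features.
mel = torch.randn(2, 80, 200)            # two utterances, 200 mel frames each
phon = torch.randn(2, 30, 256)           # 30 phoneme encodings per utterance
ref = torch.randn(2, 200, 256)           # frame-level reference features

global_emb = GlobalSpeakerEncoder()(mel)          # (2, 256)
phon_cond = PhonemeTimbreAttention()(phon, ref)   # (2, 30, 256)
print(global_emb.shape, phon_cond.shape)

In the described architecture, the sentence-level embedding and the phoneme-level conditioned encodings would together condition the synthesis decoder; the abstract does not specify how the two are fused, so the sketch stops at producing the two representations.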

Key words: speech synthesis, zero-resource, speaker representation, out-of-domain speaker, feature disentanglement

CLC number: TN912.33
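For the reported objective metric, the following is a hedged sketch of how speaker-embedding cosine similarity is typically computed; the embedding extractor (e.g. a pretrained x-vector speaker-verification model) is assumed and stubbed with random vectors here.

import torch
import torch.nn.functional as F


def speaker_similarity(emb_synth: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return F.cosine_similarity(emb_synth, emb_ref, dim=-1).item()


# Random vectors stand in for embeddings extracted from synthesized and
# ground-truth speech by an assumed pretrained speaker-verification model.
e_synth, e_ref = torch.randn(256), torch.randn(256)
print(f"speaker cosine similarity: {speaker_similarity(e_synth, e_ref):.3f}")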
