Journal of Guangxi Normal University (Natural Science Edition), 2024, Vol. 42, Issue 4: 11-21. DOI: 10.16088/j.issn.1001-6600.2023111303
GAO Shengxiang1,2,3*, YANG Yuanzhang1,2, WANG Linqin1,2, MO Shangbin1,2, YU Zhengtao1,2,3, DONG Ling1,2,3