Adapt-TTS: High-quality zero-shot multi-speaker text-to-speech adaptive-based for Vietnamese
Tóm tắt
Từ khóa
#Zero-shot TTS #multi-speaker #text-to-speech #diffusion models #mel-spectrogram denoiser #extracting mel-vector #EMV #adapt-TTS.Tài liệu tham khảo
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018, April). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4779-4783). IEEE.
Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2020). Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558.
Kim, J., Kong, J., & Son, J. (2021, July). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning (pp. 5530-5540). PMLR.
Cooper, E., Lai, C. I., Yasuda, Y., Fang, F., Wang, X., Chen, N., & Yamagishi, J. (2020, May). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6184-6188). IEEE.
Wu, Y., Tan, X., Li, B., He, L., Zhao, S., Song, R., ... & Liu, T. Y. (2022). Adaspeech 4: Adaptive text to speech in zero-shot scenarios. arXiv preprint arXiv:2204.00436.
Tits, N., El Haddad, K., & Dutoit, T. (2020). Exploring transfer learning for low resource emotional tts. In Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1 (pp. 52-60). Springer International Publishing.
Xie, Q., Tian, X., Liu, G., Song, K., Xie, L., Wu, Z., ... & Xu, X. (2021, June). The multi-speaker multi-style voice cloning challenge 2021. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8613-8617). IEEE.
Arik, S., Chen, J., Peng, K., Ping, W., & Zhou, Y. (2018). Neural voice cloning with a few samples. Advances in neural information processing systems, 31.
Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C. P., ... & Wu, Q. J. (2022). A review of generalized zero-shot learning methods. IEEE transactions on pattern analysis and machine intelligence.
. Ping, W., Peng, K., Gibiansky, A., Arik, S. Ö., Kannan, A., Narang, S., ... & Miller, J. (2017). Deep Voice 3: 2000-Speaker Neural Text-to-Speech.
Min, D., Lee, D. B., Yang, E., & Hwang, S. J. (2021, July). Meta-stylespeech: Multi-speaker adaptive text-to-speech generation. In International Conference on Machine Learning (pp. 7748-7759). PMLR.
Liu, J., Li, C., Ren, Y., Chen, F., & Zhao, Z. (2022, June). Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11020-11028)
Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural information processing systems, 31.
Cooper, E., Lai, C. I., Yasuda, Y., Fang, F., Wang, X., Chen, N., & Yamagishi, J. (2020, May). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6184-6188). IEEE.
Wang, Y., Stanton, D., Zhang, Y., Ryan, R. S., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018, July). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning (pp. 5180-5189). PMLR.
Choi, S., Han, S., Kim, D., & Ha, S. (2020). Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding. arXiv preprint arXiv:2005.08484.
Casanova, E., Shulby, C., Gölge, E., Müller, N. M., de Oliveira, F. S., Junior, A. C., ... & Ponti, M. A. (2021). Sc-glowtts: an efficient zero-shot multi-speaker text-to-speech model. arXiv preprint arXiv:2104.05557.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.
Nichol, A. Q., & Dhariwal, P. (2021, July). Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (pp. 8162-8171). PMLR
Huang, S. F., Lin, C. J., Liu, D. R., Chen, Y. C., & Lee, H. Y. (2022). Meta-TTS: Meta-learning for few-shot speaker adaptive text-to-speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1558-1571.
Liu, Y., He, L., Liu, J., & Johnson, M. T. (2019). Introducing phonetic information to speaker embedding for speaker verification. EURASIP Journal on Audio, Speech, and Music Processing, 2019, 1-17.
Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., & Khudanpur, S. (2016, December). Deep neural network-based speaker embeddings for end-to-end speaker verification. In 2016 IEEE Spoken Language Technology Workshop (SLT) (pp. 165-170). IEEE.
Kwon, Y., Jung, J. W., Heo, H. S., Kim, Y. J., Lee, B. J., & Chung, J. S. (2021). Adapting speaker embeddings for speaker diarisation. arXiv preprint arXiv:2104.02879.
Wang, Y., Stanton, D., Zhang, Y., Ryan, R. S., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018, July). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning (pp. 5180-5189). PMLR.
Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
Wester, M., Wu, Z., & Yamagishi, J. (2016, September). Analysis of the Voice Conversion Challenge 2016 Evaluation Results. In Interspeech (pp. 1637-1641).
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).
