Adapt-TTS: High-quality zero-shot multi-speaker text-to-speech adaptive-based for Vietnamese

Journal of Computer Science and Cybernetics - Vol. 39, No. 2 - pp. 159-173 - 2023

Phuong Pham Ngoc¹, Chung Tran Quang², Mai Luong Chi³

¹Thai Nguyen University, Tan Thinh Ward, Thai Nguyen City, Thai Nguyen Province, Viet Nam

²Japan Advanced Institute of Science and Technology (JAIST), Japan

³Institute of Information Technology, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet Street, Cau Giay District, Ha Noi, Viet Nam

Abstract

Current adaptive speech synthesis techniques follow two main approaches: 1. fine-tuning the model on a small amount of adaptation data, and 2. conditioning the entire model on a speaker embedding of the target speaker. However, both approaches require the adaptation data to be present during training, which makes generating new voices expensive. In addition, traditional TTS models use a simple loss function to reconstruct the acoustic features; this optimization rests on incorrect distributional assumptions and leads to noisy synthesized audio. To address these problems, we introduce Adapt-TTS, a model that synthesizes high-quality audio from a small adaptation sample without additional training. Key contributions: 1. the Extracting Mel-vector (EMV) architecture, which gives a better representation of speaker characteristics and speech style; 2. an improved zero-shot model with a denoising diffusion component (Mel-spectrogram denoiser), which synthesizes new voices without training and with better quality (less noise). Evaluation results show that the model is effective: given only a single utterance (1-3 seconds) from the reference speaker, the system produces high-quality synthesis and achieves high speaker similarity.
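As a rough illustration of what a mel-spectrogram denoiser is trained against, the sketch below applies the standard forward (noising) step of a denoising diffusion probabilistic model, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, to a toy mel-spectrogram. It is not the Adapt-TTS implementation: the linear noise schedule, the number of steps, the 80-bin toy input, and all function names are assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(mel, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); mel is a list of frames (lists of floats)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in mel]
    noisy = [
        [signal_scale * x + noise_scale * e for x, e in zip(frame, eps)]
        for frame, eps in zip(mel, noise)
    ]
    return noisy, noise


# Toy usage: a hypothetical 4-frame, 80-bin mel-spectrogram noised at the last step.
rng = random.Random(0)
betas = linear_beta_schedule(100)
alpha_bars = cumulative_alphas(betas)
mel = [[0.5] * 80 for _ in range(4)]
noisy, noise = add_noise(mel, len(betas) - 1, alpha_bars, rng)
```

In a typical diffusion setup, a denoiser network would be trained to predict `noise` from `noisy` and the step index, with text and speaker information as conditioning; how Adapt-TTS conditions its denoiser is not specified in this abstract.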

Keywords

#Zero-shot TTS #multi-speaker #text-to-speech #diffusion models #mel-spectrogram denoiser #extracting mel-vector #EMV #adapt-TTS.

