Factorized WaveNet for voice conversion with limited data

Speech Communication - Volume 130 - Pages 45-54 - 2021
Hongqiang Du1,2, Xiaohai Tian2, Lei Xie1, Haizhou Li2
1Audio, Speech and Language Processing Laboratory, School of Computer Science, Northwestern Polytechnical University, China
2Department of Electrical and Computer Engineering, National University of Singapore, Singapore

References

Adiga, 2018, On the use of WaveNet as a statistical vocoder, 5674
Adriana, R., Nicolas, B., Ebrahimi, K.S., Antoine, C., Carlo, G., Yoshua, B., 2015. FitNets: Hints for thin deep nets. In: Proc. ICLR.
Augasta, 2013, Pruning algorithms of neural networks—a comparative study, Open Comput. Sci., 3, 105, 10.2478/s13537-013-0109-x
Ba, 2014, Do deep nets really need to be deep?, 2654
Cheng, 2018, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Signal Process. Mag., 35, 126, 10.1109/MSP.2017.2765695
Çişman, 2017, Sparse representation of phonetic features for voice conversion with and without parallel data, 677
Du, 2019, WaveNet factorization with singular value decomposition for voice conversion, 152
Du, 2020, Effective WaveNet adaptation for voice conversion with limited data, 7779
Engel, 2017, Neural audio synthesis of musical notes with WaveNet autoencoders, 1068
Ezzine, 2017, A comparative study of voice conversion techniques: A review, 1
Fan, 2015, Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis, 4475
Gibiansky, 2017, Deep Voice 2: Multi-speaker neural text-to-speech, 2962
Han, 2015, Learning both weights and connections for efficient neural network, 1135
Hinton, 2015, Distilling the knowledge in a neural network
Kalchbrenner, 2018, Efficient neural audio synthesis, 2410
Kobayashi, 2017, Statistical voice conversion with WaveNet-based waveform generation, 1138
Kominek, 2004, The CMU Arctic speech databases
Krizhevsky, 2012, ImageNet classification with deep convolutional neural networks, 1097
Lee, 2006, MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training
Liu, 2018, WaveNet vocoder with limited training data for voice conversion, 1983
Lu, 2019, One-shot voice conversion with global speaker embeddings, 669
Lu, 2019, A compact framework for voice conversion using WaveNet conditioned on phonetic posteriorgrams, 6810
Machado, A.F., Queiroz, M., 2010. Voice conversion: A critical survey. In: Proc. Sound and Music Computing (SMC). pp. 1–8.
Manzelli, 2018, Conditioning deep generative raw audio models for structured automatic music
Mohammadi, 2017, An overview of voice conversion systems, Speech Commun., 88, 65, 10.1016/j.specom.2017.01.008
Molchanov, 2017, Pruning convolutional neural networks for resource efficient inference
Mor, 2018, A universal music translation network
Morise, 2016, WORLD: A vocoder-based high-quality speech synthesis system for real-time applications, IEICE Trans. Inf. Syst., 99, 1877, 10.1587/transinf.2015EDP7457
Niwa, 2018, Statistical voice conversion based on WaveNet, 5289
Paine, 2016
Paul, 1992, The design for the Wall Street Journal-based CSR corpus, 357
Povey, 2018, Semi-orthogonal low-rank matrix factorization for deep neural networks, 3743
Prabhavalkar, 2016, On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition, 5970
Prenger, 2019, WaveGlow: A flow-based generative network for speech synthesis, 3617
Sainath, 2013, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, 6655
Shen, 2018, Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, 4779
Sisman, 2018, A voice conversion framework with tandem feature sparse representation and speaker-adapted WaveNet vocoder, 1978
Sun, 2016, Phonetic posteriorgrams for many-to-one voice conversion without parallel data training, 1
Tamamori, 2017, Speaker-dependent WaveNet vocoder, 1118
Tian, 2019, A speaker-dependent WaveNet for voice conversion with non-parallel data, 201
Tian, 2018, Average modeling approach to voice conversion with non-parallel data, 227
Tobing, P.L., Wu, Y.-C., Toda, T., 2020. Baseline system of Voice Conversion Challenge 2020 with cyclic variational autoencoder and Parallel WaveGAN. In: Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. pp. 155–159.
Toda, 2007, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory, IEEE Trans. Audio Speech Lang. Process., 15, 2222, 10.1109/TASL.2007.907344
Toda, 2006, Eigenvoice conversion based on Gaussian mixture model, 2446
Tucker, 2016, Model compression applied to small-footprint keyword spotting, 1878
Valin, 2019, LPCNet: Improving neural speech synthesis through linear prediction, 5891
van den Oord, 2016, WaveNet: A generative model for raw audio, 125
Veaux, 2017
Wu, 2015, A study of speaker adaptation for DNN-based speech synthesis
Wu, 2016, On the use of i-vectors and average voice model for voice conversion without parallel data, 1
Xue, 2013, Restructuring of deep neural network acoustic models with singular value decomposition, 2365
Xue, 2014, Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network, 6359
Yamagishi, 2007, Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training, IEICE Trans. Inf. Syst., 90, 533, 10.1093/ietisy/e90-d.2.533
Yamagishi, 2007, Model adaptation approach to speech synthesis with diverse voices and styles, 4, IV
Yamamoto, 2020, Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram, 6199
Yi, Z., Huang, W.-C., Tian, X., Yamagishi, J., Das, R.K., Kinnunen, T., Ling, Z., Toda, T., 2020. Voice Conversion Challenge 2020—Intra-lingual semi-parallel and cross-lingual voice conversion. In: Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. pp. 80–98.
Yu, 2016, Multi-scale context aggregation by dilated convolutions
Yu, 2011, Improved bottleneck features using pretrained deep neural networks