Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
References
An, Shumin, Ling, Zhenhua, Dai, Lirong, 2017. Emotional statistical parametric speech synthesis using LSTM-RNNs. In: Proceedings of APSIPA ASC. pp. 1613–1616.
Caruana, Rich, 1997. Multitask learning. Mach. Learn. 28 (1), 41–75. doi:10.1023/A:1007379606734.
Dehak, Najim, Kenny, Patrick J., Dehak, Réda, Dumouchel, Pierre, Ouellet, Pierre, 2010. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19 (4), 788–798. doi:10.1109/TASL.2010.2064307.
Fan, Yuchen, Qian, Yao, Soong, Frank K., He, Lei, 2015. Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis. In: Proceedings of ICASSP. pp. 4475–4479.
Fan, Yuchen, Qian, Yao, Xie, Feng-Long, Soong, Frank K., 2014. TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Proceedings of INTERSPEECH. pp. 1964–1968.
Hojo, Nobukatsu, Ijima, Yusuke, Mizuno, Hideyuki, 2018. DNN-based speech synthesis using speaker codes. IEICE Trans. Inf. Syst. E101-D (2), 462–472. doi:10.1587/transinf.2017EDP7165.
Inoue, Katsuki, et al., 2017. An investigation to transplant emotional expressions in DNN-based TTS synthesis. p. 1253.
Lorenzo-Trueba, Jaime, et al., 2018. Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis. Speech Commun. 99, 135. doi:10.1016/j.specom.2018.03.002.
Lorenzo-Trueba, Jaime, Barra-Chicote, Roberto, Watts, Oliver, Montero, Juan Manuel, 2013. Towards speaking style transplantation in speech synthesis. In: Proceedings of the 8th ISCA Speech Synthesis Workshop. pp. 159–163.
Kanagawa, Hiroki, Nose, Takashi, Kobayashi, Takao, 2013. Speaker-independent style conversion for HMM-based expressive speech synthesis. In: Proceedings of ICASSP. pp. 7864–7868.
Kawahara, Hideki, Masuda-Katsuse, Ikuyo, de Cheveigné, Alain, 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Commun. 27 (3–4), 187–207. doi:10.1016/S0167-6393(98)00085-5.
Li, Bo, Zen, Heiga, 2016. Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis. In: Proceedings of INTERSPEECH. pp. 2468–2472.
Luong, Hieu-Thi, Takaki, Shinji, Henter, Gustav Eje, Yamagishi, Junichi, 2017. Adapting and controlling DNN-based speech synthesis using input codes. In: Proceedings of ICASSP. pp. 4905–4909.
Ohtani, Yamato, Nasu, Yu, Morita, Masahiro, Akamine, Masami, 2015. Emotional transplant in statistical speech synthesis based on emotion additive model. In: Proceedings of INTERSPEECH. pp. 274–278.
Qian, Yao, Fan, Yuchen, Hu, Wenping, Soong, Frank K., 2014. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. In: Proceedings of ICASSP. pp. 3829–3833.
Silén, Hanna, Helander, Elina, Nurminen, Jani, Gabbouj, Moncef, 2012. Ways to implement global variance in statistical speech synthesis. In: Proceedings of INTERSPEECH. pp. 1436–1439.
Snyder, David, Garcia-Romero, Daniel, Sell, Gregory, Povey, Daniel, Khudanpur, Sanjeev, 2018. X-vectors: Robust DNN embeddings for speaker recognition. In: Proceedings of ICASSP. pp. 5329–5333.
Toda, Tomoki, Tokuda, Keiichi, 2007. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst. E90-D (5), 816–824. doi:10.1093/ietisy/e90-d.5.816.
Tokuda, Keiichi, Yoshimura, Takayoshi, Masuko, Takashi, Kobayashi, Takao, Kitamura, Tadashi, 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of ICASSP. pp. 1315–1318.
Variani, Ehsan, Lei, Xin, McDermott, Erik, Moreno, Ignacio Lopez, Gonzalez-Dominguez, Javier, 2014. Deep neural networks for small footprint text-dependent speaker verification. In: Proceedings of ICASSP. pp. 4052–4056.
Watts, Oliver, Henter, Gustav Eje, Merritt, Thomas, Wu, Zhizheng, King, Simon, 2016. From HMMs to DNNs: where do the improvements come from? In: Proceedings of ICASSP. pp. 5505–5509.
Wu, Zhizheng, Swietojanski, Pawel, Veaux, Christophe, Renals, Steve, King, Simon, 2015. A study of speaker adaptation for DNN-based speech synthesis. In: Proceedings of INTERSPEECH. pp. 879–883.
Wu, Zhizheng, Valentini-Botinhao, Cassia, Watts, Oliver, King, Simon, 2015. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In: Proceedings of ICASSP. pp. 4460–4464.
Yamagishi, Junichi, Kobayashi, Takao, Nakano, Yuji, Ogata, Katsumi, Isogai, Juri, 2009. Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio Speech Lang. Process. 17 (1), 66–83. doi:10.1109/TASL.2008.2006647.
Yamagishi, Junichi, Tamura, Masatsune, Masuko, Takashi, Tokuda, Keiichi, Kobayashi, Takao, 2003. A training method of average voice model for HMM-based speech synthesis. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E86-A (8), 1956.
Yang, Hongwu, Zhang, Weizhao, Zhi, Pengpeng, 2018. A DNN-based emotional speech synthesis by speaker adaptation. In: Proceedings of APSIPA ASC. pp. 633–637.
Young, Steve, et al., 2006. The HTK Book, Vol. 3. Cambridge University Engineering Department, p. 75.
Zen, Heiga, Senior, Andrew, Schuster, Mike, 2013. Statistical parametric speech synthesis using deep neural networks. In: Proceedings of ICASSP. pp. 7962–7966.