Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Speech Communication, Volume 126, Pages 35-43, 2021
Katsuki Inoue¹, Sunao Hara¹, Masanobu Abe¹, Nobukatsu Hojo², Yusuke Ijima²
¹Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University, Japan
²NTT Corporation, Japan
