Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system
Abstract
References
Bahari, M. H., Saeidi, R., & Van Leeuwen, D. (2013). Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 7344–7348). IEEE. https://doi.org/10.1109/ICASSP.2013.6639089
Bell, P., Swietojanski, P., & Renals, S. (2013). Multi-level adaptive networks in tandem and hybrid ASR systems. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6975–6979). IEEE. https://doi.org/10.1109/ICASSP.2013.6639014
Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113–120. https://doi.org/10.1109/TASSP.1979.1163209
Boll, S., & Pulsipher, D. C. (1980). Suppression of acoustic noise in speech using two microphone adaptive noise cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6), 752–753. https://doi.org/10.1109/TASSP.1980.1163472
Boril, H., & Hansen, J. H. (2009). Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1379–1393. https://doi.org/10.1109/TASL.2009.2034770
Cichocki, A., Unbehauen, R., & Swiniarski, R. W. (1993). Neural networks for optimization and signal processing (Vol. 253). New York: Wiley.
Ellis, D. P., Singh, R., & Sivadas, S. (2001). Tandem acoustic modeling in large-vocabulary recognition. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221) (Vol. 1, pp. 517–520). IEEE. https://doi.org/10.1109/ICASSP.2001.940881
Hansen, J. H. (1994). Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect. IEEE Transactions on Speech and Audio Processing, 2(4), 598–614. https://doi.org/10.1109/89.326618
Hansen, J. H., & Cairns, D. A. (1995). ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments. Speech Communication, 16(4), 391–422. https://doi.org/10.1016/0167-6393(95)00007-B
Hermansky, H., Ellis, D. P., & Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In 2000 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 00CH37100) (Vol. 3, pp. 1635–1638). IEEE. https://doi.org/10.1109/ICASSP.2000.862024
Hirsch, H. G., & Ehrlicher, C. (1995). Noise estimation techniques for robust speech recognition. In 1995 International conference on acoustics, speech, and signal processing (Vol. 1, pp. 153–156). IEEE. https://doi.org/10.1109/ICASSP.1995.479387
Hsu, W. N., Zhang, Y., Weiss, R. J., Chung, Y. A., Wang, Y., Wu, Y., & Glass, J. (2019). Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5901–5905). IEEE. https://doi.org/10.1109/ICASSP.2019.8683561
Huang, J., & Kingsbury, B. (2013). Audio-visual deep learning for noise robust speech recognition. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 7596–7599). IEEE. https://doi.org/10.1109/ICASSP.2013.6639140
Hush, D. R., & Horne, B. G. (1993). Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(1), 8–39. https://doi.org/10.1109/79.180705
Kadyan, V., Mantri, A., Aggarwal, R. K., & Singh, A. (2019). A comparative study of deep neural network based Punjabi-ASR system. International Journal of Speech Technology, 22(1), 111–119. https://doi.org/10.1007/s10772-018-09577-3
Kinnunen, T., Juvela, L., Alku, P., & Yamagishi, J. (2017). Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5535–5539). IEEE. https://doi.org/10.1109/ICASSP.2017.7953215
Lal, P., & King, S. (2013). Cross-lingual automatic speech recognition using tandem features. IEEE Transactions on Audio, Speech, and Language Processing, 21(12), 2506–2515. https://doi.org/10.1109/TASL.2013.2277932
Lippmann, R., Martin, E., & Paul, D. (1987). Multi-style training for robust isolated-word speech recognition. In ICASSP'87. IEEE international conference on acoustics, speech, and signal processing (Vol. 12, pp. 705–708). IEEE. https://doi.org/10.1109/ICASSP.1987.1169544
Lyon, R. (1984). Computational models of neural auditory processing. In ICASSP'84. IEEE international conference on acoustics, speech, and signal processing (Vol. 9, pp. 41–44). IEEE. https://doi.org/10.1109/ICASSP.1984.1172756
Naik, J. M., & Lubensky, D. M. (1994). A hybrid HMM-MLP speaker verification algorithm for telephone speech. In Proceedings of ICASSP'94. IEEE international conference on acoustics, speech and signal processing (Vol. 1, pp. I–153). IEEE. https://doi.org/10.1109/ICASSP.1994.389332
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., ... & Rose, R. C. (2010). Subspace Gaussian mixture models for speech recognition. In 2010 IEEE international conference on acoustics, speech and signal processing (pp. 4330–4333). IEEE. https://doi.org/10.1109/ICASSP.2010.5495662
Ravanelli, M., & Janin, A. (2014). TANDEM-bottleneck feature combination using hierarchical Deep Neural Networks. In The 9th international symposium on Chinese spoken language processing (pp. 113–117). IEEE. https://doi.org/10.1109/ISCSLP.2014.6936576
Saon, G., Tüske, Z., Audhkhasi, K., & Kingsbury, B. (2019). Sequence noise injected training for end-to-end speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6261–6265). IEEE. https://doi.org/10.1109/ICASSP.2019.8683706
Serdyuk, D., Audhkhasi, K., Brakel, P., Ramabhadran, B., Thomas, S., & Bengio, Y. (2016). Invariant representations for noisy speech recognition. arXiv preprint arXiv:1612.01928
Tebelskis, J., & Waibel, A. (1990). Large vocabulary recognition using linked predictive neural networks. In International conference on acoustics, speech, and signal processing (pp. 437–440). IEEE. https://doi.org/10.1109/ICASSP.1990.115742
Tomar, V. S., & Rose, R. C. (2013). A family of discriminative manifold learning algorithms and their application to speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1), 161–171. https://doi.org/10.1109/TASLP.2013.2286906