Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system

International Journal of Speech Technology - Tập 24 - Trang 473-481 - 2021
Virender Kadyan1, Shashi Bala2, Puneet Bawa2
1Department of Informatics, School of Computer Science, University of Petroleum & Energy Studies (UPES), Dehradun, India
2Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

Tóm tắt

Processing of low resource pre and post acoustic signals always faced the challenge of data scarcity in its training module. It’s difficult to obtain high system accuracy with limited corpora in train set which results into extraction of large discriminative feature vector. These vectors information are distorted due to acoustic mismatch occurs because of real environment and inter speaker variations. In this paper, context independent information of an input speech signal is pre-processed using bottleneck features and later in modeling phase Tandem-NN model has been employ to enhance system accuracy. Later to fulfill the requirement of train data issues, in-domain training augmentation is perform using fusion of original clean and artificially created modified train noisy data and to further boost this training data, tempo modification of input speech signal is perform with maintenance of its spectral envelope and pitch in corresponding input audio signal. Experimental result shows that a relative improvement of 13.53% is achieved in clean and 32.43% in noisy conditions with Tandem-NN system in comparison to that of baseline system respectively.

Tài liệu tham khảo

Bahari, M. H., Saeidi, R., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 7344–7348). IEEE. https://doi.org/10.1109/ICASSP.2013.6639089

Bell, P., Swietojanski, P., & Renals, S. (2013). Multi-level adaptive networks in tandem and hybrid ASR systems. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6975–6979). IEEE. https://doi.org/10.1109/ICASSP.2013.6639014

Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113–120. https://doi.org/10.1109/TASSP.1979.1163209.

Boll, S., & Pulsipher, D. C. (1980). Suppression of acoustic noise in speech using two microphone adaptive noise cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6), 752–753. https://doi.org/10.1109/TASSP.1980.1163472.

Boril, H., & Hansen, J. H. (2009). Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1379–1393. https://doi.org/10.1109/TASL.2009.2034770.

Cichocki, A., Unbehauen, R., & Swiniarski, R. W. (1993). Neural networks for optimization and signal processing (Vol. 253). New York: Wiley.

Ellis, D. P., Singh, R., & Sivadas, S. (2001). Tandem acoustic modeling in large-vocabulary recognition. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221) (Vol. 1, pp. 517–520). IEEE. https://doi.org/10.1109/ICASSP.2001.940881

Hansen, J. H. (1994). Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect. IEEE Transactions on Speech and Audio Processing, 2(4), 598–614. https://doi.org/10.1109/89.326618.

Hansen, J. H., & Cairns, D. A. (1995). Icarus: Source generator based real-time recognition of speech in noisy stressful and lombard effect environments. Speech Communication, 16(4), 391–422. https://doi.org/10.1016/0167-6393(95)00007-B.

Hermansky, H., Ellis, D. P., & Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In 2000 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 00CH37100) (Vol. 3, pp. 1635–1638). IEEE. https://doi.org/10.1109/ICASSP.2000.862024

Hirsch, H. G., & Ehrlicher, C. (1995). Noise estimation techniques for robust speech recognition. In 1995 International conference on acoustics, speech, and signal processing (Vol. 1, pp. 153–156). IEEE. https://doi.org/10.1109/ICASSP.1995.479387

Hsu, W. N., Zhang, Y., Weiss, R. J., Chung, Y. A., Wang, Y., Wu, Y., & Glass, J. (2019). Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5901–5905). IEEE. https://doi.org/10.1109/ICASSP.2019.8683561

Huang, J., & Kingsbury, B. (2013). Audio-visual deep learning for noise robust speech recognition. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 7596–7599). IEEE. https://doi.org/10.1109/ICASSP.2013.6639140

Hush, D. R., & Horne, B. G. (1993). Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(1), 8–39. https://doi.org/10.1109/79.180705.

Kadyan, V., Mantri, A., Aggarwal, R. K., & Singh, A. (2019). A comparative study of deep neural network based Punjabi-ASR system. International Journal of Speech Technology, 22(1), 111–119. https://doi.org/10.1007/s10772-018-09577-3.

Lal, P., & King, S. (2013). Cross-lingual automatic speech recognition using tandem features. IEEE Transactions on Audio, Speech, and Language Processing, 21(12), 2506–2515. https://doi.org/10.1109/TASL.2013.2277932.

Kinnunen, T., Juvela, L., Alku, P., & Yamagishi, J. (2017). Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5535–5539). IEEE. https://doi.org/10.1109/ICASSP.2017.7953215

Lippmann, R., Martin, E., & Paul, D. (1987). Multi-style training for robust isolated-word speech recognition. In ICASSP'87. IEEE international conference on acoustics, speech, and signal processing (Vol. 12, pp. 705–708). IEEE. https://doi.org/10.1109/ICASSP.1987.1169544

Lyon, R. (1984). Computational models of neural auditory processing. In ICASSP'84. IEEE international conference on acoustics, speech, and signal processing (Vol. 9, pp. 41–44). IEEE. https://doi.org/10.1109/ICASSP.1984.1172756

Naik, J. M., & Lubensky, D. M. (1994). A hybrid HMM-MLP speaker verification algorithm for telephone speech. In Proceedings of ICASSP'94. IEEE international conference on acoustics, speech and signal processing (Vol. 1, pp. I–153). IEEE. https://doi.org/10.1109/ICASSP.1994.389332

Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., ... & Rose, R. C. (2010). Subspace Gaussian mixture models for speech recognition. In 2010 IEEE international conference on acoustics, speech and signal processing (pp. 4330–4333). IEEE. https://doi.org/10.1109/ICASSP.2010.5495662

Ravanelli, M., & Janin, A. (2014). TANDEM-bottleneck feature combination using hierarchical Deep Neural Networks. In The 9th international symposium on chinese spoken language processing (pp. 113–117). IEEE. https://doi.org/10.1109/ISCSLP.2014.6936576

Saon, G., Tüske, Z., Audhkhasi, K., & Kingsbury, B. (2019). Sequence noise injected training for end-to-end speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6261–6265). IEEE. https://doi.org/10.1109/ICASSP.2019.8683706

Serdyuk, D., Audhkhasi, K., Brakel, P., Ramabhadran, B., Thomas, S., & Bengio, Y. (2016). Invariant representations for noisy speech recognition. arXiv preprint. arXiv:1612.01928

Tebelskis, J., & Waibel, A. (1990). Large vocabulary recognition using linked predictive neural networks. In International conference on acoustics, speech, and signal processing (pp. 437–440). IEEE. https://doi.org/10.1109/ICASSP.1990.115742

Tomar, V. S., & Rose, R. C. (2013). A family of discriminative manifold learning algorithms and their application to speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1), 161–171. https://doi.org/10.1109/TASLP.2013.2286906.

Zeng, Y. M., Wu, Z. Y., Falk, T., & Chan, W. Y. (2006). Robust GMM based gender classification using pitch and RASTA-PLP parameters of speech. In 2006 International conference on machine learning and cybernetics (pp. 3376–3379). IEEE. https://doi.org/10.1109/ICMLC.2006.258497