Processing of speech signals for robust recognition in practical environments

Springer Science and Business Media LLC - Volume 5 - Pages 167-178 - 2017
Vishala Pannala¹
¹Speech and Vision Lab, LTRC, International Institute of Information Technology (IIIT), Hyderabad, India

Abstract

In automatic speech recognition systems, the information in the speech signal is traditionally extracted as feature vectors representing sub-word units, and these features are then converted into human-readable text. However, such systems perform poorly when speech is degraded by varying environmental conditions. To improve performance, two main issues must be addressed: (a) determination of speech regions in speech data collected in degraded environments, and (b) recognition of speech sounds from the degraded speech in the detected speech regions. Although a wide variety of techniques address these issues, most of them are applicable to clean speech synthetically degraded by stationary noise, because statistical modeling requires large amounts of training data. The present work focuses on methods of processing the signals to determine the desired speech regions in degraded conditions. To this end, signal processing methods are explored that extract speech-specific characteristics independent of the characteristics of the degradations.
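
Issue (a), determining speech regions, can be illustrated in its simplest form with a frame-energy detector. The sketch below is a generic baseline, not the method developed in this work: it assumes additive stationary noise, which is exactly the restricted condition the abstract attributes to most existing techniques. The function names, the frame length of 160 samples, the threshold ratio, and the noise-floor rule (mean of the quietest 10% of frames) are all illustrative assumptions.

```python
import math
import random


def frame_energies(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def detect_speech_regions(signal, frame_len=160, ratio=4.0):
    """Return (start_sample, end_sample) pairs for runs of frames whose
    energy exceeds `ratio` times an estimated noise floor.

    Assumption: the noise floor is the mean energy of the quietest 10%
    of frames, which is only sensible for roughly stationary noise.
    """
    energies = frame_energies(signal, frame_len)
    if not energies:
        return []
    k = max(1, len(energies) // 10)
    floor = sum(sorted(energies)[:k]) / k
    threshold = ratio * floor

    regions, start = [], None
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and start is None:
            start = i  # a region opens at this frame
        elif not active and start is not None:
            regions.append((start * frame_len, i * frame_len))
            start = None
    if start is not None:  # region still open at the end of the signal
        regions.append((start * frame_len, len(energies) * frame_len))
    return regions


# Synthetic test signal: 1 s of stationary Gaussian noise at 8 kHz with a
# 200 ms tone burst standing in for speech (not real speech data).
random.seed(0)
fs = 8000
signal = [random.gauss(0.0, 0.05) for _ in range(fs)]
for n in range(3200, 4800):
    signal[n] += 0.5 * math.sin(2 * math.pi * 200 * n / fs)

regions = detect_speech_regions(signal)
print(regions)
```

On this synthetic signal the detector should report a single region covering the burst, samples 3200 to 4800. A fixed energy threshold of this kind is expected to break down when the noise level changes over time, which is why the abstract argues for features tied to speech-specific characteristics rather than to the properties of the degradation.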
