Processing of speech signals for robust recognition in practical environments
Abstract
In automatic speech recognition systems, the information in the speech signal is traditionally extracted in the form of feature vectors representing sub-word units, which are then converted into human-readable text. However, these systems perform poorly when the speech is degraded under varying environmental conditions. To improve performance, the main issues to be addressed are: (a) determination of the speech regions in data collected in degraded environments, and (b) recognition of speech sounds from the degraded speech within the detected regions. Although a wide variety of techniques address these issues, most of them apply only to clean speech synthetically degraded by stationary noise, owing to the need for large amounts of training data for statistical modeling. The present work focuses on methods of processing the signals to determine the desired speech regions in degraded conditions. For this, signal processing methods are explored to extract speech-specific characteristics that are independent of the characteristics of the degradations.
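To make the speech-region detection task in (a) concrete, the sketch below shows a minimal, conventional voice activity detector based on short-time log energy, written in Python. The function name `simple_vad`, the frame and hop sizes, the percentile-based noise-floor estimate, and the 6 dB margin are all illustrative assumptions; they are not the signal processing methods developed in this work, which target speech-specific characteristics that are more robust to degradation than raw energy.

```python
import numpy as np

def simple_vad(signal, fs, frame_ms=25, hop_ms=10, noise_percentile=30, margin_db=6.0):
    """Label each frame as speech (True) or nonspeech (False).

    A low percentile of the frame log-energy distribution serves as a crude
    noise-floor estimate; frames exceeding it by margin_db are taken as speech.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = [signal[i * hop: i * hop + frame_len] for i in range(n_frames)]
    log_energy = np.array([10.0 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames])
    noise_floor = np.percentile(log_energy, noise_percentile)
    return log_energy > noise_floor + margin_db

# Example: detect the active region in a signal that is noise-only except
# for a sinusoidal burst in the middle (standing in for a speech segment).
if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs                      # 1 s of signal
    x = 0.01 * np.random.randn(fs)              # background noise
    x[6000:10000] += 0.5 * np.sin(2 * np.pi * 200 * t[6000:10000])
    decisions = simple_vad(x, fs)
    print("speech frames:", np.flatnonzero(decisions))
```

Such energy-based decisions degrade quickly at low signal-to-noise ratios, which is precisely the limitation that motivates the degradation-independent characteristics pursued in this work.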