Exploiting spectro-temporal locality in deep learning based acoustic event detection

Miquel Espi, Masakiyo Fujimoto, Keisuke Kinoshita, Tomohiro Nakatani
NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai, Seika-cho, Keihanna Science City, 619-0237, Kyoto, Japan

Abstract

Keywords

