Dual input neural networks for positional sound source localization

Eric Grinstein1, Vincent W. Neo1, Patrick A. Naylor1
1Department of Electrical and Electronic Engineering, Imperial College London, London, UK

Abstract

In many signal processing applications, metadata may be advantageously used in conjunction with a high-dimensional signal to produce a desired output. In the case of classical Sound Source Localization (SSL) algorithms, information from high-dimensional, multichannel audio signals received by multiple distributed microphones is combined with information describing the acoustic properties of the scene, such as the microphones’ coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types within a single neural network. We train and evaluate the proposed DI-NN on scenarios of varying difficulty and realism, and compare it against baselines including a classical Least-Squares (LS) method and a Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines, achieving a localization error five times lower than that of the LS method and two times lower than that of the CRNN on a test dataset of real recordings.
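To make the dual-input idea concrete, the sketch below outlines one possible DI-NN in PyTorch: a convolutional-recurrent branch encodes the multichannel time-frequency features, a small fully connected branch encodes the flattened microphone coordinates, and the two embeddings are concatenated before a regression head that predicts the source position. The layer sizes, feature shapes, and two-dimensional output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DINN(nn.Module):
    """Illustrative dual-input network: one branch encodes the multichannel
    audio features, the other encodes scene metadata (microphone coordinates);
    their embeddings are concatenated before the output regressor."""

    def __init__(self, n_mics=4, n_freq=257, meta_dim=8, hidden=128):
        super().__init__()
        # Signal branch: 2-D convolutions over (channel, time, frequency),
        # followed by a GRU over the time axis.
        self.conv = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((1, 4)),          # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (n_freq // 16), hidden, batch_first=True)
        # Metadata branch: a small MLP over the flattened microphone coordinates.
        self.meta_mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Fusion and regression head: predicts 2-D source coordinates
        # (assumed output dimensionality for this sketch).
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, spec, meta):
        # spec: (batch, n_mics, time, n_freq) time-frequency features
        # meta: (batch, meta_dim) flattened microphone coordinates
        x = self.conv(spec)                    # (batch, 64, time, n_freq // 16)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * (n_freq // 16))
        _, h = self.gru(x)                     # final hidden state: (1, batch, hidden)
        sig_emb = h.squeeze(0)                 # (batch, hidden)
        meta_emb = self.meta_mlp(meta)         # (batch, hidden)
        return self.head(torch.cat([sig_emb, meta_emb], dim=-1))

# Example: 4 microphones with 2-D coordinates -> meta_dim = 8 values.
model = DINN(n_mics=4, n_freq=257, meta_dim=8)
spec = torch.randn(2, 4, 100, 257)   # batch of 2, 100 STFT frames
meta = torch.randn(2, 8)
print(model(spec, meta).shape)       # torch.Size([2, 2])
```

The key design choice illustrated here is late fusion: the signal and metadata branches are kept separate until their fixed-size embeddings are concatenated, so the metadata can modulate the regression without being forced into the time-frequency representation.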
