A Survey on Probabilistic Models in Human Perception and Machines

Lux Li¹, Robert Rehr², Patrick Bruns¹, Timo Gerkmann², Brigitte Röder¹
¹ Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
² Signal Processing (SP), Department of Informatics, University of Hamburg, Hamburg, Germany
