Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrušaitis1, Chaitanya Ahuja2, Louis-Philippe Morency2
1Microsoft Corporation, Cambridge, United Kingdom
2Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA

Abstract

Keywords


References

10.1109/JPROC.2003.817150

10.1109/TPAMI.2007.1124

qin, 2008, Global ranking using continuous conditional random fields, Proc 28th Int Conf Neural Inf Process Syst, 1281

rajendran, 2015, Bridge correlational neural networks for multilingual multimodal representation learning, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 171

rajagopalan, 2016, Extending long short-term memory for multi-view structured learning, Proc Eur Conf Comput Vis, 338

10.1145/1873951.1873987

ramirez, 2011, Modeling latent discriminative dynamic of multi-dimensional affective signals, Proc Int Conf Affective Comput Intell Interaction, 396, 10.1007/978-3-642-24571-8_51

reed, 2016, Generative adversarial text to image synthesis, Proc 29th Int Conf Mach Learn, 1060

ratnaparkhi, 2000, Trainable methods for surface natural language generation, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 194

10.1109/ICCV.2015.303

10.18653/v1/D15-1303

10.1145/2808196.2811638

10.1145/2663204.2666277

10.1145/2939672.2939812

bucak, 2014, Multiple kernel learning for visual object recognition: A review, IEEE Trans Pattern Anal Mach Intell, 36, 1354, 10.1109/TPAMI.2013.212

10.1613/jair.4135

bruni, 2012, Distributional semantics in technicolor, Proc Annual Meeting of the Assoc Computational Linguistics, 136

10.3115/v1/P15-1006

10.1109/ICASSP.2016.7472621

castellano, 2008, Emotion recognition through multiple modalities: Face, body gesture, speech, Affect and Emotion in Human-Computer Interaction, 10.1007/978-3-540-85099-1_8

carletta, 2005, The AMI meeting corpus: A pre-announcement, Proc Int Conf Methods Tech Behav Res, 28

rohrbach, 2015, The long-short story of movie description, Proc German Conf Pattern Recognit, 209, 10.1007/978-3-319-24947-6_17

10.1109/ICME.2007.4284731

10.3115/1073336.1073359

10.1109/TMM.2007.906583

salakhutdinov, 2009, Deep Boltzmann machines, Proc Conf Artif Intell Statist, 448

10.1007/s11263-016-0987-1

10.1145/2522848.2531741

10.18653/v1/N16-1020

10.1109/ICCV.2011.6126545

schuller, 2011, AVEC 2011 the first international audio/visual emotion challenge, Proc Int Conf Affective Comput Intell Interaction, 415, 10.1007/978-3-642-24571-8_53

10.1109/CVPR.2010.5539928

10.1145/258734.258880

10.1162/tacl_a_00207

brown, 1993, The mathematics of statistical machine translation: Parameter estimation, Comput Linguistics, 263

10.1613/jair.4900

10.1145/279943.279962

10.1145/1866029.1866080

10.1109/ICCV.2015.507

bojanowski, 2014, Weakly supervised action labeling in videos under ordering constraints, Proc Eur Conf Comput Vis, 628

10.1109/CVPR.1997.609450

bourlard, 1996, A new ASR approach based on independent processing and recombination of partial frequency bands, Proc Int Conf Spoken Language, 426

10.3115/v1/W14-3348

10.3115/v1/P15-2017

10.1613/jair.3540

naim, 2014, Unsupervised alignment of natural language instructions with video segments, Proc 26th AAAI Conf Artif Intell, 1558

10.1109/TPAMI.2015.2461544

nefian, 2002, A coupled HMM for audio-visual speech recognition, InterSpeech, 2, ii-2013

10.1109/ICASSP.2015.7178347

10.3115/v1/N15-1017

10.1007/978-3-540-74048-3_4

mikolov, 2013, Distributed representations of words and phrases and their compositionality, Proc 28th Int Conf Neural Inf Process Syst, 3111

mitchell, 2012, Midge: Generating image descriptions from computer vision detections, Proc 14th Conf Eur Chapter Assoc Comput Linguistics, 747

moon, 2015, Multimodal transfer deep learning for audio-visual recognition, Proc 28th Int Conf Neural Inf Process Syst Workshops

morvant, 2014, Majority vote of diverse classifiers for late fusion, Lecture Notes in Computer Science Description, 10.1007/978-3-662-44415-3_16

10.1109/CVPR.2016.213

farhadi, 2010, Every picture tells a story: Generating sentences from images, Lecture Notes in Computer Science, 10.1007/978-3-642-15561-1_2

10.1109/CVPR.2009.5206772

fan, 2014, TTS synthesis with bidirectional LSTM based Recurrent Neural Networks, Proc Annu Conf Int Speech Commun Assoc, 1964

10.1109/TMM.2013.2267205

10.3115/v1/P14-2074

elliott, 2013, Image description using visual dependency representations, Proc Conf Empirical Methods Natural Language Process, 1292

10.1145/2682899

chen, 2015, Microsoft COCO captions: Data collection and evaluation server

papineni, 2002, BLEU: A method for automatic evaluation of machine translation, Proc Annual Meeting of the Assoc Computational Linguistics, 311

10.1109/CVPR.2016.497

palatucci, 2009, Zero-shot learning with semantic output codes, Proc 28th Int Conf Neural Inf Process Syst, 1410

10.1109/CVPR.2016.264

10.1109/CVPR.2014.299

ordonez, 2011, Im2text: Describing images using 1 million captioned photographs, Proc 28th Int Conf Neural Inf Process Syst, 1143

oord, 2016, WaveNet: A generative model for raw audio

10.1109/TPAMI.2011.47

10.1007/s10462-012-9368-5

10.18653/v1/D16-1203

10.1109/CVPR.2016.12

10.1109/CVPR.2013.434

anguera, 2014, Audio-to-text alignment for speech recognition with very limited resources, Proc Annu Conf Int Speech Commun Assoc, 1405

10.1145/2993148.2993176

andrew, 2013, Deep canonical correlation analysis, Proc 29th Int Conf Mach Learn

deena, 2009, Speech-driven facial animation using a shared gaussian process latent variable model, Proc Int Symp Adv Vis Comput, 89, 10.1007/978-3-642-10331-5_9

ngiam, 2011, Multimodal deep learning, Proc 29th Int Conf Mach Learn, 689

10.1109/ICCV.2015.279

10.1109/T-AFFC.2011.9

cour, 2008, Movie/script: Alignment and parsing of video and text transcription, Proc Eur Conf Comput Vis, 158

10.1109/ICASSP.1994.389596

10.1007/978-0-85729-997-0_19

10.1145/383259.383316

10.1145/1180995.1181013

chorowski, 2015, Attention-based models for speech recognition, Proc 28th Int Conf Neural Inf Process Syst, 577

collobert, 2016, Wav2letter: An end-to-end convnet-based speech recognition system

christoudias, 2008, Multi-view learning in the presence of view disagreement, Proc 24th Conf Uncertainty Artif Intell, 88

gönen, 2011, Multiple kernel learning algorithms, J Mach Learn Res, 12, 2211

glorot, 2010, Understanding the difficulty of training deep feedforward neural networks, Proc Conf Artif Intell Statist, 249

glodek, 2011, Multiple classifier systems for the classification of audio-visual emotional states, Proc Int Conf Affect Emotion Human-Comput Interaction, 359

10.21236/ADA307097

10.1109/ICCV.2013.337

gupta, 2012, Choosing linguistics over vision to describe images, Proc 26th AAAI Conf Artif Intell, 606

goodfellow, 2014, Generative adversarial nets, Proc 28th Int Conf Neural Inf Process Syst, 2672

10.1109/ICASSP.2013.6638947

10.1145/1452392.1452442

10.1162/0899766042321814

10.1016/j.neucom.2014.12.020

feng, 2010, Visual information in semantic representation, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 91

10.1145/2647868.2654902

fidler, 2013, A sentence is worth a thousand pixels holistic CRF model, Proc IEEE Conf Comput Vis Pattern Recognit, 1995

frome, 2013, DeViSE: A deep visual-semantic embedding model, Proc 28th Int Conf Neural Inf Process Syst, 2121

10.18653/v1/D16-1044

gao, 2015, Are you talking to a machine? dataset and methods for multilingual image question answering, Proc 28th Int Conf Neural Inf Process Syst, 2296

10.1109/JPROC.2003.817119

10.1109/TPAMI.2017.2648793

10.1109/ICCV.2009.5459169

10.1109/CVPR.2010.5540112

socher, 2013, Zero-shot learning through cross-modal transfer, Proc 28th Int Conf Neural Inf Process Syst, 935

10.1162/tacl_a_00177

simonyan, 2015, Very deep convolutional networks for large-scale image recognition, Proc Int Conf Learn Representations

sjölander, 2003, An HMM-based system for automatic segmentation and alignment of speech, Proc Fonetik, 93

slaney, 2000, FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks, Proc 28th Int Conf Neural Inf Process Syst

10.1023/B:MTAP.0000046380.27575.a5

10.1109/ICCV.2015.277

jaques, 2015, Multi-task, multi-kernel learning for estimating individual wellbeing, Proc Multimodal Mach Learn Workshop Conjunction NIPS, 1

silberer, 2012, Grounded models of semantic representation, Proc Conf Empirical Methods Natural Language Process, 1423

10.1016/j.inffus.2013.12.002

10.3115/v1/P14-1068

ioffe, 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc Int Conf Mach Learning, 448

simonyan, 2014, Two-stream convolutional networks for action recognition in videos, Proc 28th Int Conf Neural Inf Process Syst, 568

10.1109/ICASSP.1996.541110

10.18653/v1/N16-1147

10.1145/2911996.2912043

10.1080/00401706.1991.10484833

10.1109/CVPR.2017.348

10.1016/j.patrec.2014.08.005

10.1109/MSP.2012.2205597

10.1109/CVPR.2016.8

10.1162/neco.2006.18.7.1527

hinton, 1993, Autoencoders, minimum description length and Helmholtz free energy, Proc 28th Int Conf Neural Inf Process Syst, 3

10.1109/ICME.2007.4284627

10.1109/ICASSP.2013.6639140

10.1162/neco.1997.9.8.1735

10.1613/jair.3994

10.1093/biomet/28.3-4.321

10.1109/CVPR.2016.493

10.1109/T-AFFC.2011.37

kalchbrenner, 2013, Recurrent continuous translation models, Proc Conf Empirical Methods Natural Language Process, 1700

10.1007/s12193-015-0195-2

sutton, 2006, Introduction to conditional random fields for relational learning, Introduction to Statistical Relational Learning

song, 2016, Unsupervised alignment of actions in video with text descriptions, Proc 7th Int Joint Conf Artif Intell, 2025

srivastava, 2014, Dropout: A simple way to prevent neural networks from overfitting, J Mach Learn Res, 15, 1929

song, 2012, Multi-view latent variable discriminative models for action recognition, Proc IEEE Conf Comput Vis Pattern Recognit, 2120

10.1145/2388676.2388684

10.1016/j.neuroimage.2014.06.077

sutskever, 2014, Sequence to sequence learning with neural networks, Proc 28th Int Conf Neural Inf Process Syst, 3104

srivastava, 2012, Learning representations for multimodal data with deep belief nets, Proc 29th Int Conf Mach Learn

srivastava, 2012, Multimodal learning with deep Boltzmann machines, Proc 28th Int Conf Neural Inf Process Syst, 2949

10.1109/CVPR.2015.7298792

10.1007/s13735-014-0065-9

taylor, 2012, Dynamic units of visual speech, Proc 28th Annu Conf Comput Graph Interactive Techn, 275

thomason, 2014, Integrating language and vision to generate natural language descriptions of videos in the wild, Proc 23rd Int Conf Comput Linguistics Posters, 1218

torabi, 2015, Using descriptive video services to create a large data source for video annotation research

10.1109/CVPR.2016.552

10.1109/ICASSP.2016.7472669

10.1145/2512530.2512533

van den oord, 2016, Pixel recurrent neural networks, Proc 29th Int Conf Mach Learn, 1747

10.1109/CVPR.2015.7299087

vendrov, 2016, Order-embeddings of images and language, Proc Int Conf Learn Representations

10.3115/v1/N15-1173

10.18653/v1/D16-1204

weston, 2010, Web scale image annotation: Learning to rank with joint word-image embeddings, Proc Eur Conf Mach Learn, 21, 10.1007/s10994-010-5198-3

wang, 2015, On deep multi-view representation learning, Proc 29th Int Conf Mach Learn, 1083

10.1109/CVPR.2016.541

wang, 2014, Hashing for similarity search: A survey

wang, 2015, Deep multimodal hashing with orthogonal regularization, Proc 7th Int Joint Conf Artif Intell, 2291

10.3115/993268.993313

10.1109/CVPR.2015.7298935

10.1016/j.neucom.2014.08.003

li, 2011, Composing simple image descriptions using web-scale n-grams, Proc 15th Conf Computational Natural Language Learning, 220

10.1109/ICCV.2003.1238406

lebret, 2015, Phrase-based image captioning, Proc 29th Int Conf Mach Learn, 2085

10.3115/1073445.1073465

lienhart, 1998, Comparison of automatic shot boundary detection algorithms, Proc SPIE, 290, 10.1117/12.333848

10.1109/JBHI.2013.2285378

lu, 2016, Hierarchical co-attention for visual question answering, Proc 28th Int Conf Neural Inf Process Syst, 1

10.1109/CVPR.2016.333

10.1111/j.1756-8765.2010.01106.x

10.1023/B:VISI.0000029664.99615.94

wöllmer, 2010, Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling, Proc Annu Conf Int Speech Commun Assoc, 2362

10.1145/2647868.2654969

weston, 2011, WSABIE: Scaling up to large vocabulary image annotation, Proc 7th Int Joint Conf Artif Intell, 2764

10.1016/j.imavis.2012.03.001

xu, 2015, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, Proc 26th AAAI Conf Artif Intell, 2346

xu, 2015, Show, attend and tell: Neural image caption generation with visual attention, Proc 29th Int Conf Mach Learn, 2048

10.1145/2647868.2654931

wu, 2005, Multi-level fusion of audio and visual features for speaker identification, Proc Int Conf Adv Biometrics, 493, 10.1007/11608288_66

xu, 2016, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, Proc Eur Conf Comput Vis, 451

xiong, 2016, Dynamic memory networks for visual and textual question answering, Proc 29th Int Conf Mach Learn, 2397

malmaud, 2015, What’s cookin’? interpreting cooking videos using text, speech and vision, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol

10.1109/ICCV.2015.9

10.1109/CVPR.2016.9

mansimov, 2016, Generating images from captions with attention, Proc Int Conf Learn Representations

mao, 2015, Deep captioning with multimodal recurrent neural networks (m-RNN), Proc Int Conf Learn Representations

10.3115/v1/P14-2097

10.1109/ICASSP.1998.679698

mcfee, 2011, Learning multi-modal similarity, J Mach Learn Res, 12, 491

mcgurk, 1976, Hearing lips and seeing voices, Nature, 264, 746, 10.1038/264746a0

10.1109/ICME.2010.5583006

mei, 2016, Listen, attend, and walk: Neural mapping of navigational instructions to action sequences, Proc 26th AAAI Conf Artif Intell, 2772

yang, 2011, Corpus-guided sentence generation of natural images, Proc Conf Empirical Methods Natural Language Process, 444

10.1109/CVPR.2016.10

10.1109/JPROC.2010.2050411

10.1109/ICCV.2015.512

10.3115/v1/P15-2018

yu, 2013, Grounded language learning from video described with sentences, Proc Annual Meeting of the Assoc Computational Linguistics, 53

yu, 2004, On the integration of grounding language and learning objects, Proc 26th AAAI Conf Artif Intell, 488

10.1162/tacl_a_00166

10.1109/TMM.2012.2188783

10.1109/CVPR.2016.496

kingma, 2015, Adam: A method for stochastic optimization, Proc Int Conf Learn Representations

10.1109/ICASSP.2013.6638346

10.18653/v1/D15-1293

10.3115/v1/P15-2038

10.3115/v1/D14-1005

khapra, 2010, Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 420

karpathy, 2014, Deep fragment embeddings for bidirectional image sentence mapping, Proc 28th Int Conf Neural Inf Process Syst, 1889

10.1109/CVPR.2015.7298932

klein, 2015, Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation, Proc IEEE Conf Comput Vis Pattern Recognit, 4437

10.1023/A:1020346032608

kiros, 2015, Unifying visual-semantic embeddings with multimodal neural language models, Transactions of the Association for Computational Linguistics, 1

10.1007/978-3-319-46475-6_5

10.1109/35.41402

zeng, 2017, Leveraging video descriptions to learn video question answering, Proc 26th AAAI Conf Artif Intell, 4334

10.1109/TPAMI.2008.52

10.1109/TASL.2012.2187195

10.1016/j.specom.2009.04.004

10.18653/v1/P16-1169

zhang, 2014, Large-scale supervised multimodal hashing with semantic correlation maximization, Proc 26th AAAI Conf Artif Intell, 2177

zhou, 2009, Canonical time warping for alignment of human behavior, Proc 28th Int Conf Neural Inf Process Syst, 2286

10.1109/ICASSP.2013.6639047

zhou, 2012, Generalized time warping for multi-modal alignment of human motion, Proc IEEE Conf Comput Vis Pattern Recognit, 1282

10.1007/s00530-010-0182-0

aytar, 2016, SoundNet: Learning sound representations from unlabeled video, Proc 28th Int Conf Neural Inf Process Syst

bahdanau, 2014, Neural machine translation by jointly learning to align and translate, Proc Int Conf Learn Representations

baltrušaitis, 2013, Dimensional Affect recognition using continuous conditional random fields, Proc 10th IEEE Int Conf Workshops Autom Face Gesture Recognit, 1

barbu, 2012, Video in sentences out, Proc Conf Uncertainty Artif Intell, 102

barnard, 2003, Matching words and pictures, J Mach Learn Res, 3, 1107

kumar, 2011, Learning hash functions for cross-view similarity search, Proc 7th Int Joint Conf Artif Intell, 1360

10.1111/lnc3.12170

10.1109/TPAMI.2012.162

10.1146/annurev.psych.59.103006.093639

10.1109/TPAMI.2013.50

kuznetsova, 2012, Collective generation of natural image descriptions, Proc Annual Meeting of the Assoc Computational Linguistics, 359

krizhevsky, 2012, ImageNet classification with deep convolutional neural networks, Proc 28th Int Conf Neural Inf Process Syst, 1097

10.1109/CVPR.2014.455

10.1137/1025045

10.1023/B:MACH.0000035472.73496.0c

lafferty, 2001, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc 29th Int Conf Mach Learn, 282

10.1142/S012906570000034X

10.1007/s11042-013-1391-2

10.3115/v1/P14-1132

10.1109/ICCV.2015.11

10.1109/CVPR.2013.387