Multimodal Machine Learning: A Survey and Taxonomy
Abstract
Keywords
References
qin, 2008, Global ranking using continuous conditional random fields, Proc 28th Int Conf Neural Inf Process Syst, 1281
rajendran, 2015, Bridge correlational neural networks for multilingual multimodal representation learning, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 171
rajagopalan, 2016, Extending long short-term memory for multi-view structured learning, Proc Eur Conf Comput Vis, 338
ramirez, 2011, Modeling latent discriminative dynamic of multi-dimensional affective signals, Proc Int Conf Affective Comput Intell Interaction, 396, 10.1007/978-3-642-24571-8_51
reed, 2016, Generative adversarial text to image synthesis, Proc 29th Int Conf Mach Learn, 1060
ratnaparkhi, 2000, Trainable methods for surface natural language generation, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 194
bucak, 2014, Multiple kernel learning for visual object recognition: A review, IEEE Trans Pattern Anal Mach Intell, 36, 1354, 10.1109/TPAMI.2013.212
bruni, 2012, Distributional semantics in technicolor, Proc Annual Meeting of the Assoc Computational Linguistics, 136
castellano, 2008, Emotion recognition through multiple modalities: Face, body gesture, speech, Affect and Emotion in Human-Computer Interaction, 10.1007/978-3-540-85099-1_8
carletta, 2005, The AMI meeting corpus: A pre-announcement, Proc Int Conf Methods Tech Behav Res, 28
rohrbach, 2015, The long-short story of movie description, Proc German Conf Pattern Recognit, 209, 10.1007/978-3-319-24947-6_17
salakhutdinov, 2009, Deep Boltzmann machines, Proc Conf Artif Intell Statist, 448
schuller, 2011, AVEC 2011 the first international audio/visual emotion challenge, Proc Int Conf Affective Comput Intell Interaction, 415, 10.1007/978-3-642-24571-8_53
brown, 1993, The mathematics of statistical machine translation: Parameter estimation, Comput Linguistics, 263
bojanowski, 2014, Weakly supervised action labeling in videos under ordering constraints, Proc Eur Conf Comput Vis, 628
bourlard, 1996, A new ASR approach based on independent processing and recombination of partial frequency bands, Proc Int Conf Spoken Language, 426
naim, 2014, Unsupervised alignment of natural language instructions with video segments, Proc 26th AAAI Conf Artif Intell, 1558
nefian, 2002, A coupled HMM for audio-visual speech recognition, InterSpeech, 2, ii-2013
mikolov, 2013, Distributed representations of words and phrases and their compositionality, Proc 28th Int Conf Neural Inf Process Syst, 3111
mitchell, 2012, Midge: Generating image descriptions from computer vision detections, Proc 14th Conf Eur Chapter Assoc Comput Linguistics, 747
moon, 2015, Multimodal transfer deep learning for audio-visual recognition, Proc 28th Int Conf Neural Inf Process Syst Workshops
morvant, 2014, Majority vote of diverse classifiers for late fusion, Lecture Notes in Computer Science Description, 10.1007/978-3-662-44415-3_16
farhadi, 2010, Every picture tells a story: Generating sentences from images, Lecture Notes in Computer Science, 10.1007/978-3-642-15561-1_2
fan, 2014, TTS synthesis with bidirectional LSTM based recurrent neural networks, Proc Annu Conf Int Speech Commun Assoc, 1964
elliott, 2013, Image description using visual dependency representations, Proc Conf Empirical Methods Natural Language Process, 1292
chen, 2015, Microsoft COCO captions: Data collection and evaluation server
papineni, 2002, BLEU: A method for automatic evaluation of machine translation, Proc Annual Meeting of the Assoc Computational Linguistics, 311
palatucci, 2009, Zero-shot learning with semantic output codes, Proc 28th Int Conf Neural Inf Process Syst, 1410
ordonez, 2011, Im2text: Describing images using 1 million captioned photographs, Proc 28th Int Conf Neural Inf Process Syst, 1143
oord, 2016, WaveNet: A generative model for raw audio
anguera, 2014, Audio-to-text alignment for speech recognition with very limited resources, Proc Annu Conf Int Speech Commun Assoc, 1405
andrew, 2013, Deep canonical correlation analysis, Proc 29th Int Conf Mach Learn
deena, 2009, Speech-driven facial animation using a shared gaussian process latent variable model, Proc Int Symp Adv Vis Comput, 89, 10.1007/978-3-642-10331-5_9
ngiam, 2011, Multimodal deep learning, Proc 29th Int Conf Mach Learn, 689
cour, 2008, Movie/script: Alignment and parsing of video and text transcription, Proc Eur Conf Comput Vis, 158
chorowski, 2015, Attention-based models for speech recognition, Proc 28th Int Conf Neural Inf Process Syst, 577
collobert, 2016, Wav2letter: An end-to-end convnet-based speech recognition system
christoudias, 2008, Multi-view learning in the presence of view disagreement, Proc 24th Conf Uncertainty Artif Intell, 88
gönen, 2011, Multiple kernel learning algorithms, J Mach Learn Res, 12, 2211
glorot, 2010, Understanding the difficulty of training deep feedforward neural networks, Proc Conf Artif Intell Statist, 249
glodek, 2011, Multiple classifier systems for the classification of audio-visual emotional states, Proc Int Conf Affect Emotion Human-Comput Interaction, 359
gupta, 2012, Choosing linguistics over vision to describe images, Proc 26th AAAI Conf Artif Intell, 606
goodfellow, 2014, Generative adversarial nets, Proc 28th Int Conf Neural Inf Process Syst, 2672
feng, 2010, Visual information in semantic representation, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 91
fidler, 2013, A sentence is worth a thousand pixels holistic CRF model, Proc IEEE Conf Comput Vis Pattern Recognit, 1995
frome, 2013, DeViSE: A deep visual-semantic embedding model, Proc 28th Int Conf Neural Inf Process Syst, 2121
gao, 2015, Are you talking to a machine? dataset and methods for multilingual image question answering, Proc 28th Int Conf Neural Inf Process Syst, 2296
socher, 2013, Zero-shot learning through cross-modal transfer, Proc 28th Int Conf Neural Inf Process Syst, 935
simonyan, 2015, Very deep convolutional networks for large-scale image recognition, Proc Int Conf Learn Representations
sjölander, 2003, An HMM-based system for automatic segmentation and alignment of speech, Proc Fonetik, 93
slaney, 2000, FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks, Proc 28th Int Conf Neural Inf Process Syst
jaques, 2015, Multi-task, multi-kernel learning for estimating individual wellbeing, Proc Multimodal Mach Learn Workshop Conjunction NIPS, 1
silberer, 2012, Grounded models of semantic representation, Proc Conf Empirical Methods Natural Language Process, 1423
ioffe, 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc Int Conf Mach Learning, 448
simonyan, 2014, Two-stream convolutional networks for action recognition in videos, Proc 28th Int Conf Neural Inf Process Syst, 568
hinton, 1993, Autoencoders, minimum description length and Helmholtz free energy, Proc 28th Int Conf Neural Inf Process Syst, 3
kalchbrenner, 2013, Recurrent continuous translation models, Proc Conf Empirical Methods Natural Language Process, 1700
sutton, 2006, Introduction to conditional random fields for relational learning, Introduction to Statistical Relational Learning
song, 2016, Unsupervised alignment of actions in video with text descriptions, Proc 7th Int Joint Conf Artif Intell, 2025
srivastava, 2014, Dropout: A simple way to prevent neural networks from overfitting, J Mach Learn Res, 15, 1929
song, 2012, Multi-view latent variable discriminative models for action recognition, Proc IEEE Conf Comput Vis Pattern Recognit, 2120
sutskever, 2014, Sequence to sequence learning with neural networks, Proc 28th Int Conf Neural Inf Process Syst, 3104
srivastava, 2012, Learning representations for multimodal data with deep belief nets, Proc 29th Int Conf Mach Learn
srivastava, 2012, Multimodal learning with deep Boltzmann machines, Proc 28th Int Conf Neural Inf Process Syst, 2949
taylor, 2012, Dynamic units of visual speech, Proc 28th Annu Conf Comput Graph Interactive Techn, 275
thomason, 2014, Integrating language and vision to generate natural language descriptions of videos in the wild, Proc 23rd Int Conf Comput Linguistics Posters, 1218
torabi, 2015, Using descriptive video services to create a large data source for video annotation research
van den oord, 2016, Pixel recurrent neural networks, Proc 29th Int Conf Mach Learn, 1747
vendrov, 2016, Order-embeddings of images and language, Proc Int Conf Learn Representations
weston, 2010, Web scale image annotation: Learning to rank with joint word-image embeddings, Proc Eur Conf Mach Learn, 21, 10.1007/s10994-010-5198-3
wang, 2015, On deep multi-view representation learning, Proc 29th Int Conf Mach Learn, 1083
wang, 2014, Hashing for similarity search: A survey
wang, 2015, Deep multimodal hashing with orthogonal regularization, Proc 7th Int Joint Conf Artif Intell, 2291
li, 2011, Composing simple image descriptions using web-scale n-grams, Proc 15th Conf Computational Natural Language Learning, 220
lebret, 2015, Phrase-based image captioning, Proc 29th Int Conf Mach Learn, 2085
lienhart, 1998, Comparison of automatic shot boundary detection algorithms, Proc SPIE, 290, 10.1117/12.333848
lu, 2016, Hierarchical co-attention for visual question answering, Proc 28th Int Conf Neural Inf Process Syst, 1
wöllmer, 2010, Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling, Proc Annu Conf Int Speech Commun Assoc, 2362
weston, 2011, WSABIE: Scaling up to large vocabulary image annotation, Proc 7th Int Joint Conf Artif Intell, 2764
xu, 2015, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, Proc 26th AAAI Conf Artif Intell, 2346
xu, 2015, Show, attend and tell: Neural image caption generation with visual attention, Proc 29th Int Conf Mach Learn, 2048
wu, 2005, Multi-level fusion of audio and visual features for speaker identification, Proc Int Conf Adv Biometrics, 493, 10.1007/11608288_66
xu, 2016, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, Proc Eur Conf Comput Vis, 451
xiong, 2016, Dynamic memory networks for visual and textual question answering, Proc 29th Int Conf Mach Learn, 2397
malmaud, 2015, What’s cookin’? interpreting cooking videos using text, speech and vision, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol
mansimov, 2016, Generating images from captions with attention, Proc Int Conf Learn Representations
mao, 2015, Deep captioning with multimodal recurrent neural networks (m-RNN), Proc Int Conf Learn Representations
mcfee, 2011, Learning multi-modal similarity, J Mach Learn Res, 12, 491
mei, 2016, Listen, attend, and walk: Neural mapping of navigational instructions to action sequences, Proc 26th AAAI Conf Artif Intell, 2772
yang, 2011, Corpus-guided sentence generation of natural images, Proc Conf Empirical Methods Natural Language Process, 444
yu, 2013, Grounded language learning from video described with sentences, Proc Annual Meeting of the Assoc Computational Linguistics, 53
yu, 2004, On the integration of grounding language and learning objects, Proc 26th AAAI Conf Artif Intell, 488
kingma, 2015, Adam: A method for stochastic optimization, Proc Int Conf Learn Representations
khapra, 2010, Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, Proc Conf North Amer Chapter Assoc Comput Linguistics Human Language Technol, 420
karpathy, 2014, Deep fragment embeddings for bidirectional image sentence mapping, Proc 28th Int Conf Neural Inf Process Syst, 1889
klein, 2015, Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation, Proc IEEE Conf Comput Vis Pattern Recognit, 4437
kiros, 2015, Unifying visual-semantic embeddings with multimodal neural language models, Transactions of the Association for Computational Linguistics, 1
zeng, 2017, Leveraging video descriptions to learn video question answering, Proc 26th AAAI Conf Artif Intell, 4334
zhang, 2014, Large-scale supervised multimodal hashing with semantic correlation maximization, Proc 26th AAAI Conf Artif Intell, 2177
zhou, 2009, Canonical time warping for alignment of human behavior, Proc 28th Int Conf Neural Inf Process Syst, 2286
zhou, 2012, Generalized time warping for multi-modal alignment of human motion, Proc IEEE Conf Comput Vis Pattern Recognit, 1282
aytar, 2016, SoundNet: Learning sound representations from unlabeled video, Proc 28th Int Conf Neural Inf Process Syst
bahdanau, 2014, Neural machine translation by jointly learning to align and translate, Proc Int Conf Learn Representations
baltrušaitis, 2013, Dimensional affect recognition using continuous conditional random fields, Proc 10th IEEE Int Conf Workshops Autom Face Gesture Recognit, 1
barbu, 2012, Video in sentences out, Proc Conf Uncertainty Artif Intell, 102
barnard, 2003, Matching words and pictures, J Mach Learn Res, 3, 1107
kumar, 2011, Learning hash functions for cross-view similarity search, Proc 7th Int Joint Conf Artif Intell, 1360
kuznetsova, 2012, Collective generation of natural image descriptions, Proc Annual Meeting of the Assoc Computational Linguistics, 359
krizhevsky, 2012, ImageNet classification with deep convolutional neural networks, Proc 28th Int Conf Neural Inf Process Syst, 1097
lafferty, 2001, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc 29th Int Conf Mach Learn, 282