Joint embeddings with multimodal cues for video-text retrieval

Niluthpol Chowdhury Mithun1, Juncheng Li2, Florian Metze2, Amit K. Roy-Chowdhury3
1University of California, Riverside, USA
2Carnegie Mellon University, Pittsburgh, USA
3University of California, Riverside, USA

Abstract

Keywords


References

Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: International conference on machine learning, pp 1247–1255

Awad G, Butt A, Fiscus J, Joy D, Delgado A, Michel M, Smeaton AF, Graham Y, Kraaij W, Quénot G et al (2017) TRECVID 2017: evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking. In: Proceedings of TRECVID

Aytar Y, Vondrick C, Torralba A (2016) SoundNet: learning sound representations from unlabeled video. In: Advances in neural information processing systems, pp 892–900

Aytar Y, Vondrick C, Torralba A (2017) See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932

Bois R, Vukotić V, Simon AR, Sicre R, Raymond C, Sébillot P, Gravier G (2017) Exploiting multimodality in video hyperlinking to improve target diversity. In: International conference on multimedia modeling, Springer, pp 185–197

Budnik M, Demirdelen M, Gravier G (2018) A study on multimodal video hyperlinking with visual aggregation. In: 2018 IEEE international conference on multimedia and expo, IEEE, pp 1–6

Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 4724–4733

Cha M, Gwon Y, Kung H (2015) Multimodal sparse representation learning and applications. arXiv preprint arXiv:1511.06238

Chen DL, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Annual meeting of the association for computational linguistics: human language technologies, vol 1, ACL, pp 190–200

Chi J, Peng Y (2018) Dual adversarial networks for zero-shot cross-media retrieval. In: International joint conferences on artificial intelligence, pp 663–669

Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555

Dong J, Li X, Snoek CG (2016) Word2VisualVec: image and video to sentence matching by visual feature prediction. arXiv preprint arXiv:1604.06838

Faghri F, Fleet DJ, Kiros JR, Fidler S (2018) VSE++: improved visual-semantic embeddings. In: British machine vision conference (BMVC)

Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010) Every picture tells a story: generating sentences from images. In: European conference on computer vision, Springer, pp 15–29

Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: ACM multimedia conference, ACM, pp 7–16

Fraz MM, Remagnino P, Hoppe A, Uyyanonvara B, Rudnicka AR, Owen CG, Barman SA (2012) An ensemble classification-based approach applied to retinal blood vessel segmentation. IEEE Trans Biomed Eng 59(9):2538–2548

Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T et al (2013) Devise: a deep visual-semantic embedding model. In: Advances in neural information processing systems, pp 2121–2129

Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233

Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 770–778

Henning CA, Ewerth R (2017) Estimating the information gap between textual and visual representations. In: International conference on multimedia retrieval, ACM, pp 14–22

Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. J Artif Intell Res 47:853–899

Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 2261–2269

Huang Y, Wang W, Wang L (2017) Instance-aware image and sentence matching with selective multimodal LSTM. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 2310–2318

Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 3128–3137

Karpathy A, Joulin A, Fei-Fei L (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in neural information processing systems, pp 1889–1897

Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539

Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In: Advances in neural information processing systems, pp 3294–3302

Klein B, Lev G, Sadeh G, Wolf L (2015) Associating neural word embeddings with deep image representations using fisher vectors. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 4437–4446

Lee JH (1997) Analyses of multiple evidence combination. In: ACM SIGIR forum, vol 31, ACM, pp 267–276

Ma Z, Lu Y, Foster D (2015) Finding linear structure in large datasets with scalable canonical correlation analysis. In: International conference on machine learning, pp 169–178

Manmatha R, Wu CY, Smola AJ, Krahenbuhl P (2017) Sampling matters in deep embedding learning. In: IEEE international conference on computer vision, IEEE, pp 2859–2867

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

Mithun NC, Li J, Metze F, Roy-Chowdhury AK (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: ACM international conference on multimedia retrieval

Mithun NC, Munir S, Guo K, Shelton C (2018) ODDS: real-time object detection using depth sensors on embedded GPUs. In: ACM/IEEE international conference on information processing in sensor networks, IEEE Press, pp 230–241

Mithun NC, Panda R, Roy-Chowdhury AK (2016) Generating diverse image datasets with limited labeling. In: ACM multimedia conference, ACM, pp 566–570

Mithun NC, Panda R, Papalexakis E, Roy-Chowdhury A (2018) Webly supervised joint embedding for cross-modal image-text retrieval. In: ACM international conference on multimedia

Nam H, Ha JW, Kim J (2017) Dual attention networks for multimodal reasoning and matching. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 299–307

Otani M, Nakashima Y, Rahtu E, Heikkilä J, Yokoya N (2016) Learning joint representations of videos and sentences with web image search. In: European conference on computer vision, Springer, pp 651–667

Pan Y, Mei T, Yao T, Li H, Rui Y (2016) Jointly modeling embedding and translation to bridge video and language. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 4594–4602

Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45

Polikar R (2007) Bootstrap inspired techniques in computational intelligence: ensemble of classifiers, incremental learning, data fusion and missing features. IEEE Signal Process Mag 24(4):59–72

Tran TQN, Le Borgne H, Crucianu M (2016) Aggregating image and text quantized correlated components. In: IEEE conference on computer vision and pattern recognition, pp 2046–2054

Ramanishka V, Das A, Park DH, Venugopalan S, Hendricks LA, Rohrbach M, Saenko K (2016) Multimodal video description. In: ACM multimedia conference, ACM, pp 1092–1096

Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 779–788

Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99

Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 815–823

Socher R, Fei-Fei L (2010) Connecting modalities: semi-supervised segmentation and annotation of images using unaligned text corpora. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 966–973

Torabi A, Tandon N, Sigal L (2016) Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124

Usunier N, Buffoni D, Gallinari P (2009) Ranking with ordered weighted pairwise classification. In: International conference on machine learning, ACM, pp 1057–1064

Vendrov I, Kiros R, Fidler S, Urtasun R (2016) Order-embeddings of images and language. In: International conference on learning representations

Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015) Sequence to sequence - video to text. In: IEEE international conference on computer vision, IEEE, pp 4534–4542

Vukotić V, Raymond C, Gravier G (2016) Bidirectional joint representation learning with symmetrical deep neural networks for multimodal and crossmodal applications. In: ACM international conference on multimedia retrieval, ACM, pp 343–346

Vukotić V, Raymond C, Gravier G (2017) Generative adversarial networks for multimodal representation learning in video hyperlinking. In: ACM international conference on multimedia retrieval, ACM, pp 416–419

Vukotić V, Raymond C, Gravier G (2018) A crossmodal approach to multimodal fusion in video hyperlinking. IEEE Multimed 25(2):11–23

Wang B, Yang Y, Xu X, Hanjalic A, Shen HT (2017) Adversarial cross-modal retrieval. In: ACM multimedia conference, ACM, pp 154–162

Wang L, Li Y, Huang J, Lazebnik S (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Trans Pattern Anal Mach Intell 41(2):394–407

Wang L, Li Y, Lazebnik S (2016) Learning deep structure-preserving image-text embeddings. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 5005–5013

Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 5288–5296

Xu R, Xiong C, Chen W, Corso JJ (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: AAAI, vol 5, p 6

Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 3441–3450

Yan R, Yang J, Hauptmann AG (2004) Learning query-class dependent weights in automatic video retrieval. In: ACM multimedia conference, ACM, pp 548–555

Zhang L, Ma B, Li G, Huang Q, Tian Q (2017) Multi-networks joint learning for large-scale cross-modal retrieval. In: ACM multimedia conference, ACM, pp 907–915

Zhang X, Gao K, Zhang Y, Zhang D, Li J, Tian Q (2017) Task-driven dynamic fusion: reducing ambiguity in video description. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 3713–3721

Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A (2017) Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 40:1452–1464