Learning hierarchical video representation for action recognition

International Journal of Multimedia Information Retrieval - Tập 6 Số 1 - Trang 85-98 - 2017
Qing Li1, Zhaofan Qiu1, Ting Yao2, Tao Mei2, Yong Rui2, Jiebo Luo3
1University of Science & Technology of China, Hefei, People’s Republic of China
2Microsoft Research, Beijing, People's Republic of China
3University of Rochester, New York, USA

Tóm tắt

Từ khóa


Tài liệu tham khảo

Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: European conference on computer vision

Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: 2005 IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, IEEE. pp 65–72

Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2014) Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint: arXiv:1411.4389

Graves A, Mohamed A-r, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE. pp 6645–6649

Hoai M, Zisserman A (2014) Improving human action recognition using score distribution and ranking. In: Asian conference on computer vision

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Jhuo I-H, Ye G, Gao S, Liu D, Jiang Y-G, Lee D, Chang S-F (2014) Discovering joint audio-visual codewords for video event detection. Mach Vis Appl 25(1):33–47

Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint: arXiv:1408.5093

Jiang Y-G, Ye G, Chang S-F, Ellis D, Loui AC (2011) Consumer video understanding: a benchmark database and an evaluation of human and machine performance. In: Proceedings of ACM international conference on multimedia retrieval

Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition

Klaser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008-19th British machine vision conference, British Machine Vision Association, pp 275:1–10

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems

Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision

Lan Z, Lin M, Li X, Hauptmann AG, Raj B (2015) Beyond gaussian pyramid: multi-skip feature stacking for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition

Laptev I, Lindeberg T (2003) Space-time interest points. In: Proceedings of the IEEE international conference on computer vision

Li Q, Qiu Z, Yao T, Mei T, Rui Y, Luo J (2016) Action recognition by learning deep multi-granular spatio-temporal video representation. In: Proceedings of ACM international conference on multimedia retrieval

Li W, Zhang Z, Liu Z (2008) Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Trans Circuits Syst Video Technol 18(11):1499–1510

Ma AJ, Yuen PC (2014) Reduced analytic dependency modeling: Robust fusion for visual recognition. Int J Comput Vis 109(3):233–251

Ma Y-F, Hua X-S, Lu L, Zhang H-J (2005) A generic framework of user attention model and its application in video summarization. IEEE Trans MM 7(5):907–919

Ng JY-H, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition

Ngo C-W, Ma Y-F, Zhang H-J (2005) Video summarization and scene detection by graph modeling. IEEE Trans CSVT 15(2):296–305

Pan Y, Li Y, Yao T, Mei T, Li H, Rui Y (2016) Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In: International joint conference on artificial intelligence

Pan Y, Mei T, Yao T, Li H, Rui Y (2016) Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition

Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint: arXiv:1405.4506

Qiu Z, Li Q, Yao T, Mei T, Rui Y (2015) Msr asia msm at thumos challenge 2015. In: CVPR THUMOS challenge workshop

Qiu Z, Yao T, Mei T (2016) Deep quantization: encoding convolutional activations with deep generative model. arXiv preprint: arXiv:1611.09502

Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: ACM international conference on multimedia, ACM, pp 357–360

Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576

Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations

Snoek CGM, van de Sande KEA, de Rooij O et al (2008) The mediamill trecvid 2008 semantic video search engine. In: NIST TRECVID workshop

Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01

Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using LSTMs. In: Proceedings of international conference on machine learning

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

Tran D, Bourdev LD, Fergus R, Torresani L, Paluri M (2014) Learning spatiotemporal features with 3d convolutional networks. arXiv preprint: arXiv:1412.0767

Wang H, Schmid C (2013) Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pp 3551–3558

Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition

Wei X-Y, Jiang Y-G, Ngo C-W (2011) Concept-driven multi-modality fusion for video search. IEEE Trans CSVT 21(1):62–73

Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560

Wilkins P, Ferguson P, Smeaton AF (2006) Using score distributions for query-time fusion in multimediaretrieval. In: ACM SIGMM international workshop on Multimedia information retrieval

Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: European conference on computer vision, pp 650–663. Springer,

Wu Z, Jiang Y-G, Wang J, Pu J, Xue X (2014) Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In: ACM international conference on multimedia, pp 167–176. ACM

Yao T, Mei T, Ngo C-W, Li S (2013) Annotation for free: video tagging by mining user search behavior. In: ACM international conference on multimedia

Yao T, Mei T, Rui Y (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition

Yao T, Ngo C-W, Mei T (2013) Circular reranking for visual search. IEEE Trans Image Process 22(4):1644–1655

Ye G, Liu D, Jhuo I-H, Chang S-F (2012) Robust late fusion with rank minimization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3021–3028. IEEE

Yuan X, Lai W, Mei T, Hua X-S, Wu X-Q, Li S (2006) Automatic video genre categorization using hierarchical svm. In: 2006 International conference on image processing, pp 2905–2908. IEEE

Zaremba W, Sutskever I (2014) Learning to execute. arXiv preprint: arXiv:1410.4615

Zha S, Luisier F, Andrews W, Srivastava N, Salakhutdinov R. (2015) Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint: arXiv:1503.04144