Learning hierarchical video representation for action recognition
Tóm tắt
Từ khóa
Tài liệu tham khảo
Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: European conference on computer vision
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: 2005 IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, IEEE. pp 65–72
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2014) Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint: arXiv:1411.4389
Graves A, Mohamed A-r, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE. pp 6645–6649
Hoai M, Zisserman A (2014) Improving human action recognition using score distribution and ranking. In: Asian conference on computer vision
Jhuo I-H, Ye G, Gao S, Liu D, Jiang Y-G, Lee D, Chang S-F (2014) Discovering joint audio-visual codewords for video event detection. Mach Vis Appl 25(1):33–47
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint: arXiv:1408.5093
Jiang Y-G, Ye G, Chang S-F, Ellis D, Loui AC (2011) Consumer video understanding: a benchmark database and an evaluation of human and machine performance. In: Proceedings of ACM international conference on multimedia retrieval
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition
Klaser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008-19th British machine vision conference, British Machine Vision Association, pp 275:1–10
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision
Lan Z, Lin M, Li X, Hauptmann AG, Raj B (2015) Beyond gaussian pyramid: multi-skip feature stacking for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Laptev I, Lindeberg T (2003) Space-time interest points. In: Proceedings of the IEEE international conference on computer vision
Li Q, Qiu Z, Yao T, Mei T, Rui Y, Luo J (2016) Action recognition by learning deep multi-granular spatio-temporal video representation. In: Proceedings of ACM international conference on multimedia retrieval
Li W, Zhang Z, Liu Z (2008) Expandable data-driven graphical modeling of human actions based on salient postures. IEEE Trans Circuits Syst Video Technol 18(11):1499–1510
Ma AJ, Yuen PC (2014) Reduced analytic dependency modeling: Robust fusion for visual recognition. Int J Comput Vis 109(3):233–251
Ma Y-F, Hua X-S, Lu L, Zhang H-J (2005) A generic framework of user attention model and its application in video summarization. IEEE Trans MM 7(5):907–919
Ng JY-H, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Ngo C-W, Ma Y-F, Zhang H-J (2005) Video summarization and scene detection by graph modeling. IEEE Trans CSVT 15(2):296–305
Pan Y, Li Y, Yao T, Mei T, Li H, Rui Y (2016) Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In: International joint conference on artificial intelligence
Pan Y, Mei T, Yao T, Li H, Rui Y (2016) Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint: arXiv:1405.4506
Qiu Z, Li Q, Yao T, Mei T, Rui Y (2015) Msr asia msm at thumos challenge 2015. In: CVPR THUMOS challenge workshop
Qiu Z, Yao T, Mei T (2016) Deep quantization: encoding convolutional activations with deep generative model. arXiv preprint: arXiv:1611.09502
Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: ACM international conference on multimedia, ACM, pp 357–360
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations
Snoek CGM, van de Sande KEA, de Rooij O et al (2008) The mediamill trecvid 2008 semantic video search engine. In: NIST TRECVID workshop
Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01
Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using LSTMs. In: Proceedings of international conference on machine learning
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Tran D, Bourdev LD, Fergus R, Torresani L, Paluri M (2014) Learning spatiotemporal features with 3d convolutional networks. arXiv preprint: arXiv:1412.0767
Wang H, Schmid C (2013) Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Wei X-Y, Jiang Y-G, Ngo C-W (2011) Concept-driven multi-modality fusion for video search. IEEE Trans CSVT 21(1):62–73
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
Wilkins P, Ferguson P, Smeaton AF (2006) Using score distributions for query-time fusion in multimediaretrieval. In: ACM SIGMM international workshop on Multimedia information retrieval
Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: European conference on computer vision, pp 650–663. Springer,
Wu Z, Jiang Y-G, Wang J, Pu J, Xue X (2014) Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In: ACM international conference on multimedia, pp 167–176. ACM
Yao T, Mei T, Ngo C-W, Li S (2013) Annotation for free: video tagging by mining user search behavior. In: ACM international conference on multimedia
Yao T, Mei T, Rui Y (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Yao T, Ngo C-W, Mei T (2013) Circular reranking for visual search. IEEE Trans Image Process 22(4):1644–1655
Ye G, Liu D, Jhuo I-H, Chang S-F (2012) Robust late fusion with rank minimization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3021–3028. IEEE
Yuan X, Lai W, Mei T, Hua X-S, Wu X-Q, Li S (2006) Automatic video genre categorization using hierarchical svm. In: 2006 International conference on image processing, pp 2905–2908. IEEE
Zaremba W, Sutskever I (2014) Learning to execute. arXiv preprint: arXiv:1410.4615