Recent advances in convolutional neural networks

Pattern Recognition - Tập 77 - Trang 354-377 - 2018
Jiuxiang Gu1, Zhenhua Wang2, Jason Kuen2, Lianyang Ma2, Amir Shahroudy2, Bing Shuai2, Ting Liu2, Xingxing Wang2, Gang Wang2, Jianfei Cai3, Tsuhan Chen3
1ROSE Lab, Interdisciplinary Graduate School, Nanyang Technological University, Singapore
2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
3School of Computer Science and Engineering, Nanyang Technological University, Singapore

Tài liệu tham khảo

Hubel, 1968, Receptive fields and functional architecture of monkey striate cortex, J. Physiol., 215, 10.1113/jphysiol.1968.sp008455 Fukushima, 1982, Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition, 267 Le Cun, 1989, Handwritten digit recognition with a back-propagation network, 396 LeCun, 1998, Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278, 10.1109/5.726791 Hecht-Nielsen, 1988, Theory of the backpropagation neural network, Neural Networks, 1, 445, 10.1016/0893-6080(88)90469-8 Zhang, 1990, Parallel distributed processing model with local space-invariant interconnections and its optical architecture, Appl. Opt., 29, 4790, 10.1364/AO.29.004790 Niu, 2012, A novel hybrid CNN–SVM classifier for recognizing handwritten digits, Pattern Recognit., 45, 1318, 10.1016/j.patcog.2011.09.021 Russakovsky, 2015, Imagenet large scale visual recognition challenge, Int. J. Conflict Violence (IJCV), 115, 211 Simonyan, 2015, Very deep convolutional networks for large-scale image recognition Szegedy, 2015, Going deeper with convolutions, 1 Zeiler, 2014, Visualizing and understanding convolutional networks, 818 He, 2016, Deep residual learning for image recognition, 770 LeCun, 2012, Efficient backprop, 9 Nair, 2010, Rectified linear units improve restricted Boltzmann machines, 807 Wang, 2012, End-to-end text recognition with convolutional neural networks, 3304 Boureau, 2010, A theoretical analysis of feature pooling in visual recognition, 111 Hinton, 2012, Improving neural networks by preventing co-adaptation of feature detectors, CoRR abs/1207.0580 Lin, 2014, Network in network Tang, 2013, Deep learning using linear support vector machines Madjarov, 2012, An extensive experimental comparison of methods for multi-label learning, Pattern Recognit., 45, 3084, 10.1016/j.patcog.2012.03.004 Wijnhoven, 2010, Fast training of object detection using stochastic gradient descent, 424 Zinkevich, 2010, Parallelized stochastic gradient descent, 2595 Ngiam, 2010, Tiled convolutional neural networks, 1279 Wang, 2015, Encoding time series as images for visual inspection and classification using tiled convolutional neural networks Zheng, 2014, Time series classification using multi-channels deep convolutional neural networks, 298 Zeiler, 2010, Deconvolutional networks, 2528 Zeiler, 2011, Adaptive deconvolutional networks for mid and high level feature learning, 2018 Long, 2017, Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal.Mach.Intell. (PAMI), 39, 640, 10.1109/TPAMI.2016.2572683 Visin, 2015, Reseg: a recurrent neural network for object segmentation Noh, 2015, Learning deconvolution network for semantic segmentation, 1520 Cao, 2015, Look and think twice: capturing top-down visual attention with feedback convolutional neural networks, 2956 Zhang, 2016, Top-down neural attention by excitation backprop, 543 Zhang, 2016, Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, 612 Zhou, 2016, Learning deep features for discriminative localization, 2921 Das, 2016, Human attention in visual question answering: Do humans and deep networks look at the same regions?, 932 Dong, 2016, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), 38, 295, 10.1109/TPAMI.2015.2439281 Yu, 2016, Multi-scale context aggregation by dilated convolutions Kalchbrenner, 2016, Neural machine translation in linear time, CoRR abs/1610.10099 Oord, 2016, Wavenet: a generative model for raw audio, CoRR abs/1609.03499 Sercu, 2016, Dense prediction on sequences with time-dilated convolutions for speech recognition Szegedy, 2016, Rethinking the inception architecture for computer vision, 2818 Szegedy, 2017, Inception-v4, inception-resnet and the impact of residual connections on learning, 4278 Hyvärinen, 2007, Complex cell pooling and the statistics of natural images, Network, 18, 81, 10.1080/09548980701418942 Estrach, 2014, Signal recovery from pooling representations, 307 Wan, 2013, Regularization of neural networks using dropconnect, 1058 Yu, 2014, Mixed pooling for convolutional neural networks, 364 Zeiler, 2013, Stochastic pooling for regularization of deep convolutional neural networks Rippel, 2015, Spectral representations for convolutional neural networks, 2449 Mathieu, 2014, Fast training of convolutional networks through FFTs He, 2015, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 37, 1904, 10.1109/TPAMI.2015.2389824 Singh, 2012, Unsupervised discovery of mid-level discriminative patches, 73 Gong, 2014, Multi-scale orderless pooling of deep convolutional activation features, 392 Jégou, 2012, Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal.Mach.Intell. (PAMI), 34, 1704, 10.1109/TPAMI.2011.235 Maas, 2013, Rectifier nonlinearities improve neural network acoustic models, 30 Zeiler, 2013, On rectified linear units for speech processing, 3517 He, 2015, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, 1026 Xu, 2015, Empirical evaluation of rectified activations in convolutional network Clevert, 2016, Fast and accurate deep network learning by exponential linear units (elus) Goodfellow, 2013, Maxout networks, 1319 Springenberg, 2013, Improving deep neural networks with probabilistic maxout units, CoRR abs/1312.6116 Zhang, 2004, Solving large scale linear prediction problems using stochastic gradient descent algorithms Deng, 2012, The mnist database of handwritten digit images for machine learning research, IEEE Signal Process. Mag., 29, 141, 10.1109/MSP.2012.2211477 Liu, 2016, Large-margin softmax loss for convolutional neural networks, 507 Bromley, 1993, Signature verification using a siamese time delay neural network, Int. J. Pattern Recognit. Artif. Intell. (IJPRAI), 7, 669, 10.1142/S0218001493000339 Chopra, 2005, Learning a similarity metric discriminatively, with application to face verification, 539 Hadsell, 2006, Dimensionality reduction by learning an invariant mapping, 1735 Shaham, 2017, Learning by coincidence: siamese networks and common variable learning, Pattern Recognit. Lin, 2017, Deephash: getting regularization, depth and fine-tuning right, 133 Schroff, 2015, Facenet: a unified embedding for face recognition and clustering, 815 Liu, 2016, Deep relative distance learning: tell the difference between similar vehicles, 2167 Ding, 2015, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognit., 48, 2993, 10.1016/j.patcog.2015.04.005 Liu, 2016, Deepfashion: powering robust clothes recognition and retrieval with rich annotations, 1096 Kingma, 2014, Auto-encoding variational bayes Im, 2017, Denoising criterion for variational auto-encoding framework, 2059 Kingma, 2014, Semi-supervised learning with deep generative models, 3581 Goodfellow, 2014, Generative adversarial nets, 2672 Mirza, 2014, Conditional generative adversarial nets, CoRR abs/1411.1784 Vincent, 2008, Extracting and composing robust features with denoising autoencoders, 1096 Ng, 2016, Dual autoencoders features for imbalance classification problem, Pattern Recognit., 60, 875, 10.1016/j.patcog.2016.06.013 Mehta, 2017, Rodeo: robust de-aliasing autoencoder for real-time medical image reconstruction, Pattern Recognit., 63, 499, 10.1016/j.patcog.2016.09.022 Olshausen, 1996, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, 381, 607, 10.1038/381607a0 Lee, 2006, Efficient sparse coding algorithms, 801 Eslami, 2016, Attend, infer, repeat: fast scene understanding with generative models, 3225 Sohn, 2015, Learning structured output representation using deep conditional generative models, 3483 Reed, 2016, Generative adversarial text to image synthesis, 1060 Denton, 2015, Deep generative image models using a Laplacian pyramid of adversarial networks, 1486 Salimans, 2016, Improved techniques for training GANs, 2226 Dosovitskiy, 2016, Generating images with perceptual similarity metrics based on deep networks, 658 Tikhonov, 1943, On the stability of inverse problems, 39, 195 Wang, 2013, Fast dropout training, 118 Ba, 2013, Adaptive dropout for training deep neural networks, 3084 Tompson, 2015, Efficient object localization using convolutional networks, 648 Yang, 2015, Mirror, mirror on the wall, tell me, is the error small?, 4685 Xie, 2015, Holistically-nested edge detection, 1395 Salamon, 2017, Deep convolutional neural networks and data augmentation for environmental sound classification, Signal Process. Lett. (SPL), 24, 279, 10.1109/LSP.2017.2657381 Eigen, 2015, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, 2650 Paulin, 2014, Transformation pursuit for image classification, 3646 Hauberg, 2016, Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation, 342 Xie, 2015, Hyper-class augmented and regularized deep learning for fine-grained image classification, 2645 Xu, 2015, Augmenting strong supervision using web data for fine-grained categorization, 2524 Choromanska, 2015, The loss surfaces of multilayer networks Mishkin, 2016, All you need is a good init Sutskever, 2013, On the importance of initialization and momentum in deep learning, 1139 Glorot, 2010, Understanding the difficulty of training deep feedforward neural networks, 249 Saxe, 2014, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Doersch, 2015, Unsupervised visual representation learning by context prediction, 1422 Agrawal, 2015, Learning to see by moving, 37 Qian, 1999, On the momentum term in gradient descent learning algorithms, Neural Netw., 12, 145, 10.1016/S0893-6080(98)00116-6 Kingma, 2015, Adam: A method for stochastic optimization Loshchilov, 2017, Sgdr: Stochastic gradient descent with warm restarts Schaul, 2013, No more pesky learning rates, 343 Zhang, 2015, Deep learning with elastic averaging SGD, 685 Recht, 2011, Hogwild: a lock-free approach to parallelizing stochastic gradient descent, 693 Dean, 2012, Large scale distributed deep networks, 1232 Paine, 2011, GPU asynchronous stochastic gradient descent to speed up neural network training, CoRR Zhuang, 2013, A fast parallel SGD for matrix factorization in shared memory systems, 249 Yao, 2007, On early stopping in gradient descent learning, Constructive Approx., 26, 289, 10.1007/s00365-006-0663-2 Prechelt, 2012, Early stopping - but when?, 53 Zhang, 2017, Understanding deep learning requires rethinking generalization Ioffe, 2015, Batch normalization: accelerating deep network training by reducing internal covariate shift, J. Mach. Learn. Res. (JMLR), 448 Hochreiter, 1997, Long short-term memory, Neural Comput., 9, 1735, 10.1162/neco.1997.9.8.1735 Srivastava, 2015, Training very deep networks, 2377 He, 2016, Identity mappings in deep residual networks, 630 Shen, 2016, Weighted residuals for very deep networks, 936 Zagoruyko, 2016, Wide residual networks, 87.1 Singh, 2016, Swapout: learning an ensemble of deep architectures, 28 Targ, 2016, Resnet in resnet: generalizing residual architectures, CoRR Zhang, 2016, Residual networks of residual networks: multilevel residual networks, IEEE Trans. Circuits Syst. Video Technol. (TCSVT), PP, 1 Huang, 2016, Densely connected convolutional networks, 4700 S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, Cudnn: efficient primitives for deep learningabs/1410.0759 (2014). Vasilache, 2015, Fast convolutional nets with fbfft: aGPU performance evaluation P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks (2014). Lavin, 2016, Fast algorithms for convolutional neural networks, 4013 Sainath, 2013, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, 6655 Xue, 2013, Restructuring of deep neural network acoustic models with singular value decomposition, 2365 Denil, 2013, Predicting parameters in deep learning, 2148 Denton, 2014, Exploiting linear structure within convolutional networks for efficient evaluation, 1269 Jaderberg, 2014, Speeding up convolutional neural networks with low rank expansions Novikov, 2015, Tensorizing neural networks, 442 Oseledets, 2011, Tensor-train decomposition, SIAM J. Sci. Comput., 33, 2295, 10.1137/090752286 Le, 2013, Fastfood-approximating kernel expansions in loglinear time, 85 Dasgupta, 2011, Fast locality-sensitive hashing, 1073 Yu, 2014, Circulant binary embedding, 946 Cheng, 2015, An exploration of parameter redundancy in deep networks with circulant projections, 2857 Moczulski, 2016, Acdc: a structured efficient linear layer Han, 2016, Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding Kim, 2016, Bitwise neural networks Rastegari, 2016, Xnor-net: imagenet classification using binary convolutional neural networks, 525 Zhou, 2016, Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients, CoRR Courbariaux, 2016, Binarynet: training deep neural networks with weights and activations constrained to+ 1 or-1 Sullivan, 1996, Efficient scalar quantization of exponential and Laplacian random variables, IEEE Trans. Inf. Theory, 42, 1365, 10.1109/18.532878 Y. Gong, L. Liu, M. Yang, L. Bourdev, Compressing deep convolutional networks using vector quantization, in: arXiv preprint arXiv:1412.6115, volume abs/1412.6115, 2014. Chen, 2010, Approximate nearest neighbor search by residual vector quantization, Sensors, 10, 11259, 10.3390/s101211259 Zhou, 2012, Scalar quantization for large scale image search, 169 Pratt, 1988, Comparing biases for minimal network construction with back-propagation, 177 Han, 2015, Learning both weights and connections for efficient neural network, 1135 Guo, 2016, Dynamic network surgery for efficient DNNs, 1379 Yang, 2016, Designing energy-efficient convolutional neural networks using energy-aware pruning, CoRR abs/1611.05128 H. Hu, R. Peng, Y.-W. Tai, C.-K. Tang, Network trimming: a data-driven neuron pruning approach towards efficient deep architectures, volume abs/1607.03250, 2016. Srinivas, 2015, Data-free parameter pruning for deep neural networks Mariet, 2015, Diversity networks Chen, 2015, Compressing neural networks with the hashing trick, 2285 Shi, 2009, Hash kernels for structured data, J. Mach. Learn. Res. (JMLR), 10, 2615 Weinberger, 2009, Feature hashing for large scale multitask learning, 1113 Liu, 2015, Sparse convolutional neural networks, 806 Wen, 2016, Learning structured sparsity in deep neural networks, 2074 Bagherinezhad, 2017, Lcnn: Lookup-based convolutional neural network Egmont-Petersen, 2002, Image processing with neural networksa review, Pattern Recognit., 35, 2279, 10.1016/S0031-3203(01)00178-9 Nogueira, 2017, Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognit., 61, 539, 10.1016/j.patcog.2016.07.001 Zuo, 2015, Exemplar based deep discriminative and shareable feature learning for scene image classification, Pattern Recognit., 48, 3004, 10.1016/j.patcog.2015.02.003 Lopes, 2017, Facial expression recognition with convolutional neural networks: coping with few data and the training sample order, Pattern Recognit., 61, 610, 10.1016/j.patcog.2016.07.026 Everingham, 2015, The pascal visual object classes challenge: a retrospective, Int. J. Conflict Violence (IJCV), 111, 98 Tousch, 2012, Semantic hierarchies for image annotation: a survey, Pattern Recognit., 45, 333, 10.1016/j.patcog.2011.05.017 Srivastava, 2013, Discriminative transfer learning with tree-based priors, 2094 Wang, 2015, Learning fine-grained features via a CNN tree for large-scale classification, CoRR abs/1511.04534 Xiao, 2014, Error-driven incremental learning in deep convolutional neural network for large-scale image classification, 177 Z. Yan, V. Jagadeesh, D. DeCoste, W. Di, R. Piramuthu, Hd-cnn: hierarchical deep convolutional neural network for image classification, in: Proceedings of the International Conference on Computer Vision (ICCV), pp. 2740–2748. Berg, 2014, Birdsnap: large-scale fine-grained visual categorization of birds, 2019 Khosla, 2011, Novel dataset for fine-grained image categorization: stanford dogs, 2, 1 Yang, 2015, A large-scale car dataset for fine-grained categorization and verification, 3973 Minervini, 2016, Finely-grained annotated datasets for image-based plant phenotyping, Pattern Recognit. Lett., 81, 80, 10.1016/j.patrec.2015.10.013 Xie, 2017, Lg-cnn: from local parts to global discrimination for fine-grained recognition, Pattern Recognit., 71, 118, 10.1016/j.patcog.2017.06.002 Branson, 2014, Improved bird species recognition using pose normalized deep convolutional nets Zhang, 2014, Part-based r-cnns for fine-grained category detection, 834 Uijlings, 2013, Selective search for object recognition, Int. J. Conflict Violence (IJCV), 104, 154 Lin, 2015, Deep lac: deep localization, alignment and classification for fine-grained recognition, 1666 Pluim, 2003, Mutual-information-based registration of medical images: a survey, IEEE Trans. Med. Imaging, 22, 986, 10.1109/TMI.2003.815867 Krause, 2014, Learning features and parts for fine-grained recognition, 26 Krause, 2015, Fine-grained recognition without part annotations, 5546 Zhang, 2016, Weakly supervised fine-grained categorization with part-based image representation, IEEE Trans. Image Process., 25, 1713, 10.1109/TIP.2016.2531289 Xiao, 2015, The application of two-level attention models in deep convolutional neural network for fine-grained image classification, 842 Lin, 2015, Bilinear CNN models for fine-grained visual recognition, 1449 Nguyen, 2016, Human detection from images and videos: a survey, Pattern Recognit., 51, 148, 10.1016/j.patcog.2015.08.027 Li, 2015, Feature representation for statistical-learning-based object detection: a review, Pattern Recognit., 48, 3542, 10.1016/j.patcog.2015.04.018 Pedersoli, 2015, A coarse-to-fine approach for fast deformable object detection, Pattern Recognit., 48, 1844, 10.1016/j.patcog.2014.11.006 Nowlan, 1994, A convolutional neural network hand tracker, 901 Girshick, 2015, Deformable part models are convolutional neural networks, 437 Vaillant, 1994, Original approach for the localisation of objects in images, IEE Proc.-Vis. Image Signal Process., 141, 245, 10.1049/ip-vis:19941301 Lin, 2014, Microsoft Coco: Common Objects in Context, 740 Endres, 2014, Category independent object proposals, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 36, 222, 10.1109/TPAMI.2013.122 Gómez, 2017, Textproposals: a text-specific selective search algorithm for word spotting in the wild, Pattern Recognit., 70, 60, 10.1016/j.patcog.2017.04.027 Girshick, 2014, Rich feature hierarchies for accurate object detection and semantic segmentation, 580 He, 2015, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 37, 1904, 10.1109/TPAMI.2015.2389824 Ren, 2017, Faster r-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 39, 1137, 10.1109/TPAMI.2016.2577031 Gidaris, 2015, Object detection via a multi-region and semantic segmentation-aware cnn model, 1134 Yoo, 2015, Attentionnet: Aggregating weak directions for accurate object detection, 2659 Felzenszwalb, 2010, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 32, 1627, 10.1109/TPAMI.2009.167 Simo-Serra, 2014, Fracking deep convolutional image descriptors, CoRR abs/1412.6537 Shrivastava, 2016, Training region-based object detectors with online hard example mining, 761 Redmon, 2016, You only look once: unified, real-time object detection, 779 Liu, 2016, SSD: single shot multibox detector, 21 Lu, 2016, Adaptive object detection using adjacency and zoom prediction, 2351 Zhang, 2013, Real-time visual tracking via online weighted multiple instance learning, Pattern Recognit., 46, 397, 10.1016/j.patcog.2012.07.013 Zhang, 2013, Sparse coding based visual tracking: review and experimental comparison, Pattern Recognit., 46, 1772, 10.1016/j.patcog.2012.10.006 Zhang, 2015, Multi-target tracking by learning local-to-global trajectory models, Pattern Recognit., 48, 580, 10.1016/j.patcog.2014.08.013 Fan, 2010, Human tracking using convolutional neural networks, IEEE Trans. Neural Netw. (TNN), 21, 1610, 10.1109/TNN.2010.2066286 Li, 2014, Deeptrack: learning discriminative feature representations by convolutional neural networks for visual tracking Chen, 2016, Cnntracker: online discriminative object tracking via deep convolutional neural network, Appl. Soft Comput., 38, 1088, 10.1016/j.asoc.2015.06.048 Hong, 2015, Online tracking by learning discriminative saliency map with convolutional neural network, 597 Patacchiola, 2017, Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods, Pattern Recognit., 71, 132, 10.1016/j.patcog.2017.06.009 Nishi, 2017, Generation of human depth images with body part labels for complex human pose recognition, Pattern Recognit., 10.1016/j.patcog.2017.06.006 Toshev, 2014, Deeppose: human pose estimation via deep neural networks, 1653 Jain, 2014, Learning human pose estimation features with convolutional networks Tompson, 2014, Joint training of a convolutional network and a graphical model for human pose estimation, 1799 X. Chen, A.L. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1736–1744. Chen, 2015, Parsing occluded people by flexible compositions, 3945 Fan, 2015, Combining local appearance and holistic view: dual-source deep neural networks for human pose estimation, 1347 Jain, 2014, Modeep: A deep learning framework using motion features for human pose estimation, 302 Tang, 1996, Automatic document processing: a survey, Pattern Recognit., 29, 1931, 10.1016/S0031-3203(96)00044-1 Vinciarelli, 2002, A survey on off-line cursive word recognition, Pattern Recognit., 35, 1433, 10.1016/S0031-3203(01)00129-7 Jung, 2004, Text information extraction in images and video: a survey, Pattern Recognit., 37, 977, 10.1016/j.patcog.2003.10.012 Eskenazi, 2017, A comprehensive survey of mostly textual document segmentation algorithms since 2008, Pattern Recognit., 64, 1, 10.1016/j.patcog.2016.10.023 Bai, 2017, Text/non-text image classification in the wild with convolutional neural networks, Pattern Recognit., 66, 437, 10.1016/j.patcog.2016.12.005 Gomez, 2017, Improving patch-based scene text script identification with ensembles of conjoined networks, Pattern Recognit., 67, 85, 10.1016/j.patcog.2017.01.032 Delakis, 2008, Text detection with convolutional neural networks, 290 Xu, 2015, Robust seed localization and growing with deep convolutional features for scene text detection, 387 Huang, 2014, Robust scene text detection with convolution neural network induced mser trees, 497 Zhang, 2015, Automatic discrimination of text and non-text natural images, 886 Goodfellow, 2014, Multi-digit number recognition from street view imagery using deep convolutional neural networks Jaderberg, 2015, Deep structured output learning for unconstrained text recognition He, 2016, Reading scene text in deep convolutional sequences, 3501 Gers, 2000, Learning to forget: continual prediction with lstm, Neural Comput., 12, 2451, 10.1162/089976600300015015 Shi, 2015, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, CoRR abs/1507.05717 Jaderberg, 2014, Deep features for text spotting, 512 M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, volume 116, 2016, pp. 1–20. Wang, 2015, Deep networks for saliency detection via local estimation and global search, 3183 Zhao, 2015, Saliency detection by multi-context deep learning, 1265 Li, 2015, Visual saliency based on multiscale deep features, 5455 Liu, 2015, Predicting eye fixations using convolutional neural networks, 362 He, 2015, Supercnn: a superpixelwise convolutional neural network for salient object detection, Inter. J. Comput. Vis., 115, 330, 10.1007/s11263-015-0822-0 Vig, 2014, Large-scale optimization of hierarchical features for saliency prediction in natural images, 2798 Kümmerer, 2015, Deep gaze i: boosting saliency prediction with feature maps trained on imagenet Pan, 2015, End-to-end convolutional network for saliency prediction, CoRR abs/1507.01422 Guo, 2014, A survey on still image based human action recognition, Pattern Recognit., 47, 3343, 10.1016/j.patcog.2014.04.018 Presti, 2016, 3D skeleton-based human action classification: a survey, Pattern Recognit., 53, 130, 10.1016/j.patcog.2015.11.019 Zhang, 2016, Rgb-d-based action recognition datasets: a survey, Pattern Recognit., 60, 86, 10.1016/j.patcog.2016.05.019 J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional activation feature for generic visual recognition, 2014. Oquab, 2014, Learning and transferring mid-level image representations using convolutional neural networks, 1717 Gkioxari, 2015, Actions and attributes from wholes and parts, 2470 Pishchulin, 2013, Poselet conditioned pictorial structures, 588 Gkioxari, 2015, Contextual action recognition with r*CNN, 1080 Gkioxari, 2015, Actions and attributes from wholes and parts, 2470 Zhang, 2016, Action recognition in still images with minimum annotation efforts, IEEE Trans. Image Process., 25, 5479, 10.1109/TIP.2016.2605305 Wang, 2017, Three-stream CNNs for action recognition, Pattern Recognit. Lett., 92, 33, 10.1016/j.patrec.2017.04.004 Ji, 2013, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 35, 221, 10.1109/TPAMI.2012.59 Tran, 2015, Learning spatiotemporal features with 3d convolutional networks, 4489 Karpathy, 2014, Large-scale video classification with convolutional neural networks, 1725 Simonyan, 2014, Two-stream convolutional networks for action recognition in videos, 568 Chéron, 2015, P-CNN: pose-based CNN features for action recognition, 3218 Donahue, 2017, Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal.Mach.Intell. (PAMI), 39, 677, 10.1109/TPAMI.2016.2599174 Fu, 1981, A survey on image segmentation, Pattern Recognit., 13, 3, 10.1016/0031-3203(81)90028-5 Zhou, 2016, Multi-scale context for scene labeling via flexible segmentation graph, Pattern Recognit., 59, 312, 10.1016/j.patcog.2016.03.023 Liu, 2015, CRF learning with CNN features for image segmentation, Pattern Recognit., 48, 2983, 10.1016/j.patcog.2015.04.019 Bu, 2016, Scene parsing using inference embedded deep networks, Pattern Recognit., 59, 188, 10.1016/j.patcog.2016.01.027 Peng, 2013, A survey of graph theoretical approaches to image segmentation, Pattern Recognit., 46, 1020, 10.1016/j.patcog.2012.09.015 Farabet, 2013, Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach.Intell. (PAMI), 35, 1915, 10.1109/TPAMI.2012.231 Couprie, 2013, Indoor semantic segmentation using depth information Pinheiro, 2014, Recurrent convolutional neural networks for scene labeling, 82 Shuai, 2015, Integrating parametric and non-parametric models for scene labeling, 4249 B. Shuai, Z. Zuo, W. Gang, Quaddirectional 2d-recurrent neural networks for image labeling 22(11) (2015b) 1990–1994. Shuai, 2016, Dag-recurrent neural networks for scene labeling, 3620 Mostajabi, 2015, Feedforward semantic segmentation with zoom-out features, 3376 Chen, 2015, Semantic image segmentation with deep convolutional nets and fully connected crfs El Ayadi, 2011, Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognit., 44, 572, 10.1016/j.patcog.2010.09.020 Deng, 1991, Phonemic hidden ,Markov models with continuous mixture output densities for large vocabulary word recognition, IEEE Trans. Signal Process., 39, 1677, 10.1109/78.134406 Hinton, 2012, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Process. Mag., 29, 82, 10.1109/MSP.2012.2205597 Deng, 2013, Recent advances in deep learning for speech research at Microsoft, 8604 Yao, 2012, Adaptation of context-dependent deep neural networks for automatic speech recognition, 366 Abdel-Hamid, 2012, Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition, 4277 Abdel-Hamid, 2014, Convolutional neural networks for speech recognition Palaz, 2013, Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks, 1766 Hoshen, 2015, Speech acoustic modeling from raw multichannel waveforms, 4624 D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al., Deep speech 2: End-to-end speech recognition in english and mandarin, 2016, pp. 173–182. Sercu, 2016, Advances in very deep convolutional neural networks for LVCSR, 3429 Tóth, 2014, Convolutional deep maxout networks for phone recognition., 1078 Sainath, 2013, Improvements to deep convolutional neural networks for LVCSR, 315 Yu, 2016, Deep convolutional neural networks with layer-wise context expansion and attention, 17 Waibel, 1989, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoustics, Speech, Signal Process., 37, 328, 10.1109/29.21701 Chen, 2014, Dnn-based stochastic postfilter for hmm-based speech synthesis., 1954 Uria, 2015, Modelling acoustic feature dependencies with artificial neural networks: Trajectory-rnade, 4465 Huang, 2017, Hierarchical bayesian combination of plug-in maximum a posteriori decoders in deep neural networks-based speech recognition and speaker adaptation, Pattern Recognit. Lett., 10.1016/j.patrec.2017.08.001 van den Oord, 2016, Pixel recurrent neural networks, 1747 Jozefowicz, 2016, Exploring the limits of language modeling Kim, 2016, Character-aware neural language models, 2741 J. Gu, C. Jianfei, G. Wang, T. Chen, Stack-captioning: coarse-to-fine learning for image captioning, volume abs/1709.03376, 2017. Wang, 2015, gen cnn: a convolutional architecture for word sequence prediction, 1567 Gu, 2017, An empirical study of language CNN for image captioning Yann N. Dauphin, 2017, Language modeling with gated convolutional networks, 933 Collobert, 2008, A unified architecture for natural language processing: deep neural networks with multitask learning, 160 Yu, 2014, Deep learning for answer sentence selection Kalchbrenner, 2014, A convolutional neural network for modelling sentences, 655 Kim, 2014, Convolutional neural networks for sentence classification, 1746 Yin, 2015, Multichannel variable-size convolution for sentence classification, 204 Collobert, 2011, Natural language processing (almost) from scratch, J. Mach. Learn. Res. (JMLR), 12, 2493 Conneau, 2016, Very deep convolutional networks for natural language processing, CoRR abs/1606.01781 Huang, 2016, Deep networks with stochastic depth, 646