Vehicle theft recognition from surveillance video based on a spatio-temporal attention mechanism

Springer Science and Business Media LLC - Volume 51 - Pages 2128-2143 - 2020
Lijun He1, Shuai Wen1, Liejun Wang2, Fan Li1
1School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an, China
2College of Information Science and Engineering, Xinjiang University, Urumqi, China

Abstract

The continuing rise in vehicle thefts has a severely negative impact on public safety. Thanks to surveillance devices deployed throughout cities, a large volume of video is available for recognizing vehicle theft. However, theft behavior is characterized by a small criminal target and limited motion, so existing action-recognition algorithms cannot be applied directly to vehicle-theft recognition. In this paper, we propose a vehicle-theft recognition method based on a spatio-temporal attention mechanism. First, a vehicle-theft database is established by collecting videos from the Internet and from an existing dataset. We then construct a vehicle-theft recognition network and introduce a spatio-temporal attention mechanism when extracting the spatio-temporal features of theft behavior. Through adaptive feature-weight learning, the features that contribute most to recognition are emphasized. Simulation experiments show that our proposed algorithm achieves an accuracy of 97.04% on the collected vehicle-theft database.
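The abstract does not give implementation details for the adaptive feature-weight learning. As a loose sketch only, the reweighting idea resembles squeeze-and-excitation style channel attention: pool each feature channel, pass the pooled vector through a small gating network, and scale channels by the resulting weights. The function and the toy weight matrices `w1`/`w2` below are illustrative assumptions, not the paper's actual network:

```python
import math

def squeeze_excite(feature_maps, w1, w2):
    """Illustrative squeeze-and-excitation style channel attention.

    feature_maps: list of C channels, each a flat list of activations
    w1, w2: toy C-by-C weight matrices (assumed, not the paper's parameters)
    """
    # Squeeze: global average pooling over each channel
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: tiny two-layer gate, ReLU then sigmoid
    h = [max(0.0, sum(w1[i][j] * z[j] for j in range(len(z))))
         for i in range(len(z))]
    s = [1.0 / (1.0 + math.exp(-sum(w2[i][j] * h[j] for j in range(len(h)))))
         for i in range(len(h))]
    # Reweight: channels with larger attention weights are emphasized
    return [[s[i] * v for v in feature_maps[i]]
            for i in range(len(feature_maps))]
```

With identity gate weights, an all-ones channel is scaled by sigmoid(1) while an all-zeros channel stays at zero, so the more active channel dominates the reweighted features. A spatio-temporal variant would pool and reweight over frames as well as spatial positions.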

Keywords

#vehicle theft #recognition #surveillance video #spatio-temporal attention mechanism #recognition algorithm
