A system for detection of moving caption text in videos: a news use case
Abstract
Extraction of news text captions aims at a digital understanding of what is happening in a specific region during a certain period; because plain text can easily be translated from one language to another, it also improves communication between nations. Moving text captions introduce motion blur, a significant source of text-quality impairment on news channels. Most existing caption-detection models fail to capture the varied motion dynamics of captions, assemble a full news story across several frames in a sequence, resolve the blurring effect of text motion, remain language independent, or offer an end-to-end solution for the community to use. Our system processes the incoming frames in sequence and extracts edge features using either the Hough transform or our color-based technique. It verifies the existence of text with a pre-trained Convolutional Neural Network (CNN) text detector, and analyzes the caption's motion status with a hybrid of a pre-trained Recurrent Neural Network (RNN) of the Long Short-Term Memory (LSTM) type and a correlation-based model. When the motion is determined to be horizontal rotation (a continuously scrolling ticker), two problems arise. First, the text keeps moving without stopping, producing a strong blurring effect that degrades text quality and consequently lowers character-recognition accuracy. Second, successive news stories are separated only by the channel logo or long spaces. We solve the first problem by deblurring the text image using either the Bicubic Spline Interpolation (BSI) technique or a Denoising Autoencoder Neural Network (DANN). We solve the second problem with a Point Feature Matching (PFM) technique that matches the on-screen channel logo against a ground-truth database of channel logos. We evaluate our framework using the Abbyy® SDK, a standalone text-recognition tool that supports multiple languages.
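The correlation-based part of the motion analysis above can be illustrated with a minimal sketch: estimate how far a caption strip has shifted horizontally between two consecutive frames by maximizing the normalized cross-correlation of their column-intensity profiles. This is an illustrative assumption about the input representation (each strip reduced to a 1-D list of column sums), not the paper's actual implementation.

```python
# Hypothetical sketch of a correlation-based caption motion check:
# find the horizontal shift that best aligns two column-intensity
# profiles taken from the same caption strip in consecutive frames.

def best_shift(profile_a, profile_b, max_shift=10):
    """Return the shift s (in columns) maximizing the normalized
    cross-correlation between profile_a[i] and profile_b[i + s].
    Positive s means the caption moved right; 0 means static."""
    n = len(profile_a)
    best, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(n, n - s)  # valid overlap region
        pairs = [(profile_a[i], profile_b[i + s]) for i in range(lo, hi)]
        if len(pairs) < 2:
            continue
        ma = sum(a for a, _ in pairs) / len(pairs)
        mb = sum(b for _, b in pairs) / len(pairs)
        num = sum((a - ma) * (b - mb) for a, b in pairs)
        da = sum((a - ma) ** 2 for a, _ in pairs) ** 0.5
        db = sum((b - mb) ** 2 for _, b in pairs) ** 0.5
        score = num / (da * db) if da and db else 0.0
        if score > best_score:
            best_score, best = score, s
    return best

# A scrolling caption appears shifted by a constant number of columns
# each frame; a static caption yields a best shift of 0.
frame1 = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
frame2 = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0]  # same pattern moved right by 2
```

Tracking the estimated shift over several frames distinguishes a static caption (shift near 0) from a rotating ticker (constant nonzero shift), which is the decision the hybrid LSTM/correlation model makes in the system.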
