A system for detection of moving caption text in videos: a news use case

Multimedia Tools and Applications - Volume 80 - Pages 25607-25631 - 2021
Hossam Elshahaby1, Mohsen Rashwan1
1Cairo University, Cairo, Egypt

Abstract

Extraction of news text captions aims at a digital understanding of what is happening in a specific region during a certain period; because plain text is easily translated from one language to another, it supports better communication between different nations. Moving text captions cause blurring, a significant source of text-quality impairment in news channels. Most existing caption-detection models fail to capture the different dynamic motions of captions, to assemble a full news story across several frames of the sequence, to resolve the blurring caused by text motion, to remain language independent, or to offer an end-to-end solution for the community to use. We process the incoming frame sequence and extract edge features using either the Hough transform or our color-based technique. We then verify text existence with a pre-trained Convolutional Neural Network (CNN) text-detection model. We analyze the caption motion status with a hybrid of a pre-trained Recurrent Neural Network (RNN) of the Long Short-Term Memory (LSTM) type and a correlation-based model. When the motion is determined to be horizontal rotation (a continuously scrolling ticker), two problems arise. First, the text keeps moving without pausing, producing strong blur that degrades text quality and, consequently, character-recognition accuracy. Second, successive news stories are separated only by the channel logo or long spaces. We address the first problem by deblurring the text image using either Bicubic Spline Interpolation (BSI) or a Denoising Autoencoder Neural Network (DANN). We address the second with a Point Feature Matching (PFM) technique that matches the on-screen channel logo against a channel-logo database (ground truth). We evaluate our framework using the Abbyy® SDK, a standalone text-recognition tool supporting different languages.
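The correlation-based arm of the motion-analysis step can be illustrated with a minimal numpy sketch that estimates the horizontal displacement of a caption strip between two frames via FFT cross-correlation. The function name and the strip-averaging are our illustrative assumptions, not the authors' code:

```python
import numpy as np

def estimate_horizontal_shift(prev_strip, curr_strip):
    """Estimate the horizontal displacement (in pixels) between two
    caption strips using circular cross-correlation computed with the
    FFT. Positive values mean the text moved to the right."""
    # Collapse each 2-D strip to a 1-D intensity profile along the
    # scroll axis, then remove the mean so flat regions do not dominate.
    a = prev_strip.mean(axis=0)
    b = curr_strip.mean(axis=0)
    a -= a.mean()
    b -= b.mean()
    # Circular cross-correlation: the peak location is the shift of the
    # current profile relative to the previous one.
    corr = np.fft.ifft(np.fft.fft(b) * np.conj(np.fft.fft(a))).real
    shift = int(np.argmax(corr))
    n = len(a)
    # Map the circular index to a signed shift in [-n/2, n/2).
    return shift - n if shift > n // 2 else shift
```

A near-zero estimate over several frame pairs would indicate a static caption, while a steady nonzero estimate indicates a scrolling ticker.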
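The BSI deblurring step rests on bicubic (order-3) spline interpolation. A minimal sketch of that idea uses `scipy.ndimage.zoom` to upscale a low-quality caption crop before OCR; the function name and the default 2x factor are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.ndimage import zoom

def upscale_caption(gray, factor=2):
    """Upscale a grayscale caption crop with bicubic spline
    interpolation (spline order 3). Upscaling before OCR is a common
    way to soften motion-blur artifacts in low-resolution tickers."""
    out = zoom(gray.astype(np.float64), factor, order=3)
    # Cubic splines can overshoot; clamp back to the valid pixel range.
    return np.clip(out, 0.0, 255.0)
```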
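The PFM step matches the on-screen logo against a logo database. As a dependency-free illustration of the underlying matching idea, here is a normalized cross-correlation template matcher in plain numpy; this is a simplified stand-in for point-feature matching, not the authors' method, and all names are hypothetical:

```python
import numpy as np

def locate_logo(frame, logo):
    """Slide `logo` over `frame` and return ((row, col), score) for the
    best normalized cross-correlation match. Scores near 1.0 indicate
    the channel logo is present at that position."""
    lh, lw = logo.shape
    tmpl = logo - logo.mean()
    tnorm = np.linalg.norm(tmpl)
    best_score, best_pos = -1.0, (0, 0)
    for r in range(frame.shape[0] - lh + 1):
        for c in range(frame.shape[1] - lw + 1):
            win = frame[r:r + lh, c:c + lw]
            win = win - win.mean()
            denom = np.linalg.norm(win) * tnorm
            if denom == 0.0:
                continue  # flat window: correlation undefined, skip
            score = float((win * tmpl).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

Running this against every logo in the ground-truth database and keeping the highest score identifies the channel; a uniformly low best score at an expected separator position suggests a story boundary marked by blank space instead.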
