Enhancing medical text detection with vision-language pre-training and efficient segmentation

Tianyang Li (1, 2), Jinxu Bai (2), Qingzhu Wang (2)
(1) Jiangxi New Energy Technology Institute, Xinyu, China
(2) College of Computer Science and Technology, Northeast Electric Power University, Jilin, China

Abstract

Detecting text in medical images is a difficult computer vision problem because backgrounds are complex, text is densely packed, and text instances can have extreme aspect ratios. This paper introduces an effective and accurate text detection system designed for these conditions. The system combines an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). The segmentation head integrates three components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion module with RSEConv (MSFM-RSE), which performs multi-scale feature fusion using RSEConv. In the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and channel attention and thereby increase the representational capacity of the feature maps. The EFEM is a cascaded U-shaped module that uses a spatial attention mechanism to introduce multi-level information and improve segmentation performance. The MSFM-RSE then fuses features from different depths and scales of the EFEM to produce the final features used for segmentation. The post-processing module uses a differentiable binarization strategy, which lets the segmentation network determine the binarization threshold dynamically. In addition to these improvements, we introduce a vision-language pre-training model trained extensively on a variety of visual language understanding tasks. This pre-trained model learns detailed visual and semantic representations, which further improve the accuracy and robustness of text detection when it is integrated with the segmentation module. We evaluated the proposed model on medical text image datasets, and multiple benchmark experiments show that it outperforms existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
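
As a rough illustration of the differentiable binarization step mentioned in the abstract, the sketch below applies the soft step function B = 1 / (1 + exp(-k (P - T))) to a probability map P and a threshold map T. This is the formulation from the original differentiable binarization work (Liao et al., 2020), not something taken from the VLDBNet code. The function names, the toy maps, and the amplification factor k = 50 are assumptions made for illustration.

```python
import math


def differentiable_binarize(prob, thresh, k=50.0):
    """Soft, differentiable replacement for the hard step `prob > thresh`.

    prob   -- predicted text probability for one pixel, in [0, 1]
    thresh -- predicted (learned) threshold for the same pixel, in [0, 1]
    k      -- amplification factor; 50 is an assumed value, not from this paper
    """
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


def binarize_map(prob_map, thresh_map, k=50.0):
    """Apply the soft step pixel by pixel to two equally sized 2D lists."""
    return [
        [differentiable_binarize(p, t, k) for p, t in zip(prob_row, thresh_row)]
        for prob_row, thresh_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 probability and threshold maps.
prob_map = [[0.9, 0.2], [0.5, 0.7]]
thresh_map = [[0.3, 0.3], [0.5, 0.4]]
approx_binary_map = binarize_map(prob_map, thresh_map)
```

Because the soft step is smooth, gradients can flow through both the probability map and the threshold map during training, which is what allows the network to learn a per-pixel threshold instead of relying on a fixed one.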
