Segmentation-free optical character recognition for printed Urdu text
Abstract
This paper presents a segmentation-free optical character recognition system for printed Urdu text in the Nastaliq font, using ligatures as the units of recognition. The proposed technique relies on statistical features and employs Hidden Markov Models (HMMs) for classification. Our study considers a total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Images (UPTI) database. Ligatures extracted from text lines are first split into primary (main body) and secondary (dots and diacritics) ligatures, and multiple instances of the same ligature are grouped into clusters using a sequential clustering algorithm. A separate HMM is trained for each ligature on the examples in its cluster: overlapping windows are slid over each ligature image from right to left, and a set of statistical features is extracted from every window. For a query text, the primary and secondary ligatures are recognized separately and then associated using a set of heuristics to recognize the complete ligature. Evaluated on the standard UPTI database, the system achieves a ligature recognition rate of 92% on more than 6000 query ligatures.
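The right-to-left sliding-window step can be illustrated with a minimal sketch. The abstract does not give the window width, the overlap, or the actual statistical features, so everything specific here is an assumption made for illustration: the window size `win`, the shift `step`, and the two features (ink density and normalized vertical centroid), as well as the function names `sliding_windows_rtl`, `window_features`, and `extract_sequence`. The resulting feature sequence is the kind of observation sequence that could be fed to an HMM.

```python
from typing import List

Image = List[List[int]]  # binary ligature image: rows of 0/1 pixels, 1 = ink


def sliding_windows_rtl(width: int, win: int, step: int) -> List[range]:
    """Return column ranges of overlapping windows, ordered right to left.

    `win` and `step` are illustrative parameters; windows overlap when
    step < win.
    """
    if win <= 0 or step <= 0:
        raise ValueError("win and step must be positive")
    windows = []
    right = width
    while right > 0:
        left = max(0, right - win)
        windows.append(range(left, right))
        if left == 0:  # reached the left edge of the image
            break
        right -= step
    return windows


def window_features(img: Image, cols: range) -> List[float]:
    """Two hypothetical statistical features for one window:
    ink density and the vertical centre of mass normalized by height."""
    height = len(img)
    ink = 0
    row_sum = 0
    for r, row in enumerate(img):
        count = sum(row[c] for c in cols)
        ink += count
        row_sum += r * count
    density = ink / (height * len(cols))
    centroid = (row_sum / ink) / height if ink else 0.0
    return [density, centroid]


def extract_sequence(img: Image, win: int = 4, step: int = 2) -> List[List[float]]:
    """Feature vectors for all windows, in right-to-left reading order."""
    width = len(img[0]) if img else 0
    return [window_features(img, cols)
            for cols in sliding_windows_rtl(width, win, step)]
```

For a ligature image 10 columns wide with `win=4` and `step=2`, the windows cover columns 6 to 9, 4 to 7, 2 to 5, and 0 to 3, so the first feature vector describes the rightmost part of the ligature, matching the reading direction of Urdu.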