MOrpho-LEXical analysis for correcting OCR-generated Arabic words (MOLEX)

T. Sari1, M. Sellami1
1Laboratoire LRI, Département d'Informatique, Université Badji Mokhtar Annaba, Algeria

Abstract

In this paper we present a context-based method for correcting Arabic words generated by OCR systems. The technique operates as a post-processor and is intended to be universal. It corrects substitution and rejection errors. The properties of the Arabic language are very useful in morpho-lexical analysis and are therefore heavily exploited in the development of the method. Substitution errors, the errors most frequently committed by OCR systems, are rewritten as production rules to be used by a rule-based system for correcting Arabic words. The first version of the method operates only at the morpho-lexical level; extension to the other levels of language analysis is left as future work.
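The idea of rewriting substitution errors as production rules and applying them in a post-processor can be illustrated with a minimal Python sketch. This is not the MOLEX implementation: the confusion pairs in `SUBSTITUTION_RULES`, the `REJECT` marker, the `max_subs` budget, and the function names are all assumptions made for illustration, and a plain lexicon lookup stands in for the morpho-lexical analysis described in the paper.

```python
# Hypothetical production rules: recognized letter -> letters it may stand for.
# The pairs below (letters that differ mainly by dots) are illustrative only;
# they are not taken from the paper.
SUBSTITUTION_RULES = {
    "ب": ["ت", "ث", "ن"],
    "ح": ["ج", "خ"],
    "ر": ["ز"],
    "د": ["ذ"],
}

REJECT = "?"  # assumed marker for a character the OCR system refused to label
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def expand(word, max_subs=2):
    """Yield (candidate, substitutions_used) pairs for an OCR output word."""

    def walk(i, prefix, used):
        if i == len(word):
            yield prefix, used
            return
        ch = word[i]
        if ch == REJECT:
            # A rejected position must be filled; try every letter.
            for letter in ALPHABET:
                yield from walk(i + 1, prefix + letter, used)
            return
        # Keep the recognized letter as it is.
        yield from walk(i + 1, prefix + ch, used)
        # Or apply a production rule, while the substitution budget allows.
        if used < max_subs:
            for alt in SUBSTITUTION_RULES.get(ch, ()):
                yield from walk(i + 1, prefix + alt, used + 1)

    yield from walk(0, "", 0)


def correct(word, lexicon, max_subs=2):
    """Return lexicon words reachable from `word`, fewest substitutions first."""
    if word in lexicon:
        return [word]
    hits = {}
    for cand, used in expand(word, max_subs):
        if cand in lexicon:
            hits[cand] = min(used, hits.get(cand, used))
    return sorted(hits, key=lambda w: (hits[w], w))


if __name__ == "__main__":
    lexicon = {"كتاب", "بيت"}          # toy lexicon, for illustration only
    print(correct("كباب", lexicon))    # substitution error on the second letter
    print(correct("كت?ب", lexicon))    # rejection error on the third letter
```

In this sketch a word already in the lexicon is returned unchanged, and candidates are ranked by how many rules had to fire. A real system would replace the lexicon lookup with morphological validation (for example, checking affixes and roots) and derive the rules from the observed confusions of the OCR system.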

Keywords

#Optical character recognition software #Dictionaries #Error correction #Hidden Markov models #Speech recognition #Production systems #Knowledge based systems #Natural language processing #Acoustics #Heart
