OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Journal of Computational Social Science - Volume 5, Issue 1 - Pages 861-882 - 2022
Thomas Hegghammer
Norwegian Defence Research Establishment (FFI), Kjeller, Norway

Abstract

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
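The corpus and request totals in the abstract can be checked with a little arithmetic. The sketch below is illustrative only. It assumes that each scan exists in 44 versions (the original plus 43 noisy replicas) and that the Arabic scans were sent to only two of the three engines. The abstract does not state the second assumption, but it is the split that reproduces the reported 51,304 requests. The function and variable names are hypothetical.

```python
# Sanity check of the corpus figures reported in the abstract.
# ASSUMPTION: 44 versions per scan (1 original + 43 noisy replicas).
# ASSUMPTION: English scans go to 3 engines, Arabic scans to only 2.
# The abstract does not state the per-language engine coverage.

ENGLISH_SCANS = 322
ARABIC_SCANS = 100
NOISY_REPLICAS = 43
VERSIONS_PER_SCAN = NOISY_REPLICAS + 1  # original plus noisy replicas


def corpus_size(english: int, arabic: int, versions: int) -> int:
    """Total number of document images across both languages."""
    return (english + arabic) * versions


def process_requests(english: int, arabic: int, versions: int,
                     english_engines: int = 3, arabic_engines: int = 2) -> int:
    """Total OCR requests, given how many engines handle each language."""
    return (english * versions * english_engines
            + arabic * versions * arabic_engines)


if __name__ == "__main__":
    print(corpus_size(ENGLISH_SCANS, ARABIC_SCANS, VERSIONS_PER_SCAN))
    print(process_requests(ENGLISH_SCANS, ARABIC_SCANS, VERSIONS_PER_SCAN))
```

Under these assumptions both totals match the abstract: 422 scans in 44 versions give 18,568 documents, and 42,504 English requests plus 8,800 Arabic requests give 51,304.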
