Open-Source OCR Libraries: A Comprehensive Study for Low Resource Language

Meharuniza Nazeem, Anitha R, Navaneeth S, Rajeev R. R


Abstract
This paper reviews several programs and libraries used for optical character recognition (OCR). Tesseract OCR, an open-source program that supports multiple languages and image formats, is highlighted for its accuracy and adaptability. Python-based libraries such as EasyOCR, MMOCR, and PaddleOCR are also covered; they provide user-friendly interfaces and trained models for text extraction, detection, and recognition. EasyOCR emphasizes ease of use and simplicity, while MMOCR and PaddleOCR offer comprehensive OCR capabilities and support for a wide range of languages. In our evaluation, Tesseract OCR performs remarkably well in terms of accuracy for Indian languages such as Malayalam. We focused on five OCR libraries (Tesseract OCR, MMOCR, PaddleOCR, EasyOCR, and Keras OCR) and tested them across several languages, including English, Hindi, Arabic, Tamil, and Malayalam. Of the five, Tesseract OCR was the only library that supported Malayalam. It also performed well across all tested languages, achieving accuracy rates of 92% in English, 93% in Hindi, 78% in Tamil, 74% in Arabic, and 93% in Malayalam.
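The abstract reports per-language accuracy but does not say how accuracy was measured. As a minimal sketch, assuming a character-level metric based on edit distance (one common choice, not necessarily the one the authors used), the comparison could look like the following. The helper names `edit_distance`, `char_accuracy`, and `ocr_with_tesseract` are hypothetical. The last of these assumes that the `pytesseract` and Pillow packages are installed and that Tesseract's Malayalam language data (`mal`) is available.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def ocr_with_tesseract(image_path: str, lang: str = "mal") -> str:
    """Run Tesseract on one image through pytesseract (optional dependency).

    Imports are local so the metric helpers above work without pytesseract.
    """
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

For scripts such as Malayalam, which rely on combining characters, a score computed over Unicode code points can differ from one computed over visible glyphs. The sketch counts code points.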
Anthology ID:
2024.icon-1.52
Volume:
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2024
Address:
AU-KBC Research Centre, Chennai, India
Editors:
Sobha Lalitha Devi, Karunesh Arora
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
416–421
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.icon-1.52/
Cite (ACL):
Meharuniza Nazeem, Anitha R, Navaneeth S, and Rajeev R. R. 2024. Open-Source OCR Libraries: A Comprehensive Study for Low Resource Language. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 416–421, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal):
Open-Source OCR Libraries: A Comprehensive Study for Low Resource Language (Nazeem et al., ICON 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.icon-1.52.pdf