OCR Quality and NLP Preprocessing
Abstract
We present initial experiments to evaluate the performance of tasks such as part-of-speech tagging on data corrupted by Optical Character Recognition (OCR). Our results, based on English and German data and obtained from artificial experiments as well as initial real OCRed data, indicate that even a small drop in OCR quality considerably increases error rates, which would have a significant impact on subsequent processing steps.
- Anthology ID: W19-3633
- Volume: Proceedings of the 2019 Workshop on Widening NLP
- Month: August
- Year: 2019
- Address: Florence, Italy
- Editors: Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, Zeerak Waseem
- Venue: WiNLP
- Publisher: Association for Computational Linguistics
- Pages: 102–105
- URL: https://aclanthology.org/W19-3633
- Cite (ACL): Margot Mieskes and Stefan Schmunk. 2019. OCR Quality and NLP Preprocessing. In Proceedings of the 2019 Workshop on Widening NLP, pages 102–105, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal): OCR Quality and NLP Preprocessing (Mieskes & Schmunk, WiNLP 2019)