Effective Synthetic Data and Test-Time Adaptation for OCR Correction

Shuhao Guan; Cheng Xu; Moule Lin; Derek Greene

doi:10.18653/v1/2024.emnlp-main.862

Abstract

Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for “effective” and “ineffective” synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our experiments, we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
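For readers unfamiliar with the metric, the following is a minimal sketch of how CER and a relative CER reduction are commonly computed. It is not taken from the paper: the function names are hypothetical, CER is assumed to be the Levenshtein distance divided by the reference length, and the reduction is assumed to be relative to the CER of the uncorrected OCR output. The abstract does not state the CER thresholds for “effective” and “ineffective” synthetic data, so none are encoded here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalised by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def cer_reduction(cer_before: float, cer_after: float) -> float:
    """Relative CER reduction in percent (assumed definition)."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return 100.0 * (cer_before - cer_after) / cer_before
```

Under these assumptions, `cer_before` would be measured on the raw OCR text and `cer_after` on the model's corrected text, both against the same ground truth.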

Anthology ID: 2024.emnlp-main.862
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15412–15425
URL: https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.emnlp-main.862/
DOI: 10.18653/v1/2024.emnlp-main.862
Cite (ACL): Shuhao Guan, Cheng Xu, Moule Lin, and Derek Greene. 2024. Effective Synthetic Data and Test-Time Adaptation for OCR Correction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15412–15425, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Effective Synthetic Data and Test-Time Adaptation for OCR Correction (Guan et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.emnlp-main.862.pdf
