Abstract
Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for “effective” and “ineffective” synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our experiments we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
- Anthology ID:
- 2024.emnlp-main.862
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15412–15425
- URL:
- https://aclanthology.org/2024.emnlp-main.862
- DOI:
- 10.18653/v1/2024.emnlp-main.862
- Cite (ACL):
- Shuhao Guan, Cheng Xu, Moule Lin, and Derek Greene. 2024. Effective Synthetic Data and Test-Time Adaptation for OCR Correction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15412–15425, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Effective Synthetic Data and Test-Time Adaptation for OCR Correction (Guan et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.862.pdf
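The abstract reports its results in terms of Character Error Rate (CER) and a percentage "CER reduction". As a rough illustration of how such figures are commonly computed, here is a minimal Python sketch. It assumes CER is the character-level Levenshtein distance divided by the reference length, and that "CER reduction" means the relative drop from the uncorrected OCR output to the corrected output; the abstract does not spell out either definition. The function names and the toy strings are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cer_reduction(cer_before: float, cer_after: float) -> float:
    """Assumed definition: relative CER reduction, in percent."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return 100.0 * (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the paper.
    reference = "the quick brown fox"
    ocr_output = "tbe qu1ck brovvn fox"
    corrected = "the quick brovvn fox"

    before = cer(reference, ocr_output)
    after = cer(reference, corrected)
    print(f"CER before correction: {before:.4f}")
    print(f"CER after correction:  {after:.4f}")
    print(f"Relative CER reduction: {cer_reduction(before, after):.2f}%")
```

Under these assumptions, a negative reduction would mean the correction model made the text worse than the raw OCR output.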