Moule Lin
2025
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Shuhao Guan | Moule Lin | Cheng Xu | Xinyi Liu | Jinman Zhao | Jiexin Fan | Qi Xu | Derek Greene
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
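A minimal sketch of the synthetic-pair construction the abstract describes: render plain text to a clean page image, then apply a randomly ordered subset of degradation operations to produce (degraded, clean) training pairs for a restoration model. The specific degradation functions, parameter ranges, and helper names below are illustrative assumptions, not the paper's actual implementation.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_page(text: str, size=(512, 256)) -> Image.Image:
    """Render plain text onto a white page (the clean target image)."""
    page = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(page)
    draw.multiline_text((16, 16), text, fill=0, font=ImageFont.load_default())
    return page

def add_gaussian_noise(img: Image.Image) -> Image.Image:
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0, 20, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def blur(img: Image.Image) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))

def rotate(img: Image.Image) -> Image.Image:
    return img.rotate(random.uniform(-2, 2), fillcolor=255)

DEGRADATIONS = [add_gaussian_noise, blur, rotate]

def make_pair(text: str):
    """Build one (degraded, clean) pair with a randomly ordered degradation set."""
    clean = render_page(text)
    degraded = clean.copy()
    ops = random.sample(DEGRADATIONS, k=random.randint(1, len(DEGRADATIONS)))
    for op in ops:  # random subset, applied in random order, per the abstract
        degraded = op(degraded)
    return degraded, clean

degraded, clean = make_pair("PreP-OCR synthesizes image pairs from plain text.")
```

In the full pipeline such pairs would also vary fonts and layouts; the restoration model is then trained to map degraded pages back to their clean renderings.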
2024
Effective Synthetic Data and Test-Time Adaptation for OCR Correction
Shuhao Guan | Cheng Xu | Moule Lin | Derek Greene
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for “effective” and “ineffective” synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our experiments, we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
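A minimal sketch of the CER-threshold filtering idea: compute the character error rate of each synthetic (noisy, clean) pair and keep only pairs whose noise level falls inside an "effective" band. The threshold values and function names here are illustrative assumptions, not the values reported in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(noisy: str, clean: str) -> float:
    """Character error rate of the noisy text against the clean reference."""
    return edit_distance(noisy, clean) / max(len(clean), 1)

def filter_effective(pairs, low=0.02, high=0.30):
    """Keep pairs whose noise level lies in the 'effective' CER band (assumed bounds)."""
    return [(n, c) for n, c in pairs if low <= cer(n, c) <= high]

pairs = [("Th3 qvick brown f0x", "The quick brown fox"),   # CER ~ 0.16, kept
         ("The quick brown fox", "The quick brown fox")]   # CER = 0, too clean
print(filter_effective(pairs))
```

The same CER measure could also drive the test-time loop the abstract sketches, with the self-correction and noise-generation steps alternating on unlabeled test data.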