@inproceedings{soper-etal-2021-bart,
    title = "{BART} for Post-Correction of {OCR} Newspaper Text",
    author = "Soper, Elizabeth  and
      Fujimoto, Stanley  and
      Yu, Yen-Yun",
    editor = "Xu, Wei  and
      Ritter, Alan  and
      Baldwin, Tim  and
      Rahimi, Afshin",
    booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2021.wnut-1.31/",
    doi = "10.18653/v1/2021.wnut-1.31",
    pages = "284--290",
    abstract = "Optical character recognition (OCR) from newspaper page images is susceptible to noise due to degradation of old documents and variation in typesetting. In this report, we present a novel approach to OCR post-correction. We cast error correction as a translation task, and fine-tune BART, a transformer-based sequence-to-sequence language model pretrained to denoise corrupted text. We are the first to use sentence-level transformer models for OCR post-correction, and our best model achieves a 29.4{\%} improvement in character accuracy over the original noisy OCR text. Our results demonstrate the utility of pretrained language models for dealing with noisy text."
}