Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gomez, Tony Montes, Arturo Rodriguez Herrera, Ruben Manrique
Abstract
This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
- Anthology ID:
- 2024.nlp4dh-1.13
- Volume:
- Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
- Month:
- November
- Year:
- 2024
- Address:
- Miami, USA
- Editors:
- Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
- Venues:
- NLP4DH | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 132–139
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2024.nlp4dh-1.13/
- DOI:
- 10.18653/v1/2024.nlp4dh-1.13
- Cite (ACL):
- Laura Manrique-Gomez, Tony Montes, Arturo Rodriguez Herrera, and Ruben Manrique. 2024. Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 132–139, Miami, USA. Association for Computational Linguistics.
- Cite (Informal):
- Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction (Manrique-Gomez et al., NLP4DH 2024)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2024.nlp4dh-1.13.pdf