Abstract
Many users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) – which jointly learns a vision and language model – is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters’ visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, it opens new avenues for community engagement in making digital history more representative of documentary history.- Anthology ID:
- 2024.acl-long.440
- Volume:
- Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 8105–8115
- Language:
- URL:
- https://aclanthology.org/2024.acl-long.440
- DOI:
- Cite (ACL):
- Jacob Carlson, Tom Bryan, and Melissa Dell. 2024. Efficient OCR for Building a Diverse Digital History. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8105–8115, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Efficient OCR for Building a Diverse Digital History (Carlson et al., ACL 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.acl-long.440.pdf