Cheap Character Noise for OCR-Robust Multilingual Embeddings

Andrianos Michail, Juri Opitz, Yining Wang, Robin Meister, Rico Sennrich, Simon Clematide


Abstract
The large amount of text collections digitized by imperfect OCR systems requires semantic search models that perform robustly on noisy input. Such collections are highly heterogeneous, with varying degrees of OCR quality, spelling conventions and other inconsistencies —all phenomena that are underrepresented in the training data of standard embedding models, with ramifications for their generalization. In our paper, we show that this problem can be alleviated with a simple and inexpensive method that does not require supervision or in-domain training. Specifically, we fine-tune existing multilingual models using noisy texts and a contrastive loss. We show that these models show considerable improvements across different noise conditions. Control experiments indicate minimal, and occasionally positive, impact on standard similarity tasks. These findings suggest that embedding models can be inexpensively adapted for cross-lingual semantic search in heterogeneous, digitized corpora. We publicly release our code, datasets, and models at https://github.com/impresso/ocr-robust-multilingual-embeddings.
Anthology ID:
2025.findings-acl.609
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
11705–11716
Language:
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.609/
DOI:
Bibkey:
Cite (ACL):
Andrianos Michail, Juri Opitz, Yining Wang, Robin Meister, Rico Sennrich, and Simon Clematide. 2025. Cheap Character Noise for OCR-Robust Multilingual Embeddings. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11705–11716, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Cheap Character Noise for OCR-Robust Multilingual Embeddings (Michail et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.609.pdf