Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Zaruhi Navasardyan, Bagratuni Minsayan, Spartak Bughdaryan, Hrant Davtyan
Abstract
Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11–12% average improvements across the benchmark, with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. We release the model, data, and the benchmark at this https URL to facilitate further research.
- Anthology ID: 2026.loreslm-1.31
- Volume: Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
- Venue: LoResLM
- Publisher: Association for Computational Linguistics
- Pages: 362–370
- URL: https://preview.aclanthology.org/manual-author-scripts/2026.loreslm-1.31/
- Cite (ACL): Zaruhi Navasardyan, Bagratuni Minsayan, Spartak Bughdaryan, and Hrant Davtyan. 2026. Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 362–370, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data (Navasardyan et al., LoResLM 2026)
- PDF: https://preview.aclanthology.org/manual-author-scripts/2026.loreslm-1.31.pdf