Improving Parallel Sentence Mining for Low-Resource and Endangered Languages

Shu Okabe, Katharina Hämmerl, Alexander Fraser


Abstract
While parallel sentence mining has been extensively covered for fairly well-resourced languages, pairs involving low-resource languages have received comparatively little attention. To address this gap, we present Belopsem, a benchmark of new datasets for parallel sentence mining on three language pairs where the source side is low-resource and endangered: Occitan-Spanish, Upper Sorbian-German, and Chuvash-Russian. These combinations also reflect varying linguistic similarity within each pair. We compare three language models in an established parallel sentence mining pipeline and apply two types of improvements to one of them, Glot500. We observe better mining quality overall by both applying alignment post-processing with an unsupervised aligner and using a cluster-based isotropy enhancement technique. These findings are crucial for optimising parallel data extraction for low-resource languages in a realistic way.
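The paper itself details the pipeline; as a rough illustration of two ingredients named in the abstract, the sketch below shows a cluster-based isotropy enhancement step (per-cluster mean-centring of sentence embeddings, in the spirit of cluster-based isotropy calibration) and a margin-based similarity score of the kind commonly used for mining parallel sentence pairs. The function names, the naive k-means, and the ratio-margin formulation are our own illustrative assumptions, not the authors' implementation; a real pipeline would use multilingual encoder embeddings (e.g. Glot500) and an efficient nearest-neighbour index.

```python
import numpy as np

def cluster_center(embeddings, n_clusters=2, n_iter=10, seed=0):
    """Cluster-based isotropy enhancement (sketch): k-means-cluster the
    sentence embeddings, then subtract each cluster's mean so the space
    is centred per cluster. Fuller variants also remove dominant PCA
    directions per cluster; mean-centring is the core step shown here."""
    rng = np.random.default_rng(seed)
    # Naive k-means to avoid external dependencies (illustrative only).
    centers = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = embeddings[labels == k].mean(axis=0)
    return embeddings - centers[labels]

def margin_score(src, tgt, k=4):
    """Ratio-margin similarity (Artetxe & Schwenk style): cosine
    similarity normalised by the mean similarity to each sentence's
    k nearest neighbours on both sides. Higher scores suggest a
    candidate parallel pair."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / ((knn_src + knn_tgt) / 2)
```

After centring, each cluster's embeddings sum to zero, which mitigates the anisotropy of raw contextual embedding spaces before similarities are computed.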
Anthology ID:
2025.acl-short.17
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
196–205
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.acl-short.17/
Cite (ACL):
Shu Okabe, Katharina Hämmerl, and Alexander Fraser. 2025. Improving Parallel Sentence Mining for Low-Resource and Endangered Languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–205, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Improving Parallel Sentence Mining for Low-Resource and Endangered Languages (Okabe et al., ACL 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.acl-short.17.pdf