Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian

Shu Okabe, Alexander Fraser


Abstract
Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings to two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models with word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with post-processing the mined sentences with an unsupervised word aligner based on word embeddings. We observe the usefulness of additional pre-training on Upper Sorbian, which leads to direct improvements when mining not only that language but also the closely related Lower Sorbian.
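To make the mining step concrete, below is a minimal sketch of margin-based parallel sentence mining with multilingual sentence embeddings (in the style of Artetxe & Schwenk's ratio-margin criterion). The choice of the off-the-shelf LaBSE encoder and the Upper Sorbian/German example sentences are illustrative assumptions for this sketch, not the exact models or data used in the paper.

```python
# Sketch: margin-based bilingual sentence mining with sentence embeddings.
# Encoder choice (LaBSE) and example sentences are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer


def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin score between every source/target sentence pair."""
    sim = src_emb @ tgt_emb.T  # cosine similarities (embeddings are L2-normalised)
    # Average similarity to the k nearest neighbours in each direction.
    knn_src = -np.sort(-sim, axis=1)[:, :k].mean(axis=1)    # per source sentence
    knn_tgt = -np.sort(-sim.T, axis=1)[:, :k].mean(axis=1)  # per target sentence
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2
    return sim / denom


model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["Witaj do Budyšina."]                                 # Upper Sorbian (hypothetical)
tgt = ["Willkommen in Bautzen.", "Das Wetter ist schön."]    # German candidates (hypothetical)

src_emb = model.encode(src, normalize_embeddings=True)
tgt_emb = model.encode(tgt, normalize_embeddings=True)

# k should not exceed the size of the candidate pool; with two candidates we use k=1.
scores = margin_scores(src_emb, tgt_emb, k=1)
best = scores.argmax(axis=1)
for i, j in enumerate(best):
    print(src[i], "|||", tgt[j], f"(margin score: {scores[i, j]:.3f})")
```

In a full pipeline, pairs whose margin score falls below a threshold would be discarded, and (as described in the abstract) the surviving pairs can be further filtered with an unsupervised, embedding-based word aligner.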
Anthology ID:
2025.computel-main.2
Volume:
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
March
Year:
2025
Address:
Honolulu, Hawaii, USA
Editors:
Jordan Lachler, Godfred Agyapong, Antti Arppe, Sarah Moeller, Aditi Chaudhary, Shruti Rijhwani, Daisy Rosenblum
Venues:
ComputEL | WS
Publisher:
Association for Computational Linguistics
Pages:
11–19
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.2/
Cite (ACL):
Shu Okabe and Alexander Fraser. 2025. Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian. In Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 11–19, Honolulu, Hawaii, USA. Association for Computational Linguistics.
Cite (Informal):
Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian (Okabe & Fraser, ComputEL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.2.pdf