Retrieval of Parallelizable Texts Across Church Slavic Variants
Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, Elena Renje
Abstract
The goal of our study is to identify parallelizable texts for Church Slavic across chronological and regional variants. In addition to a benchmark text, we use a recently digitized, large text collection and compile new resources for the retrieval of similar texts: a ground truth dataset holding a small number of manually aligned sentences in Old Church Slavic and in Old East Slavic, and a large unaligned dataset in which a subset of texts is of ground truth (GT) quality while the majority of the collection contains noise from handwritten text recognition (HTR). We discuss preprocessing challenges in the data and the impact of sentence segmentation on retrieval performance. We evaluate the mapping of sentence snippets across these two diachronic variants of Church Slavic in terms of mean reciprocal rank, using embedding representations from large language models (LLMs) as well as classical string similarity based approaches combined with k-nearest neighbor (kNN) search. Experimental results indicate that in the current setup (short text snippets, off-the-shelf multilingual embeddings), classical string similarity based retrieval can still outperform embedding based retrieval.
- Anthology ID:
- 2025.vardial-1.8
- Volume:
- Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
- Venues:
- VarDial | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 105–114
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2025.vardial-1.8/
- Cite (ACL):
- Piroska Lendvai, Uwe Reichel, Anna Jouravel, Achim Rabus, and Elena Renje. 2025. Retrieval of Parallelizable Texts Across Church Slavic Variants. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 105–114, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Retrieval of Parallelizable Texts Across Church Slavic Variants (Lendvai et al., VarDial 2025)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2025.vardial-1.8.pdf
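
The abstract describes string similarity based retrieval scored by mean reciprocal rank (MRR). The following is a minimal sketch of that kind of evaluation, not the authors' implementation: the abstract does not name the similarity measure, so character-trigram Jaccard overlap is an assumption chosen here for illustration, and all function names and the toy strings are hypothetical.

```python
# Illustrative sketch only: the similarity measure (character-trigram Jaccard)
# and every name below are assumptions, not taken from the paper.

def char_ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending similarity to the query.

    Ties are broken by the lower candidate index so the ranking is deterministic.
    """
    scored = [(jaccard(query, cand), idx) for idx, cand in enumerate(candidates)]
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


def mean_reciprocal_rank(queries, candidates, gold):
    """Average of 1/rank of the gold candidate over all queries.

    gold[i] is the index in `candidates` of the correct match for queries[i].
    """
    total = 0.0
    for query, gold_idx in zip(queries, gold):
        ranking = rank_candidates(query, candidates)
        total += 1.0 / (ranking.index(gold_idx) + 1)
    return total / len(queries)


if __name__ == "__main__":
    # Toy placeholder strings, not data from the paper.
    toy_candidates = ["abcd", "wxyz"]
    toy_queries = ["abcd", "abcd"]
    toy_gold = [0, 1]  # second query is deliberately paired with the worse match
    print(mean_reciprocal_rank(toy_queries, toy_candidates, toy_gold))
```

In a kNN setting, only the top k entries of each ranking would be kept; a gold match falling outside the top k would then contribute a reciprocal rank of 0, which this sketch does not model.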