Sentence-Alignment in Semi-parallel Datasets

Steffen Frenzel, Manfred Stede


Abstract
In this paper, we test sentence alignment on complex, semi-parallel corpora, i.e., different versions of the same text that have been altered to some extent. We evaluate two hypotheses. First, to make alignment algorithms more efficient, we test whether matching pairs can be found in the immediate vicinity of the source sentence, so that it is sufficient to search for paraphrases within a ‘context window’. Second, to improve alignment quality on complex, semi-parallel texts, we test whether segmenting the texts into Elementary Discourse Units (EDUs) allows more precise alignments at this level. Since EDUs are the smallest possible unit for communicating a full proposition, we assume that aligning at this level can improve the overall quality. Both hypotheses are tested and validated with several embedding models on German datasets of varying degrees of parallelism. We present the advantages and disadvantages of the different approaches and outline our next steps.
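The ‘context window’ idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align_in_window`, the parameters `window` and `threshold`, and the assumption that a source sentence at position i is expected near position i in the other version are all hypothetical choices made for illustration. The toy vectors stand in for the output of an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_in_window(src_vecs, tgt_vecs, window=2, threshold=0.8):
    """Align each source sentence to its most similar target sentence,
    searching only target positions within `window` of the source position.

    Assumption (hypothetical): both versions have roughly the same sentence
    order, so source index i is used as the centre of the search window.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    alignments = []
    for i, src in enumerate(src_vecs):
        lo = max(0, i - window)
        hi = min(len(tgt_vecs), i + window + 1)
        best_j, best_sim = None, -1.0
        for j in range(lo, hi):
            sim = cosine(src, tgt_vecs[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Keep the pair only if it is similar enough to count as a match.
        if best_j is not None and best_sim >= threshold:
            alignments.append((i, best_j, best_sim))
    return alignments


# Toy "embeddings": four source sentences and a reordered target version.
src = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
tgt = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

narrow = [(i, j) for i, j, _ in align_in_window(src, tgt, window=1)]
wide = [(i, j) for i, j, _ in align_in_window(src, tgt, window=3)]
print(narrow)
print(wide)
```

In this toy example the narrow window compares far fewer sentence pairs but misses the sentence that was moved from the end to the beginning of the text, while the wide window recovers it. That is the efficiency versus coverage trade-off the first hypothesis addresses.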
Anthology ID:
2025.latechclfl-1.9
Volume:
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Anna Kazantseva, Stan Szpakowicz, Stefania Degaetano-Ortlieb, Yuri Bizzoni, Janis Pagel
Venues:
LaTeCHCLfL | WS
Publisher:
Association for Computational Linguistics
Pages:
87–96
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.latechclfl-1.9/
Cite (ACL):
Steffen Frenzel and Manfred Stede. 2025. Sentence-Alignment in Semi-parallel Datasets. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 87–96, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Sentence-Alignment in Semi-parallel Datasets (Frenzel & Stede, LaTeCHCLfL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.latechclfl-1.9.pdf