Improving Word Alignment Using Semi-Supervised Learning

Zhongtao Miao, Qiyu Wu, Masaaki Nagata, Yoshimasa Tsuruoka


Abstract
Word alignment plays a crucial role in various natural language processing tasks, such as providing cross-lingual signals for sentence embedding, reducing hallucination and omission in machine translation, and facilitating the construction of training data for simultaneous speech translation. Current state-of-the-art approaches usually rely on (1) supervised data and large-scale weakly supervised data constructed from Wikipedia and (2) multilingual Transformer encoder-based models. However, we find that the current state-of-the-art encoder-based method, BinaryAlign, suffers from insufficient labeled data, and we further improve it via self-training on a small amount of parallel data. In addition, given the impressive performance of multilingual large language models on many natural language processing tasks, we also explore the possibility of using these decoder-based large language models as word aligners. We observe that although fine-tuning large language models on labeled data produces acceptable results, augmenting the training with pseudo-labeled data further enhances model performance. Based on these findings, we propose a semi-supervised framework to improve large language model-based word aligners. Experimental results demonstrate that the proposed method, using only a small amount of parallel data, outperforms the current state-of-the-art method on various word alignment datasets.
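The self-training recipe outlined in the abstract (train an aligner on a small gold-aligned set, pseudo-label additional parallel sentences, keep only confident links, and retrain) can be illustrated with the toy Python sketch below. The count-based "aligner" and the confidence threshold here are placeholders chosen purely for illustration; they are not the paper's BinaryAlign model or its fine-tuned multilingual large language models.

```python
# Toy sketch of a self-training loop for word alignment, under the assumption
# that pseudo-labels are filtered by a confidence criterion before retraining.
from collections import Counter

def train_aligner(labeled):
    # "Model" = counts of gold-aligned (source word, target word) pairs.
    counts = Counter()
    for src, tgt, links in labeled:
        for i, j in links:
            counts[(src[i], tgt[j])] += 1
    return counts

def pseudo_label(counts, src, tgt, threshold=1):
    # Predict links for an unlabeled sentence pair, keeping only pairs the
    # current model has seen at least `threshold` times (a crude confidence).
    return {(i, j)
            for i, s in enumerate(src)
            for j, t in enumerate(tgt)
            if counts[(s, t)] >= threshold}

def self_train(labeled, unlabeled, rounds=2):
    # Semi-supervised loop: retrain on gold data plus confident pseudo-labels.
    data = list(labeled)
    for _ in range(rounds):
        model = train_aligner(data)
        data = list(labeled)  # always keep the gold annotations
        for src, tgt in unlabeled:
            links = pseudo_label(model, src, tgt)
            if links:
                data.append((src, tgt, links))
    return train_aligner(data)

if __name__ == "__main__":
    gold = [(["das", "haus", "ist", "klein"],
             ["the", "house", "is", "small"],
             {(0, 0), (1, 1), (2, 2), (3, 3)})]
    unlabeled = [(["das", "haus", "ist", "gross"],
                  ["the", "house", "is", "big"])]
    print(self_train(gold, unlabeled).most_common(3))
```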
Anthology ID: 2025.findings-acl.1020
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19871–19888
URL: https://preview.aclanthology.org/landing_page/2025.findings-acl.1020/
Cite (ACL):
Zhongtao Miao, Qiyu Wu, Masaaki Nagata, and Yoshimasa Tsuruoka. 2025. Improving Word Alignment Using Semi-Supervised Learning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19871–19888, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Improving Word Alignment Using Semi-Supervised Learning (Miao et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-acl.1020.pdf