TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

Benedikt Ebing, Christian Goldschmied, Goran Glavaš


Abstract
In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test—evaluating on noisy source language data translated from the target language—and translate-train—training on noisy target language data translated from the source language—have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
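The abstract describes extracting word alignments from the encoder of a multilingual MT model. As a point of reference, below is a minimal, hypothetical sketch of encoder-based alignment in the style of similarity-based aligners (SimAlign-style mutual-argmax over contextual subword embeddings). The checkpoint `facebook/nllb-200-distilled-600M`, the layer choice, and the function names are illustrative assumptions; the paper's actual extraction procedure may differ.

```python
# Hedged sketch: word alignment from a multilingual MT encoder.
# Assumes an NLLB-200 checkpoint; TransAlign's exact method may differ.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"  # assumed MT model
tok = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL).get_encoder()  # encoder only
encoder.eval()

def embed(words, lang):
    """Encode pre-split words; return subword embeddings plus a
    subword-to-word index map (special tokens dropped)."""
    tok.src_lang = lang
    batch = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state[0]
    word_ids = batch.word_ids(0)  # None marks special tokens
    keep = [i for i, w in enumerate(word_ids) if w is not None]
    return hidden[keep], [word_ids[i] for i in keep]

def align(src_words, tgt_words, src_lang="eng_Latn", tgt_lang="deu_Latn"):
    """Mutual-argmax (intersection) alignment over cosine similarities
    of contextual subword embeddings, mapped back to word indices."""
    s_emb, s_map = embed(src_words, src_lang)
    t_emb, t_map = embed(tgt_words, tgt_lang)
    sim = (torch.nn.functional.normalize(s_emb, dim=-1)
           @ torch.nn.functional.normalize(t_emb, dim=-1).T)
    fwd = sim.argmax(dim=1)  # best target subword per source subword
    bwd = sim.argmax(dim=0)  # best source subword per target subword
    pairs = {(s_map[i], t_map[j]) for i, j in enumerate(fwd.tolist())
             if bwd[j].item() == i}  # keep only mutual matches
    return sorted(pairs)

print(align("The cat sleeps".split(), "Die Katze schläft".split()))
```

Given such word alignments, label projection for translate-train or translate-test reduces to copying each source token's label to its aligned target token(s), which is the step for which the abstract reports TransAlign's gains over popular WA and non-WA-based projection methods.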
Anthology ID:
2025.findings-emnlp.1129
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20736–20749
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1129/
DOI:
10.18653/v1/2025.findings-emnlp.1129
Cite (ACL):
Benedikt Ebing, Christian Goldschmied, and Goran Glavaš. 2025. TransAlign: Machine Translation Encoders are Strong Word Aligners, Too. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20736–20749, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
TransAlign: Machine Translation Encoders are Strong Word Aligners, Too (Ebing et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1129.pdf
Checklist:
2025.findings-emnlp.1129.checklist.pdf