Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment

Shoto Nishida, Daiki Matsui, Takashi Ninomiya, Isao Goto, Akihiro Tamura


Abstract
This study proposes a method for learning subword correspondences in parallel sentence pairs using the EM algorithm. Conventional neural machine translation typically employs trained subword segmentation models. However, since existing methods do not consider parallel relationships, inconsistencies in subword segmentation between the source and target languages may hinder translation model training. Our approach directly models subword correspondences in parallel corpora, thereby improving segmentation consistency across languages. Experiments on multiple machine translation tasks confirm that the proposed method improves translation accuracy on many of them.
Anthology ID:
2026.eacl-srw.40
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
528–534
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.40/
Cite (ACL):
Shoto Nishida, Daiki Matsui, Takashi Ninomiya, Isao Goto, and Akihiro Tamura. 2026. Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 528–534, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment (Nishida et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.40.pdf