Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment
Shoto Nishida, Daiki Matsui, Takashi Ninomiya, Isao Goto, Akihiro Tamura
Abstract
This study proposes a method for learning subword correspondences in parallel sentence pairs using the EM algorithm. Conventional neural machine translation typically employs trained subword segmentation models. However, because existing methods do not consider the parallel relationship between sentences, inconsistencies in segmentation between the source and target languages may hinder translation model training. Our approach directly models subword correspondences in parallel corpora, thereby improving segmentation consistency across languages. Experiments on multiple machine translation tasks show that the proposed method improves translation accuracy on many of them.
- Anthology ID:
- 2026.eacl-srw.40
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 528–534
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.40/
- Cite (ACL):
- Shoto Nishida, Daiki Matsui, Takashi Ninomiya, Isao Goto, and Akihiro Tamura. 2026. Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 528–534, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment (Nishida et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.40.pdf