Kanade Nonomura
2026
Mitigating Language Bias in Multilingual Sentence Embeddings for Cross-Lingual Similarity Estimation
Kanade Nonomura | Keita Fukushima | Risa Kondo | Tomoyuki Kajiwara
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
We disentangle multilingual sentence embeddings into language-dependent and language-agnostic components, leveraging the latter to improve cross-lingual similarity estimation. Previous studies on this approach have trained disentanglers by combining intra-component constraints, which either align or disalign language-dependent embeddings or language-agnostic embeddings, with inter-component constraints across both embeddings. However, when and how these constraints are effective remains unclear. Our experiments on sentence similarity estimation and machine translation quality estimation revealed that while intra-component constraints and the combination of both constraints are effective for encoder-based multilingual sentence embeddings, inter-component constraints are effective for decoder-based ones. Furthermore, our detailed analysis revealed distinct roles: intra-component constraints improve uniformity within the embedding space, while inter-component constraints enhance cross-lingual alignment between parallel sentences.
Disentangling Meaning and Language Components in Diverse Multilingual Sentence Embeddings
Kanade Nonomura | Keita Fukushima | Risa Kondo | Tomoyuki Kajiwara
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
We disentangle multilingual sentence embeddings into language-dependent and language-agnostic components, leveraging the latter to improve cross-lingual similarity estimation. Previous studies focused on encoder-based approaches that use only the input sentence; in contrast, this study examines the effectiveness of disentanglement methods across a broader range of sentence embeddings, including decoder-based approaches and those that utilize prompts. Experimental results demonstrate that embedding disentanglement is effective for a wide variety of sentence embeddings.