Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach

Lúa Santamaría Montesinos; Saúl Buján; Daniel Bardanca; Pablo Gamallo

Abstract

Idiomatic expressions are a well-known challenge for neural machine translation, affecting both traditional sequence-to-sequence (seq2seq) models and large language models (LLMs). This paper presents a systematic approach to improving idiom translation between Spanish and Galician. First, we build a high-quality parallel dataset of idioms manually aligned across both languages. Then, we automatically extend this dataset into a large synthetic parallel corpus using LLMs, following a strategy that prioritizes the idioms most frequently observed in authentic corpora. This augmented dataset is used to retrain a seq2seq translation model. We evaluate the resulting system and compare it both to the baseline model trained without idiom data and to state-of-the-art LLM-based translators such as SalamandraTA. Results show that idiom translation improves significantly after retraining, alongside a slight gain in the model’s overall performance.
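The frequency-prioritized augmentation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rank_idioms_by_frequency` and `plan_synthetic_examples`, the exact-substring matching, and the proportional budget rule are all assumptions made for illustration. In practice, idioms would likely need lemma-aware matching to catch inflected forms.

```python
from collections import Counter


def rank_idioms_by_frequency(idioms, corpus_sentences):
    """Rank idioms by how many corpus sentences contain them.

    Assumption: case-insensitive exact substring matching, which
    misses inflected variants (e.g. conjugated verbs).
    """
    counts = Counter()
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for idiom in idioms:
            if idiom.lower() in lowered:
                counts[idiom] += 1
    # Most frequent first; sorted() is stable, so ties keep input order.
    ranked = sorted(idioms, key=lambda idiom: -counts[idiom])
    return ranked, counts


def plan_synthetic_examples(ranked, counts, budget, min_per_idiom=1):
    """Split a budget of synthetic sentences in proportion to frequency.

    Hypothetical rule: every idiom gets at least `min_per_idiom`
    sentences, so the plan may slightly exceed `budget`.
    """
    total = sum(counts[idiom] for idiom in ranked)
    plan = {}
    for idiom in ranked:
        share = counts[idiom] / total if total else 0.0
        plan[idiom] = max(min_per_idiom, round(budget * share))
    return plan
```

The resulting plan could then drive how many synthetic sentence pairs an LLM is asked to generate for each idiom, so that frequent idioms receive more training examples than rare ones.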

Anthology ID: 2026.propor-1.99
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 980–987
URL: https://preview.aclanthology.org/ingest-dnd/2026.propor-1.99/
Cite (ACL): Lúa Santamaría Montesinos, Saúl Buján, Daniel Bardanca, and Pablo Gamallo. 2026. Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 980–987, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach (Montesinos et al., PROPOR 2026)
PDF: https://preview.aclanthology.org/ingest-dnd/2026.propor-1.99.pdf
