A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Jenny Kunz, Anja Jarochenko, Marcel Bollmann


Abstract
Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. The dataset includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations; even without that context, however, models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.
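The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a model's preference is measured, so this example assumes one common approach: the model "prefers" whichever Swedish variant it assigns the higher log-probability, scored once with the English source in the prompt and once without it. The class `ScoredPair`, the function `idiomatic_preference_rate`, and all numbers in `TOY_PAIRS` are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredPair:
    """Log-probabilities for one sentence pair (all values here are made up).

    Assumption: each variant has been scored by some language model twice,
    once with the English source sentence in the prompt and once without.
    """
    translationese_with_source: float
    idiomatic_with_source: float
    translationese_without_source: float
    idiomatic_without_source: float


def idiomatic_preference_rate(pairs: List[ScoredPair], with_source: bool) -> float:
    """Fraction of pairs where the idiomatic variant gets the higher score."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    wins = 0
    for p in pairs:
        if with_source:
            idiomatic, translationese = p.idiomatic_with_source, p.translationese_with_source
        else:
            idiomatic, translationese = p.idiomatic_without_source, p.translationese_without_source
        # Ties count as no preference for the idiomatic variant.
        if idiomatic > translationese:
            wins += 1
    return wins / len(pairs)


# Hypothetical scores, chosen only to make the two conditions differ.
TOY_PAIRS = [
    ScoredPair(-12.0, -14.5, -13.0, -12.5),
    ScoredPair(-9.0, -8.0, -9.5, -8.2),
    ScoredPair(-20.0, -22.0, -21.0, -21.5),
    ScoredPair(-15.0, -16.0, -15.5, -15.1),
]

if __name__ == "__main__":
    print("with source:   ", idiomatic_preference_rate(TOY_PAIRS, with_source=True))
    print("without source:", idiomatic_preference_rate(TOY_PAIRS, with_source=False))
```

Under this assumed setup, a higher rate in the source-free condition than in the source-conditioned one would correspond to the pattern the abstract reports. In practice, raw log-probabilities may need length normalization when the two variants differ in token count.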
Anthology ID:
2026.lrec-main.690
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
8767–8779
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.690/
Cite (ACL):
Jenny Kunz, Anja Jarochenko, and Marcel Bollmann. 2026. A Dataset for Probing Translationese Preferences in English-to-Swedish Translation. International Conference on Language Resources and Evaluation, main:8767–8779.
Cite (Informal):
A Dataset for Probing Translationese Preferences in English-to-Swedish Translation (Kunz et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.690.pdf