“Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, Cory Holland
Abstract
We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0–3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
- Anthology ID: 2026.gem-main.6
- Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
- Month: July
- Year: 2026
- Address: San Diego, California, USA
- Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
- Venues: GEM | WS
- Publisher: Association for Computational Linguistics
- Pages: 52–76
- URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.6/
- Cite (ACL): Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, and Cory Holland. 2026. “Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 52–76, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal): “Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs (Van Doren et al., GEM 2026)
- PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.6.pdf