Abstract
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.- Anthology ID:
- 2024.emnlp-main.278
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4837–4848
- URL:
- https://aclanthology.org/2024.emnlp-main.278
- DOI:
- 10.18653/v1/2024.emnlp-main.278
- Cite (ACL):
- Josh Barua, Sanjay Subramanian, Kayo Yin, and Alane Suhr. 2024. Using Language Models to Disambiguate Lexical Choices in Translation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4837–4848, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Using Language Models to Disambiguate Lexical Choices in Translation (Barua et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.278.pdf