Théo Dehaze


2025

Command-A-Translate: Raising the Bar of Machine Translation with Difficulty Filtering
Tom Kocmi | Arkady Arkhangorodsky | Alexandre Berard | Phil Blunsom | Samuel Cahyawijaya | Théo Dehaze | Marzieh Fadaee | Nicholas Frosst | Matthias Galle | Aidan Gomez | Nithya Govindarajan | Wei-Yin Ko | Julia Kreutzer | Kelly Marchisio | Ahmet Üstün | Sebastian Vincent | Ivan Zhang
Proceedings of the Tenth Conference on Machine Translation

We present Command A Translate, an LLM-based machine translation model built on Cohere’s Command A. It reaches state-of-the-art machine translation quality via direct preference optimization. Our meticulously designed data preparation pipeline emphasizes robust quality control and a novel difficulty filtering step, a key innovation that distinguishes Command A Translate. Furthermore, we extend our model and participate in WMT with a system (CommandA-WMT) that uses two models and adds post-editing steps: step-by-step reasoning and limited Minimum Bayes Risk decoding.
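The abstract mentions limited Minimum Bayes Risk (MBR) decoding as a post-editing step. As background, a minimal sketch of standard MBR selection over a candidate list is shown below; the `token_f1` utility is a hypothetical stand-in for illustration only (the paper's actual utility metric and candidate-generation setup are not specified here).

```python
# Minimal sketch of Minimum Bayes Risk (MBR) decoding over sampled
# candidate translations: pick the candidate with the highest average
# utility against all other candidates, used as pseudo-references.

def mbr_select(candidates, utility):
    """Return the candidate maximizing expected utility over the sample."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def token_f1(hyp, ref):
    """Toy utility: token-set F1 overlap. A real system would plug in a
    learned MT metric here (an assumption, not the paper's choice)."""
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    overlap = len(h & r)
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec) if (p + rec) else 0.0

candidates = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "the cat is sitting",
]
print(mbr_select(candidates, token_f1))
```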

2024

Understanding and Mitigating Language Confusion in LLMs
Kelly Marchisio | Wei-Yin Ko | Alexandre Berard | Théo Dehaze | Sebastian Ruder
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user’s desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion, and that even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT, and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation.
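To illustrate the kind of check such a benchmark performs, here is a minimal sketch of a language-confusion detector: given a target language and a model response, flag the response if its text is detected as a different language. The use of the `langdetect` package, the line-level granularity, and the length threshold are all assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Sketch of a line-level language-confusion check: flag a response if
# any non-trivial line is detected as a language other than the target.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def is_language_confused(response: str, target_lang: str) -> bool:
    """Return True if any sufficiently long line of `response` is
    detected as a language other than `target_lang`."""
    for line in response.splitlines():
        line = line.strip()
        if len(line) < 20:  # skip short lines where detection is unreliable
            continue
        if detect(line) != target_lang:
            return True
    return False

# Expected target language is French ("fr"):
print(is_language_confused("Bonjour, comment puis-je vous aider aujourd'hui ?", "fr"))  # False
print(is_language_confused("Sure! Here is the answer in English instead of French.", "fr"))  # True
```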