GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

Yunsu Kim; Kaden Uhlig; Joern Wuebker

Abstract

Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package. We also release the code used in our experiments.

Anthology ID: 2026.mellm-1.13
Volume: Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month: July
Year: 2026
Address: San Diego, United States
Editors: Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues: MeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 140–148
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.13/
Cite (ACL): Yunsu Kim, Kaden Uhlig, and Joern Wuebker. 2026. GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 140–148, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation (Kim et al., MeLLM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.13.pdf
